You're reading from Vector Search for Practitioners with Elastic

Product type: Book
Published in: Nov 2023
Publisher: Packt
ISBN-13: 9781805121022
Edition: 1st Edition
Authors (2):
Bahaaldine Azarmi

Bahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.

Jeff Vestal

Jeff Vestal has a rich background spanning over a decade in financial trading firms and extensive experience with Elasticsearch. He offers a unique blend of operational acumen, engineering skills, and machine learning expertise. As a Principal Customer Enterprise Architect, he excels at crafting innovative solutions, leveraging Elasticsearch's advanced search capabilities, machine learning features, and generative AI integrations, adeptly guiding users to transform complex data challenges into actionable insights.

The Power of Vectors and Embedding in Bolstering Cybersecurity

In the face of ever-evolving cybersecurity threats, a constant influx of information demands innovative tools and methods for sifting through vast datasets. The challenge becomes particularly daunting when determining the nuances and intentions behind the text.

For instance, as phishing attacks become increasingly sophisticated, how can we distinguish what is malicious from what is benign, especially when the two seem so alike? Enter Elastic Learned Sparse EncodeR (ELSER): a potent tool designed to understand text at a semantic level and discern the patterns and intents beneath the surface.

This chapter dives deep into ELSER, a pre-trained model provided by Elastic that harnesses the power of vectors without burdening the user with their intricacies. We will address the following key topics:

  • Overview of ELSER, where we will delve into its essence and role in semantic search
  • Handling data with ELSER...

Technical requirements

In this chapter, you are going to set up your Elastic environment to use ELSER. For this, you will need to create an Elastic Cloud account at https://cloud.elastic.co/registration.

ELSER is a commercial feature available under the Platinum license. The good news is that Elastic provides a trial period to test it. Throughout your trial, you will be able to ask the Customer Engineering team questions through the chatbot, and they’ll guide you through the experience.

Understanding the importance of email phishing detection

Before getting into ELSER and its application in cybersecurity, we will first look closely at what phishing is, and then move on to more advanced detection techniques based on semantic search.

What is phishing?

Phishing is a common type of cyber-attack in which the attacker poses as a reliable entity in electronic communication to obtain sensitive information: credentials such as usernames and passwords, payment information such as credit card numbers, and personal identifiers such as social security numbers.

Email spoofing is the main method used to phish. With the increase in communication apps, phishing also happens on platforms that offer instant messaging and often directs users to enter personal information on a fake website that matches the look and feel of a legitimate site.

Phishing directly affects thousands of people each day. Cybercriminals use social engineering techniques to trick unsuspecting individuals and organizations into giving up sensitive...

Introducing ELSER

ELSER is a groundbreaking tool that brings the power of machine learning to semantic search. It’s capable of discerning the underlying meaning and intent of the text. This is particularly valuable in tasks such as email phishing detection, where understanding the content of an email is crucial for identifying threats.

Imagine trying to understand a corpus of text. Traditional methods might involve painstakingly analyzing each sentence, looking up unfamiliar words, and trying to piece together the overall meaning. This can be a slow and laborious process. ELSER, on the other hand, can instantly provide a detailed analysis of the text, highlighting the key themes and explaining the subtle nuances. What ELSER does uniquely is perform text expansion, creating a set of tokens that form a semantic space, allowing for a richer understanding of any text field it processes.
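
To make the idea of text expansion more concrete, here is a minimal toy sketch in Python. It is not how ELSER is implemented: the tokens and weights are invented for illustration, and the `sparse_score` helper is a hypothetical, simplified stand-in for the scoring Elasticsearch performs. It only shows the general shape of the approach, where each text becomes a sparse map of weighted tokens and two texts are compared through the tokens they share.

```python
# Toy illustration of text expansion. Every token and weight below is
# invented for this example; a real model such as ELSER learns them.


def sparse_score(query_tokens: dict[str, float], doc_tokens: dict[str, float]) -> float:
    """Sum the products of the weights for tokens present in both maps."""
    shared = query_tokens.keys() & doc_tokens.keys()
    return sum(query_tokens[token] * doc_tokens[token] for token in shared)


# Hypothetical expansion of the query "verify your account":
# note the related tokens that do not appear in the query text itself.
query = {"verify": 1.5, "account": 1.0, "login": 0.5, "password": 0.5}

# Hypothetical expansions of two emails.
phishing_email = {"password": 2.0, "account": 1.0, "urgent": 1.0, "click": 0.5}
meeting_email = {"meeting": 2.0, "agenda": 1.0, "schedule": 1.0}

phishing_score = sparse_score(query, phishing_email)
meeting_score = sparse_score(query, meeting_email)

print(f"phishing email score: {phishing_score}")
print(f"meeting email score:  {meeting_score}")
```

The phishing-style email scores higher because its expanded tokens overlap with the expanded query, even though the word "verify" never appears in it. The meeting email shares no tokens with the query and scores zero.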

One of the standout features of ELSER is its user-friendly nature. ELSER offers an out...

The role of ELSER in GenAI

ELSER is not just a tool in isolation; it’s part of a broader movement in the tech industry toward democratizing advanced technologies such as vector search, large language models (LLMs), and generative artificial intelligence (GenAI). It’s a bit like the advent of personal computers in the 1980s, which brought computing power into the hands of everyday users, sparking a revolution in how we work, communicate, and entertain ourselves.

Tools such as ELSER are making advanced AI capabilities accessible to a wider range of users. Vector search, which involves converting text into high-dimensional vectors and searching for similar vectors, was once a complex process that required specialized knowledge and resources. Now, with ELSER, users can leverage the power of vector search without needing to understand the underlying complexities.
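
For readers curious about what "searching for similar vectors" means in practice, the following sketch shows the core idea with plain Python. The three-dimensional vectors are made up for readability (real embeddings have hundreds of dimensions), and the `cosine_similarity` and `nearest` helpers are illustrative names, not part of any Elastic API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], corpus: dict[str, list[float]]) -> str:
    """Return the corpus key whose vector is most similar to the query."""
    return max(corpus, key=lambda text: cosine_similarity(query_vec, corpus[text]))


# Invented embeddings: in a real system a model would produce these.
corpus = {
    "reset your password now": [0.9, 0.1, 0.0],
    "quarterly earnings report": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "update your credentials"

best_match = nearest(query_vec, corpus)
print(best_match)
```

A brute-force scan like this works for a handful of documents. At scale, search engines rely on specialized index structures to avoid comparing the query against every stored vector.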

LLMs such as GPT have made headlines with their ability to generate human-like text, but their use has been...

Introduction to the Enron email dataset (ham or spam)

The Enron dataset is a large collection of email data that has become a staple in the world of text analysis and machine learning. It’s like a vast library, filled with a diverse range of texts that offers a wealth of insights for those who know how to interpret them.

This dataset was originally made public during the legal investigation into Enron Corporation, a US energy company that collapsed in 2001 due to widespread corporate fraud. The dataset contains over 600,000 emails from about 150 users, mostly senior management of Enron, making it one of the few publicly available collections of real emails of its size.

For our purposes, the emails contained in the Enron dataset have been labeled as ham (legitimate) or spam (phishing). This labeling provides a valuable ground truth, allowing us to train and test models for phishing detection. Labeling tells us which emails are safe and which are dangerous, helping us to...
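
As a small illustration of why labeled data matters, the sketch below compares a detector's predictions against ham/spam labels and computes precision and recall. The labels and predictions are invented sample values, not taken from the Enron dataset, and `precision_recall` is a hypothetical helper written for this example.

```python
def precision_recall(labels, predictions, positive="spam"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for truth, guess in pairs if truth == positive and guess == positive)
    false_pos = sum(1 for truth, guess in pairs if truth != positive and guess == positive)
    false_neg = sum(1 for truth, guess in pairs if truth == positive and guess != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented ground truth and detector output for five emails.
labels = ["spam", "ham", "spam", "ham", "spam"]
predictions = ["spam", "ham", "ham", "spam", "spam"]

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision tells us how many flagged emails were truly spam, while recall tells us how many of the spam emails were caught. Without the ground-truth labels, neither number could be computed.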

Seeing ELSER in action

In this part, we are going to walk you through how easy it is to get started with ELSER and see significant results right out of the box. First, we will go through the hardware requirements. Then, we will look at preparing the index. Finally, we are going to fire a couple of queries to illustrate the power of ELSER.
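
To give a feel for the index preparation and query steps, here is a hedged sketch that builds the three request bodies involved as Python dictionaries: an ingest pipeline, an index mapping, and a search query. It is modeled on the ELSER v1 setup as documented for Elasticsearch 8.8 to 8.10 and may differ in your version, so check it against the documentation for your stack. The pipeline name, index name, and the `body` and `label` fields are assumptions made for this example, and the code only prints JSON rather than calling a cluster.

```python
import json

# Assumptions for illustration only: adjust to your own deployment.
MODEL_ID = ".elser_model_1"              # ELSER v1 model ID; newer stack versions use other IDs
PIPELINE_NAME = "elser-emails-pipeline"  # hypothetical pipeline name
INDEX_NAME = "enron-emails"              # hypothetical index name


def build_pipeline() -> dict:
    """Body for PUT _ingest/pipeline/<PIPELINE_NAME>: run ELSER on the email body."""
    return {
        "processors": [
            {
                "inference": {
                    "model_id": MODEL_ID,
                    "target_field": "ml",
                    # Map our hypothetical `body` field to the model's input field.
                    "field_map": {"body": "text_field"},
                    "inference_config": {
                        "text_expansion": {"results_field": "tokens"}
                    },
                }
            }
        ]
    }


def build_mapping() -> dict:
    """Body for PUT <INDEX_NAME>: store the expanded tokens as rank_features."""
    return {
        "mappings": {
            "properties": {
                "ml.tokens": {"type": "rank_features"},
                "body": {"type": "text"},
                "label": {"type": "keyword"},  # hypothetical ham/spam label field
            }
        }
    }


def build_query(text: str, size: int = 5) -> dict:
    """Body for GET <INDEX_NAME>/_search: semantic search with text_expansion."""
    return {
        "size": size,
        "query": {
            "text_expansion": {
                "ml.tokens": {"model_id": MODEL_ID, "model_text": text}
            }
        },
    }


pipeline = build_pipeline()
mapping = build_mapping()
query = build_query("urgent: confirm your bank account details")

for name, body in [("pipeline", pipeline), ("mapping", mapping), ("query", query)]:
    print(f"--- {name} ---")
    print(json.dumps(body, indent=2))
```

Keeping the bodies as plain dictionaries makes them easy to paste into Kibana Dev Tools or to pass to whichever Elasticsearch client you prefer.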

Hardware consideration

The ELSER documentation (https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html#elser-hw-benchamrks) presents benchmarks run on representative data, which highlight KPIs for inference, indexing, and query latency.

You will see there that the hardware configuration significantly impacts the performance of ELSER. Here are some key takeaways to consider for sizing your infrastructure:

  • CPU and memory: The more powerful the ML node (in terms of CPU and memory), the better the performance. For instance, an ML node with 16 GB of memory and 8 vCPUs performs better than one with...

Summary

In this chapter, we introduced ELSER, a pre-trained model that leverages vectors without requiring users to manage the vectorization process. It’s an out-of-the-box model that delivers immediate value for semantic search. We applied ELSER to the challenge of phishing attacks, with the goal of limiting their impact. You should now be able to build your own pipeline to load data into Elasticsearch and start building applications that leverage ELSER, whether in cybersecurity or beyond.

In the next chapter, we are going to go a step further in leveraging vectors by building a retrieval-augmented generation (RAG) application.
