
You're reading from Machine Learning with the Elastic Stack - Second Edition

Product type: Book
Published in: May 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781801070034
Edition: 2nd Edition
Authors (3):
Rich Collier

Rich Collier is a solutions architect at Elastic. Joining the Elastic team from the Prelert acquisition, Rich has over 20 years' experience as a solutions architect and pre-sales systems engineer for software, hardware, and service-based solutions. Rich's technical specialties include big data analytics, machine learning, anomaly detection, threat detection, security operations, application performance management, web applications, and contact center technologies. Rich is based in Boston, Massachusetts.

Camilla Montonen

Camilla Montonen is a Senior Machine Learning Engineer at Elastic.

Bahaaldine Azarmi

Bahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.

Chapter 9: Introducing Data Frame Analytics

In the first section of this book, we took an in-depth tour of anomaly detection, the first machine learning capability to be directly integrated into the Elastic Stack. In this chapter and the following one, we will dive into the newer machine learning features integrated into the stack. These include outlier detection, a novel unsupervised learning technique for detecting unusual data points in non-time-series indices, as well as two supervised learning features: classification and regression.

Supervised learning algorithms use labeled datasets – for example, a dataset describing various aspects of tissue samples along with whether or not the tissue is malignant – to learn a model. This model can then be used to make predictions on previously unseen data points (or tissue samples, to continue our example). When the target of prediction is a discrete variable or a category such as a malignant or non-malignant tissue...

Technical requirements

The material in this chapter requires Elasticsearch version 7.9 or above and Python 3.7 or above. Code samples and snippets required for this chapter will be added under the folder Chapter 9 - Introduction to Data Frame Analytics in the book's GitHub repository (https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition/tree/main/Chapter%209%20-%20Introduction%20to%20Data%20Frame%20Analytics). Where an example requires a specific newer release of Elasticsearch, this will be mentioned before the example is presented.

Learning how to use transforms

In this section, we are going to dive right into the world of transforming stream- or event-based data, such as logs, into an entity-centric index.

Why are transforms useful?

Think about the most common data types that are ingested into Elasticsearch. These will often be documents recording some kind of time-based or sequential event, for example, logs from a web server, customer purchases from a web store, comments published on a social media platform, and so forth.

While this kind of data is useful for understanding the behavior of our systems over time and is perfect for use with technologies such as anomaly detection, it is harder to make stream- or event-based datasets work with Data Frame Analytics features without first aggregating or transforming them in some way. For example, consider an e-commerce store that records purchases made by customers. Over a year, there may be tens or hundreds of transactions for each customer. If the e-commerce...
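The idea of turning event-based records into an entity-centric view can be sketched in plain Python. This is only an illustration of the concept, not how Elasticsearch implements transforms; the field names `customer_id` and `order_total` and the sample purchases are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase events: one record per transaction (event-centric).
purchases = [
    {"customer_id": "c1", "order_total": 20.0},
    {"customer_id": "c2", "order_total": 5.0},
    {"customer_id": "c1", "order_total": 30.0},
]


def pivot_by_customer(events):
    """Group events by customer and compute per-customer summary values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        key = event["customer_id"]
        totals[key] += event["order_total"]
        counts[key] += 1
    # One output record per customer (entity-centric).
    return {
        key: {
            "order_count": counts[key],
            "total_spent": totals[key],
            "average_order": totals[key] / counts[key],
        }
        for key in totals
    }


summary = pivot_by_customer(purchases)
```

Each customer now has a single summary record, which is the shape of data that per-entity analysis typically expects.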

Using Painless for advanced transform configurations

As we have seen in many of the previous sections, the built-in pivot and aggregation options allow us to analyze and interrogate our data in various ways. However, for more custom or advanced use cases, the built-in functions may not be flexible enough. For these use cases, we will need to write custom pivot and aggregation configurations. The flexible scripting language that is built into Elasticsearch, Painless, allows us to do this.

In this section, we will introduce Painless, illustrate some tools that are useful when working with Painless, and then show how Painless can be applied to create custom Transform configurations.
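As a rough sketch of what a custom aggregation might look like, the Python snippet below builds a pivot configuration whose aggregation is a `scripted_metric` written in Painless. The field names `customer_id` and `order_total` and the helper `scripted_total_aggregation` are assumptions made for illustration, and the script simply re-implements a sum, which the built-in `sum` aggregation already covers.

```python
import json

# Hypothetical field name; adjust to the mapping of your source index.
ORDER_TOTAL_FIELD = "order_total"


def scripted_total_aggregation(field):
    """Return a scripted_metric aggregation that sums `field` with Painless."""
    return {
        "scripted_metric": {
            # Runs once per shard before any documents are collected.
            "init_script": "state.total = 0.0",
            # Runs once per matching document; assumes every document
            # has a numeric value for the field.
            "map_script": f"state.total += doc['{field}'].value",
            # Runs once per shard after collection; returns the shard total.
            "combine_script": "return state.total",
            # Runs once over the list of per-shard totals.
            "reduce_script": (
                "double sum = 0; for (t in states) { sum += t } return sum"
            ),
        }
    }


# Pivot section of a transform: group by customer, aggregate with the script.
pivot = {
    "group_by": {"customer_id": {"terms": {"field": "customer_id"}}},
    "aggregations": {"total_spent": scripted_total_aggregation(ORDER_TOTAL_FIELD)},
}

print(json.dumps(pivot, indent=2))
```

Building the configuration as a Python dictionary keeps the Painless snippets in one place and makes the structure easy to inspect before it is sent to Elasticsearch.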

Introducing Painless

Painless is a scripting language that is built into Elasticsearch. We will take a look at Painless in terms of variables, control flow constructs, operations, and functions. These are the basic building blocks that will help you develop your own custom scripts to use with transforms...

Working with Python and Elasticsearch

In recent years, Python has become the dominant language for many data-intensive projects. Thanks to its easy-to-use machine learning and data analysis libraries, many data scientists and data engineers now rely heavily on Python for most of their daily work. Therefore, no discussion of machine learning in the Elastic Stack would be complete without exploring how a data analysis professional can work with the Elastic Stack in Python.

In this section, we will take a look at the three official Python Elasticsearch clients, understand the differences between them, and discuss when one might want to use one over the others. We will demonstrate how usage of Elastic Stack ML can be automated by using Elasticsearch clients. In addition, we will take a deeper look at Eland, the new data science native client that enables efficient in-memory data analysis backed by Elasticsearch. After exploring how Eland works, we will illustrate how...

Summary

In this chapter, we have dipped our toes into the world of Data Frame Analytics, a whole new branch of machine learning and data transformation tools that unlock powerful ways to use the data you have stored in Elasticsearch to solve problems. In addition to giving an overview of the new unsupervised and supervised machine learning techniques that we will cover in future chapters, we have studied three important topics: transforms, using the Painless scripting language, and the integration between Python and Elasticsearch. These topics will form the foundation of our future work in the following chapters.

In our exposition on transforms, we studied the two components – the pivot and aggregations – that make up a transform, as well as the two possible modes in which to run a transform: batch and continuous. A batch transform runs only once and generates a transformation on a snapshot of the source index at a particular point in time. This works perfectly for...
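The two modes can be contrasted with a small sketch of a transform request body. It assumes hypothetical index and field names (`customer_id`, `order_total`, `order_date`) and a hypothetical helper `build_transform_body`. In this sketch, the only difference between the modes is the `sync` block, with an optional `frequency`, that a continuous transform carries.

```python
def build_transform_body(source_index, dest_index, continuous=False):
    """Build a transform body; field names here are illustrative only."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            # The pivot: which entity each output document describes.
            "group_by": {"customer_id": {"terms": {"field": "customer_id"}}},
            # The aggregations: what is computed for each entity.
            "aggregations": {"total_spent": {"sum": {"field": "order_total"}}},
        },
    }
    if continuous:
        # A continuous transform needs a time field to detect new data.
        body["sync"] = {"time": {"field": "order_date", "delay": "60s"}}
        # How often the transform checks the source index for changes.
        body["frequency"] = "5m"
    return body


batch_body = build_transform_body("purchases", "customer-summary")
continuous_body = build_transform_body(
    "purchases", "customer-summary-live", continuous=True
)
```

A body like this would be sent to the create transform endpoint under a transform ID of your choosing; the batch variant simply omits the `sync` block.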

Further reading

For more information on the Jupyter ecosystem and, in particular, the Jupyter Notebook, have a look at the comprehensive documentation of Project Jupyter, here: https://jupyter.org/documentation.

If you are new to Python development and would like to have an overview of the language ecosystem and the various tools that are available, have a look at The Hitchhiker's Guide to Python, here: https://docs.python-guide.org/.

To learn more about the pandas project, please see the official documentation here: https://pandas.pydata.org/.

For more information on the Painless embedded scripting language, please see the official Painless language specification, here: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-lang-spec.html.
