Chapter 5: Running Machine Learning Jobs on Elasticsearch

In the previous chapter, we looked at how large volumes of data can be managed and leveraged for analytical insight, and at how changes in data can be detected and responded to using rules (also called alerts). This chapter explores the use of machine learning techniques to look for unknowns in data and to understand trends that cannot be captured using a rule-based approach.

Machine learning is a dense subject with a wide range of theoretical and practical concepts to cover. In this chapter, we will focus on some of the more important aspects of running machine learning jobs on Elasticsearch. Specifically, we will cover the following:

  • Preparing data for machine learning
  • Running single- and multi-metric anomaly detection jobs on time series data
  • Classifying data using supervised machine learning models
  • Running machine learning inference on incoming data

Technical requirements

To use machine learning features, ensure that the Elasticsearch cluster contains at least one node with the role ml. This enables the running of machine learning jobs on the cluster:

  • If you're running with default settings on a single node, this role should already be enabled, and no further configuration is necessary.
  • If you're running nodes with custom roles, ensure the role is added to elasticsearch.yml, as follows:
    node.roles: [data, ml]
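You can confirm which nodes carry the ml role by querying the cat nodes API; nodes with the machine learning role include l in the node.role column. For example:
    GET _cat/nodes?v=true&h=name,node.role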

The value of running machine learning on Elasticsearch

Elasticsearch is a powerful tool when it comes to storing, searching, and aggregating large volumes of data. Dashboards and visualizations help with user-driven interrogation and exploration of data, while tools such as Watcher and Kibana alerting allow users to take automatic action when data changes in a predefined or expected manner.

However, many data sources contain trends or insights that are hard to capture as a predefined rule or query. Consider the following example:

  • A logging platform collects application logs (using an agent) from about 5,000 endpoints across an environment.
  • The application generates a log line for every transaction executed as soon as the transaction completes.
  • After a software patch, a small subset of the endpoints can intermittently and temporarily fail to write logs successfully. The machines don't fail entirely, as the failure is intermittent.
  • ...

Preparing data for machine learning jobs

For machine learning jobs to analyze document field values when building baselines and identifying anomalies, it is important to ensure that the index mappings are accurately defined. Furthermore, it is useful to parse complex fields (using ETL tools or ingest pipelines) into their own subfields for use in machine learning jobs.
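For instance, a raw message field can be split into dedicated subfields with an ingest pipeline before indexing. The following is a minimal sketch only; the pipeline name, field names, and dissect pattern are illustrative and not taken from the dataset used later in this chapter:
    PUT _ingest/pipeline/parse-webapp-logs
    {
      "processors": [
        {
          "dissect": {
            "field": "message",
            "pattern": "%{client_ip} %{http_method} %{url_path} %{status_code}"
          }
        }
      ]
    }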

The machine learning application provides useful functionality to visualize the index you're looking to run jobs on, and ensure mappings and values are as expected. The UI lists all fields, data types, and some sample values where appropriate.
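The same checks can also be made directly against Elasticsearch if preferred; for example, assuming the data view is backed by an index named webapp, the mappings and field capabilities can be inspected as follows:
    GET webapp/_mapping

    GET webapp/_field_caps?fields=*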

Navigate to the machine learning app on Kibana and perform the following steps:

  1. Click on the Data Visualizer tab.
  2. Select the webapp data view you created in the previous section.
  3. Click on Use full webapp data to automatically update the time range filter for the full duration of your dataset.
  4. Inspect the fields in the index and confirm all...

Looking for anomalies in time series data

Given the logs in the webapp index, there is concern that potentially undesired activity has been taking place on the application. This could be completely benign or have malicious consequences. This section will look at how a series of machine learning jobs can be implemented to better understand and analyze the activity in the logs.

Looking for anomalous event rates in application logs

We will use a single-metric machine learning job to build a baseline for the number of log events generated by the application during normal operation.
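The job can be configured through the Kibana UI (as described in the steps that follow), or defined directly with the anomaly detection APIs. A rough sketch of the API approach, assuming a time field of @timestamp, a 15-minute bucket span, and an index named webapp (all assumptions for illustration):
    PUT _ml/anomaly_detectors/webapp-event-rate
    {
      "analysis_config": {
        "bucket_span": "15m",
        "detectors": [ { "function": "count" } ]
      },
      "data_description": { "time_field": "@timestamp" }
    }

    PUT _ml/datafeeds/datafeed-webapp-event-rate
    {
      "job_id": "webapp-event-rate",
      "indices": [ "webapp" ]
    }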

Follow these steps to configure the job:

  1. Open the machine learning app from the navigation menu and click on the Anomaly Detection tab.
  2. Click on Create job and select the webapp data view. You could optionally use a saved search here with predefined filters applied to narrow down the data used for the job.
  3. Create a single-metric job as we're only interested in the event...

Running classification on data

Unsupervised anomaly detection is useful when looking for abnormal or unexpected behavior in a dataset to guide investigation and analysis. It can unearth silent faults, unexpected usage patterns, resource abuse, or malicious user activity. This is just one class of use cases enabled by machine learning.

It is common to have historical data that, with post analysis, can easily be labeled or tagged with a meaningful value. For example, if you have access to service usage data for your subscription-based online application, along with a record of canceled subscriptions, you could tag snapshots of the usage activity with a label indicating whether the customer churned.
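Labeled data like this can drive a supervised classification job using data frame analytics. The sketch below is purely illustrative, assuming a hypothetical subscriptions index with a churned label field; the job, index, and field names are not from this chapter's dataset:
    PUT _ml/data_frame/analytics/churn-classification
    {
      "source": { "index": "subscriptions" },
      "dest": { "index": "subscriptions-churn-predictions" },
      "analysis": {
        "classification": {
          "dependent_variable": "churned",
          "training_percent": 80
        }
      }
    }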

Consider a different example where an IT team has access to web application logs. With post analysis, given that the request payloads differ from normal requests originating from the application, the team can label events that indicate malicious activity, such as password...

Inferring against incoming data using machine learning

As we learned in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, ingest pipelines can be used to transform, process, and enrich incoming documents before indexing. Ingest pipelines provide an inference processor to run new documents through a trained machine learning model to infer classification or regression results.

Follow these instructions to create and test an ingest pipeline to run inference using the trained machine learning model:

  1. Create a new ingest pipeline as follows. model_id will differ across Kibana instances and can be retrieved from the model pane in the Data Frame Analytics tab in Kibana. In this case, model_id is classification-request-payloads-1615680927179:
    PUT _ingest/pipeline/ml-malicious-request
    {
      "processors": [
        {
          "inference": {
            "model_id": "classification-request-payloads-1615680927179"
          }
        }
      ]
    }

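Before attaching a pipeline like this to an index or indexing request, it can be tested against a sample document with the simulate API. A small sketch using made-up field values; the feature field names here are assumptions, not confirmed from the dataset:
    POST _ingest/pipeline/ml-malicious-request/_simulate
    {
      "docs": [
        {
          "_source": {
            "http": {
              "request": { "bytes": 1024 },
              "response": { "bytes": 512 }
            }
          }
        }
      ]
    }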
Summary

In this chapter, we looked at applying supervised and unsupervised machine learning techniques on data in Elasticsearch for various use cases.

First, we explored the use of unsupervised learning to look for anomalous behavior in time series data. We used single-metric, multi-metric, and population jobs to analyze a dataset of web application logs to look for potentially malicious activity.

Next, we looked at the use of supervised learning to train a machine learning model to classify requests to the web application as malicious, using features in the request (primarily the HTTP request/response size values).

Finally, we looked at how the inference processor in ingest pipelines can be used to run continuous inference on incoming data using a trained model.

In the next chapter, we will move our focus to Beats and their role in the data pipeline. We will look at how different types of events can be collected by Beats agents and sent to Elasticsearch or Logstash...
