You're reading from Machine Learning with the Elastic Stack - Second Edition

Product type: Book

Published in: May 2021

Reading Level: Beginner

Publisher: Packt

ISBN-13: 9781801070034

Edition: 2nd Edition

Languages: Python

Tools: Elasticsearch

Concepts: Machine Learning

Authors (3):

Rich Collier

Camilla Montonen

Bahaaldine Azarmi

Chapter 10: Outlier Detection

In the first section of this book, we discussed anomaly detection in depth, a feature that allows us to detect unusual behavior in time series data in an unsupervised fashion. This works well when we want to detect whether one of our applications is experiencing unusual latency at a particular time or whether a host on our corporate network is transmitting an unusual number of bytes.

In this chapter, we will learn about the second unsupervised learning feature in the Elastic Stack: outlier detection, which allows us to detect unusual entities in indices that are not time series-based. Interesting applications of outlier detection include detecting unusual cells in a tissue sample, investigating unusual houses or areas in a local real estate market, and catching unusual binaries installed on your computer.

The outlier detection functionality in the Elastic Stack is based on an ensemble or a grouping of four different outlier detection...

Technical requirements

The material in this chapter relies on using Elasticsearch version 7.9 or above. The figures in this chapter have been generated using Elasticsearch 7.10. Code snippets and code examples used in this chapter are under the chapter10 folder in the book's GitHub repository: https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition.

Discovering how outlier detection works

Outlier detection can offer insights into datasets by discovering which points are different or unusual, but how does outlier detection in the Elastic Stack work? To understand how outlier detection functionality can be constructed, let's start by thinking conceptually about how you would design the algorithm, and then see how our conceptual ideas can be formalized into the four separate algorithms that make up the outlier detection ensemble in Elasticsearch.

Suppose for a second that we have a two-dimensional set of weight and circumference measurements...

Applying outlier detection in practice

In this section, we will take a look at a practical example of outlier detection using a public dataset describing the physicochemical properties of wine. This dataset is available for download from the University of California Irvine (UCI) repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).

The wine dataset is composed of two CSV files: one describing the physicochemical properties of white wine, the other those of red wine. In this walk-through, we will be focusing on the white wine dataset, but you are welcome to use the data for red wine as well since most of the steps described in this chapter should be applicable to both.

First, let's import the dataset into our Elasticsearch cluster using the Data Visualizer tool, which you can find under the Machine Learning app in Kibana. We will create an index for the white wine dataset and call it winequality-white:

Figure 10.7 – The...
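Once the winequality-white index exists, an outlier detection job can also be defined outside the Kibana wizard by sending a request body to the create data frame analytics jobs endpoint (PUT _ml/data_frame/analytics/<job_id>). The following is a minimal sketch of what such a body might look like, not the walk-through used in this chapter. The destination index name winequality-white-outliers and the helper build_outlier_job are hypothetical names chosen for illustration, and the snippet only builds and prints the JSON rather than contacting a cluster.

```python
import json


def build_outlier_job(source_index: str, dest_index: str) -> dict:
    """Build a request body for PUT _ml/data_frame/analytics/<job_id>."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        # An empty object requests the default outlier detection settings.
        "analysis": {"outlier_detection": {}},
    }


# "winequality-white-outliers" is a hypothetical destination index name.
job_body = build_outlier_job("winequality-white", "winequality-white-outliers")
print(json.dumps(job_body, indent=2))
```

The printed JSON could be pasted into the Kibana Dev Tools console as the body of the PUT request; the results would then be written to the destination index.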

Evaluating outlier detection with the Evaluate API

In the previous section, we touched on the fact that it can be hard for a user to know how to set the threshold for outlier scores in order to group the data points in the dataset into normal and outlier categories. In this section, we will show how to approach this issue if you have a labeled dataset that contains, for each point, the ground truth value that records whether the point is an outlier. Before we dive into the practical demonstration, let's take a moment to understand some key performance metrics that are used in evaluating the performance of the outlier detection algorithm.

One of the simplest ways we can measure the performance of the algorithm is to compute the number of data points that it correctly predicted as outliers; in other words, the number of true positives (TPs). In addition, we also want to know the number of true negatives (TNs): how many normal data points were correctly predicted as normal. By extension...
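To make these counts concrete, here is a small sketch that tallies true positives and true negatives, together with their standard counterparts, false positives and false negatives, for a chosen outlier score threshold. The labels, scores, and threshold are made-up illustrative values, and treating a score greater than or equal to the threshold as a predicted outlier is an assumption of this sketch.

```python
def confusion_counts(actual, scores, threshold):
    """Count TP, TN, FP, and FN for a given outlier score threshold."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for is_outlier, score in zip(actual, scores):
        # Assumption: a score at or above the threshold is a predicted outlier.
        predicted_outlier = score >= threshold
        if predicted_outlier and is_outlier:
            counts["tp"] += 1
        elif predicted_outlier and not is_outlier:
            counts["fp"] += 1
        elif not predicted_outlier and is_outlier:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up ground truth labels and outlier scores for five data points.
actual = [True, False, False, True, False]
scores = [0.92, 0.10, 0.61, 0.35, 0.05]

counts = confusion_counts(actual, scores, threshold=0.5)
print(counts)
```

Re-running the function with different thresholds shows how the four counts trade off against one another, which is the basis for choosing a threshold on labeled data.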

Hyperparameter tuning for outlier detection

For the more advanced user, the Data Frame Analytics wizard offers an opportunity to configure and tune hyperparameters – various knobs and dials that fine-tune how the outlier detection algorithm works. The available hyperparameters are displayed in Figure 10.17. For example, we can direct the outlier detection job to use only a certain type of outlier detection method instead of the ensemble, to use a certain value for the number of nearest neighbors that are used in the computation in the ensemble, and to assume that a certain portion of the data is outlying.
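The three settings mentioned above correspond, in the Elasticsearch data frame analytics API, to the method, n_neighbors, and outlier_fraction parameters of the outlier_detection analysis object. The sketch below assembles such an object with some basic validation. The helper outlier_analysis and the specific values are illustrative assumptions, and the parameter names and accepted methods should be checked against the API reference for your Elasticsearch version.

```python
import json

# Method names as listed in the Elasticsearch API reference (verify for your version).
VALID_METHODS = {"lof", "ldof", "distance_kth_nn", "distance_knn", "ensemble"}


def outlier_analysis(method=None, n_neighbors=None, outlier_fraction=None) -> dict:
    """Build the "analysis" section of an outlier detection job body."""
    params = {}
    if method is not None:
        if method not in VALID_METHODS:
            raise ValueError(f"unknown method: {method}")
        params["method"] = method
    if n_neighbors is not None:
        if n_neighbors < 1:
            raise ValueError("n_neighbors must be a positive integer")
        params["n_neighbors"] = n_neighbors
    if outlier_fraction is not None:
        if not 0.0 <= outlier_fraction <= 1.0:
            raise ValueError("outlier_fraction must be between 0 and 1")
        params["outlier_fraction"] = outlier_fraction
    return {"outlier_detection": params}


# Illustrative values only; they are not recommendations.
analysis = outlier_analysis(method="lof", n_neighbors=20, outlier_fraction=0.05)
print(json.dumps(analysis, indent=2))
```

Leaving a parameter unset keeps it out of the request, so the job falls back to the default behavior for that setting.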

Please note that while it is good to play around with these settings to experiment and get a feel for how they affect the final results, if you want to customize any of these for a production use case, you should carefully study the characteristics of your data and be aware of how these characteristics will interact with your chosen hyperparameter settings. More...

Summary

To conclude the chapter, let's recap the main characteristics of the second unsupervised learning feature in the Elastic Stack: outlier detection. Outlier detection can be used to detect unusual data points in single-dimensional or multidimensional datasets.

The algorithm is based on an ensemble of four separate measures: two distance-based measures based on kth-nearest neighbors and two density-based measures. The combination of these measures captures how far a given data point is from its neighbors and from the general mass of data in the dataset. This unusualness is captured in a numerical outlier score that ranges from 0 to 1. The closer a given data point scores to 1, the more unusual it is in the dataset.
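The two distance-based measures can be illustrated with a short, plain-Python sketch: for each point, it computes the distance to its kth nearest neighbor and the mean distance to its k nearest neighbors. This is a conceptual illustration only, not Elastic's implementation. It omits the two density-based measures and the normalization into a 0 to 1 outlier score, and the sample points and the choice of k are made up.

```python
import math


def knn_distances(points, k):
    """Return (distance to kth nearest neighbor, mean distance to k nearest
    neighbors) for every point, using Euclidean distance."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    results = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        results.append((nearest[-1], sum(nearest) / k))
    return results


# Four points clustered near the origin and one point far away from them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
distances = knn_distances(points, k=2)

for point, (kth, mean) in zip(points, distances):
    print(point, round(kth, 3), round(mean, 3))
```

The isolated point at (10.0, 10.0) has by far the largest values for both measures, which is the intuition behind treating large neighbor distances as a sign of unusualness.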

In addition to the outlier score, for each feature or field of a point, we compute a quantity known as the feature influence. The higher the feature influence for a given field, the more that field is responsible for a given point being unusual. These feature...

Authors (3)

Rich Collier

Rich Collier is a solutions architect at Elastic. Joining the Elastic team from the Prelert acquisition, Rich has over 20 years' experience as a solutions architect and pre-sales systems engineer for software, hardware, and service-based solutions. Rich's technical specialties include big data analytics, machine learning, anomaly detection, threat detection, security operations, application performance management, web applications, and contact center technologies. Rich is based in Boston, Massachusetts.

Camilla Montonen

Camilla Montonen is a Senior Machine Learning Engineer at Elastic.

Bahaaldine Azarmi

Bahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.
