You're reading from F# for Machine Learning Essentials

Product type: Book
Published in: Feb 2016
Reading level: Expert
Publisher: Packt Publishing
ISBN-13: 9781783989348
Edition: 1st Edition

Author: Sudipta Mukherjee

Sudipta Mukherjee was born in Kolkata and migrated to Bangalore. He is an electronics engineer by education and a computer engineer/scientist by profession and passion. He graduated in 2004 with a degree in electronics and communication engineering. He has a keen interest in data structures, algorithms, text processing, natural language processing tools development, programming languages, and machine learning at large. His first book, on data structures using C, was received quite well; parts of it can be read on Google Books, and it was also translated into simplified Chinese, available from Amazon.cn. This is Sudipta's second book with Packt Publishing; his first, .NET 4.0 Generics, was also received very well. During the last few years, he has been hooked on the functional programming style. His book on functional programming, Thinking in LINQ, was released in 2014. He lives in Bangalore with his wife and son. Sudipta can be reached via e-mail at sudipto80@yahoo.com and via Twitter at @samthecoder.

Chapter 7. Anomaly Detection

"Find the odd one out automatically."

Anomaly detection is the art of automatically finding the odd one out in a bunch of data. As the name suggests, it is the science of finding data that is anomalous compared to the rest. However, there are several kinds of anomalies, and sometimes normal (non-anomalous) data can look like anomalous data. Anomaly detection is mostly an unsupervised learning problem because it is very difficult, if not impossible, to get a labeled training dataset of anomalies. Sometimes, anomalies are referred to as "outliers."

Objective


After reading this chapter, you will be able to apply some of the techniques to identify any anomaly in data, and you will have a general understanding of how and where anomaly detection algorithms can be useful. All code is available at https://gist.github.com/sudipto80/e599ab069981736ffa1d.

Different anomaly detection algorithms

The following algorithms will be discussed in this chapter:

  • Statistical anomaly detection

  • Nearest neighbor based anomaly detection

  • Density estimation based anomaly detection

Some cool things you will do

With the techniques learned in this chapter, you will be able to spot lies. You will also have a deeper understanding of how fraudulent behavior on credit cards is detected.

The different types of anomalies

Anomalies can be classified into any of the following categories:

  • Point anomalies

  • Contextual anomalies

  • Collective anomalies

We will go through each one of them in detail:

  • Point anomalies: If an individual data instance can be considered as anomalous with respect...

Detecting point anomalies using IQR (Interquartile Range)


The basic algorithm for finding anomalies or outliers is based on the interquartile range. The idea behind this approach is that elements falling far off either side of the normal distribution are anomalous; how far off is determined by the boundaries of the box plot.

In descriptive statistics, the interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion equal to the difference between the upper and lower quartiles: IQR = Q3 − Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile; these quartiles can be seen clearly on a box plot of the data. The IQR is a trimmed estimator, defined as the 25% trimmed range, and is one of the most basic and robust measures of scale.

The interquartile range is often used to find outliers in data. Outliers are observations that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). In a boxplot, the highest...
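This rule can be sketched in a few lines of F#. The names and the quartile convention below are illustrative, not the book's original listing; exact quartile definitions vary between libraries.

```fsharp
// Quartile via linear interpolation on the sorted data (one common convention)
let quantile (q : float) (xs : float list) =
    let sorted = List.sort xs
    let pos = q * float (List.length sorted - 1)
    let lo, hi = int (floor pos), int (ceil pos)
    let frac = pos - float lo
    sorted.[lo] + frac * (sorted.[hi] - sorted.[lo])

// An element is an outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
let iqrOutliers (xs : float list) =
    let q1, q3 = quantile 0.25 xs, quantile 0.75 xs
    let iqr = q3 - q1
    xs |> List.filter (fun x -> x < q1 - 1.5 * iqr || x > q3 + 1.5 * iqr)

// iqrOutliers [2.0; 3.0; 4.0; 5.0; 100.0] evaluates to [100.0]
```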

Detecting point anomalies using Grubb's test


Grubb's test (also known as the maximum normed residual test) is used to detect anomalies in a univariate dataset (which means there is only one variable per data instance) under the assumption that the data is generated by a Gaussian distribution. For each test instance x, its score z is computed as follows:

z = |x − x̄| / s

where x̄ is the average of the data instances and s is the standard deviation of the data points.

The following functions determine the scores of each element in the list:
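The book's listing is not reproduced in this excerpt; a minimal, self-contained sketch of such a scoring function (names are illustrative) might look like this:

```fsharp
// Grubb's score of each element: z = |x - mean| / standard deviation
let scores (xs : float list) =
    let m = List.average xs
    let s = sqrt (List.sumBy (fun x -> (x - m) * (x - m)) xs / float (List.length xs))
    xs |> List.map (fun x -> x, abs (x - m) / s)
```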

A data instance is declared to be anomalous if it fulfills the following condition:

z > ((N − 1) / √N) × √(t² / (N − 2 + t²))

Here, N is the number of elements in the collection and t is the threshold (a critical value of the t-distribution) used to declare an instance to be anomalous or normal.

The following function finds the elements whose score indicates that the element might be anomalous. The xs parameter denotes the entire collection and t denotes the value of the threshold.
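Again, the listing itself is not part of this excerpt; a self-contained sketch of such a function (illustrative names, and comparing the score directly against t for simplicity) could be:

```fsharp
// Flag the elements whose Grubb's score exceeds the threshold t
let findAnomalies (t : float) (xs : float list) =
    let m = List.average xs
    let s = sqrt (List.sumBy (fun x -> (x - m) * (x - m)) xs / float (List.length xs))
    xs |> List.filter (fun x -> abs (x - m) / s > t)

// findAnomalies 1.5 [2.0; 2.0; 5.0; 5.0; 100.0] flags 100.0
```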

The following code shows you how to use these functions to find anomalous data instances...

Grubb's test for multivariate data using Mahalanobis distance


Grubb's test can be used for multivariate data by transforming multivariate data to univariate data using the following transformation:

y² = (x − x̄)ᵀ S⁻¹ (x − x̄)

where x̄ is the mean vector and S is the covariance matrix of the data.

The following code finds these y-squared values from a given dataset:

The following are the functions to calculate the covariance matrix:
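The covariance listing is not reproduced in this excerpt; a self-contained sketch that computes the (population) covariance matrix of a dataset given as a list of rows, with illustrative names, might be:

```fsharp
// Population covariance matrix of a dataset given as a list of rows
// (each inner list is one observation)
let covarianceMatrix (rows : float list list) =
    let n = float (List.length rows)
    let dims = List.length (List.head rows)
    let means =
        [ for j in 0 .. dims - 1 -> (rows |> List.sumBy (fun r -> r.[j])) / n ]
    [ for i in 0 .. dims - 1 ->
        [ for j in 0 .. dims - 1 ->
            (rows |> List.sumBy (fun r ->
                (r.[i] - means.[i]) * (r.[j] - means.[j]))) / n ] ]
```

For example, `covarianceMatrix [[2.0; 2.0]; [2.0; 5.0]; [6.0; 5.0]; [100.0; 345.0]]` yields a 2×2 matrix for the section's sample data.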

The following is the input given:

This produces the following output:

ys = [([2.0; 2.0], -48066176.91); ([2.0; 5.0], -48066176.91);
 ([6.0; 5.0], -2584692.113); ([100.0; 345.0], -2.097348892e+12)]

Now, Grubb's test for univariate data can be applied on top of these generated values:

[-48066176.91; -48066176.91; -2584692.113; -2.097348892e+12]

The z scores of these values are:

[0.5773335755; 0.5773335755; 0.5773836562; 1.732050807]

As you can see, the z-score corresponding to the last entry is considerably bigger than the z-score of the rest. This means the last element in the multivariate dataset (which is [100;345]) is anomalous.

Imagine...

Chi-squared statistic to determine anomalies


Ye and Chen used a chi-squared statistic to determine anomalies in operating system call data. The training phase assumes that the normal data has a multivariate normal distribution. The value of the statistic is determined as:

X² = Σᵢ₌₁..ₙ (Xᵢ − Eᵢ)² / Eᵢ

where Xᵢ denotes the observed value of the ith variable, Eᵢ is the expected value of the ith variable (obtained from the training data), and n is the number of variables. A large value of X² denotes that the observed sample contains anomalies.

The following function calculates the respective values for all the elements in a collection:
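The book's listing is not reproduced in this excerpt; a sketch consistent with the sample run below (names are illustrative) could be the following. Note that, to reproduce the printed figures, each squared difference is divided by the observed value, whereas the textbook statistic divides by the expected value.

```fsharp
// Per-variable contribution to the statistic for each observed/expected pair.
// Dividing by the OBSERVED value here matches the printed sample run;
// the textbook chi-squared statistic divides by the expected value instead.
let chiSquaredValues (observed : float list) (expected : float list) =
    List.zip observed expected
    |> List.map (fun (o, e) -> o, (o - e) * (o - e) / o)
```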

When this function is called with [1.;100.;2.;4.5;2.55;70.] as the observed data and [111.;100.;2.;4.5;2.55;710.] as the expected values, the following result is obtained:

[(1.0, 12100.0); (100.0, 0.0); (2.0, 0.0); (4.5, 0.0); (2.55, 0.0);
   (70.0, 5851.428571)]

As you can see, the computed value is very high (12100.0 and 5851.428571) for the first and last observations. This means that the...

Detecting anomalies using density estimation


In general, normal elements are far more common than anomalous entries in any system. So, if the probability of occurrence of the elements in a collection is modeled by a Gaussian (normal) distribution, then we can conclude that the elements whose estimated probability density exceeds a predefined threshold are normal, and those falling below it are probably anomalies.

Let's say that X is a random variable with m rows. The following couple of formulae find the average and standard deviation for feature j, or, in other words, for all the elements of X in the jth column if X is represented as a matrix:

μⱼ = (1/m) Σᵢ₌₁..ₘ xⱼ⁽ⁱ⁾
σⱼ² = (1/m) Σᵢ₌₁..ₘ (xⱼ⁽ⁱ⁾ − μⱼ)²

Given a new entry x, the following formula calculates the probability density estimation:

p(x) = Πⱼ₌₁..ₙ (1 / (√(2π) σⱼ)) exp(−(xⱼ − μⱼ)² / (2σⱼ²))

If p(x) is less than a predefined threshold ε, then the entry is tagged as anomalous; otherwise, it is tagged as normal.

The following code finds the average value of the jth feature:
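The listings are not included in this excerpt; a self-contained sketch of the per-feature average, the per-feature variance, and a px density estimator (illustrative names, assuming independent features) might look like this:

```fsharp
// Mean of the jth feature (jth column) of a dataset given as a list of rows
let featureMean (rows : float list list) (j : int) =
    (rows |> List.sumBy (fun r -> r.[j])) / float (List.length rows)

// Variance of the jth feature
let featureVar (rows : float list list) (j : int) =
    let m = featureMean rows j
    (rows |> List.sumBy (fun r -> (r.[j] - m) * (r.[j] - m))) / float (List.length rows)

// Gaussian density estimate p(x) for a new row x: the product over features
// of univariate normal densities with the per-feature mean and variance
let px (rows : float list list) (x : float list) =
    x
    |> List.mapi (fun j xj ->
        let mu, var = featureMean rows j, featureVar rows j
        exp (-(xj - mu) * (xj - mu) / (2.0 * var)) / sqrt (2.0 * System.Math.PI * var))
    |> List.fold (*) 1.0
```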

Here is a sample run of the px method:

>...

Strategy to convert a collective anomaly to a point anomaly problem


A collective anomaly can be converted to a point anomaly problem and then solved using the techniques mentioned previously. Each collective anomaly can be represented as a point anomaly in N dimensions, where N is the size of the sliding window. Let's say that we have the following numbers: 1;45;1;3;54;1;45;24;5;23;5;5. A sliding window of size 3 then produces the series of collections shown below.
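The original listing is not reproduced in this excerpt; a minimal sketch using Seq.windowed achieves the same thing:

```fsharp
let data = [1; 45; 1; 3; 54; 1; 45; 24; 5; 23; 5; 5]
let windowSize = 3

// Seq.windowed yields every contiguous run of windowSize elements
let indices =
    data
    |> Seq.windowed windowSize
    |> Seq.map List.ofArray
    |> List.ofSeq
```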

This produces the following lists:

val data : int list = [1; 45; 1; 3; 54; 1; 45; 24; 5; 23; 5; 5]
val windowSize : int = 3
val indices : int list list =
  [[1; 45; 1]; [45; 1; 3]; [1; 3; 54]; [3; 54; 1]; [54; 1; 45];
   [1; 45; 24]; [45; 24; 5]; [24; 5; 23]; [5; 23; 5]; [23; 5; 5]]

Now, as you have seen before, each of these lists can be represented as one point in three dimensions, and Grubb's test for multivariate data can be applied.

Dealing with categorical data in collective anomalies


As another illustrative example, consider a sequence of actions occurring on a computer, as shown below:

: : : http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp, smtp-mail, http-web : : :

The highlighted sequence of events (ssh, buffer-overflow, ftp) corresponds to a typical web-based attack by a remote machine, followed by the copying of data from the host computer to a remote destination via ftp. It should be noted that this collection of events is an anomaly, but the individual events are not anomalies when they occur elsewhere in the sequence.

These types of categorical data can be transformed into numeric data by assigning a particular number to each command, for example with the following mapping:

Command            Numeric Representation
http-web           1
ssh                2
buffer-overflow    3
ftp                4
...
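Such a mapping can be sketched in F# as follows. The code for smtp-mail is not shown in the excerpt's mapping table; the value 5 is an assumed value for illustration only.

```fsharp
// Hypothetical encoding of commands as numbers (smtp-mail's code of 5 is
// an assumed value; the excerpt's table does not show it)
let commandCode =
    [ "http-web", 1; "ssh", 2; "buffer-overflow", 3; "ftp", 4; "smtp-mail", 5 ]
    |> Map.ofList

// Encode an event sequence numerically so the sliding-window trick applies
let encode (events : string list) =
    events |> List.map (fun e -> commandCode.[e])

// encode ["ssh"; "buffer-overflow"; "ftp"] evaluates to [2; 3; 4]
```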

Summary


Anomaly detection is a very active field of research because what's anomalous now may not remain anomalous forever. This poses a significant challenge to designing a good anomaly detection algorithm. Although the algorithms discussed in this chapter mostly deal with point anomalies, they can also be used to detect sequential anomalies with a little bit of feature extraction.

Sometimes, anomaly detection is treated as a classification problem, and several classification algorithms, such as k-NN, SVM, and neural networks, are deployed to identify anomalous entries. The challenge, however, is to get well-labeled data. Instead, heuristics are often used to assign each data element an anomaly score, and the top few with the highest anomaly scores (sometimes those above a given threshold) are determined to be anomalous.

Anomaly detection has several applications, such as finding imposters through anomaly detection on keyboard dynamics, pedestrian detection, and landmine detection from...


