You're reading from F# for Machine Learning Essentials

Product type: Book
Published in: Feb 2016
Reading level: Expert
Publisher: Packt Publishing
ISBN-13: 9781783989348
Edition: 1st Edition

Author: Sudipta Mukherjee

Sudipta Mukherjee was born in Kolkata and migrated to Bangalore. He is an electronics engineer by education and a computer engineer/scientist by profession and passion. He graduated in 2004 with a degree in electronics and communication engineering. He has a keen interest in data structures, algorithms, text processing, natural language processing tools development, programming languages, and machine learning at large. His first book, on data structures using C, was received quite well; parts of it can be read on Google Books, and it was also translated into simplified Chinese, available from Amazon.cn. This is Sudipta's second book with Packt Publishing; his first, .NET 4.0 Generics, was also received very well. During the last few years, he has been hooked on the functional programming style. His book on functional programming, Thinking in LINQ, was released in 2014. He lives in Bangalore with his wife and son. Sudipta can be reached via e-mail at sudipto80@yahoo.com and via Twitter at @samthecoder.

Chapter 7. Anomaly Detection

"Find the odd one out automatically."

Anomaly detection is the art of automatically finding the odd one out in a bunch of data. As the name suggests, it is the science of finding data that is anomalous compared to the rest. However, there are several kinds of anomalies, and sometimes normal (non-anomalous) data can look like anomalous data. Anomaly detection is mostly an unsupervised learning problem because it is very difficult, if not impossible, to get a labeled training dataset of anomalies. Sometimes, anomalies are referred to as "outliers."

Objective


After reading this chapter, you will be able to apply some of the techniques to identify any anomaly in data, and you will have a general understanding of how and where anomaly detection algorithms can be useful. All code is available at https://gist.github.com/sudipto80/e599ab069981736ffa1d.

Different anomaly detection algorithms

The following algorithms will be discussed in this chapter:

  • Statistical anomaly detection

  • Nearest neighbor based anomaly detection

  • Density estimation based anomaly detection

Some cool things you will do

With the techniques learned in this chapter, you will be able to spot lies. You will also have a deeper understanding of how fraudulent behavior on credit cards is detected.

The different types of anomalies

Anomalies can be classified into any of the following categories:

  • Point anomalies

  • Contextual anomalies

  • Collective anomalies

We will go through each one of them in detail:

  • Point anomalies: If an individual data instance can be considered as anomalous with respect...

Detecting point anomalies using IQR (Interquartile Range)


The basic algorithm for finding anomalies or outliers is based on the interquartile range. The idea behind this approach is that elements falling far off either side of the normal distribution are anomalous; how far off is determined by the boundaries of the box plot.

In descriptive statistics, the interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion equal to the difference between the upper and lower quartiles: IQR = Q3 − Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile; these quartiles can be seen clearly on a box plot of the data. The IQR is a trimmed estimator, defined as the 25% trimmed range, and is one of the most basic and robust measures of scale.

The interquartile range is often used to find outliers in data. Outliers are observations that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). In a boxplot, the highest...
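This rule can be sketched in a few lines of F#. The names and the quartile convention below are illustrative, not the book's original listing; exact quartile definitions vary between libraries.

```fsharp
// Quartile via linear interpolation on the sorted data (one common convention)
let quantile (q : float) (xs : float list) =
    let sorted = List.sort xs
    let pos = q * float (List.length sorted - 1)
    let lo, hi = int (floor pos), int (ceil pos)
    let frac = pos - float lo
    sorted.[lo] + frac * (sorted.[hi] - sorted.[lo])

// An element is an outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
let iqrOutliers (xs : float list) =
    let q1, q3 = quantile 0.25 xs, quantile 0.75 xs
    let iqr = q3 - q1
    xs |> List.filter (fun x -> x < q1 - 1.5 * iqr || x > q3 + 1.5 * iqr)

// iqrOutliers [2.0; 3.0; 4.0; 5.0; 100.0] evaluates to [100.0]
```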

Detecting point anomalies using Grubb's test


Grubb's test (also known as the maximum normed residual test) is used to detect anomalies in a univariate dataset (which means there is only one variable per data instance) under the assumption that the data is generated by a Gaussian distribution. For each test instance x, its score z is computed as follows:

z = |x − x̄| / s

where x̄ is the average of the data instances and s is the standard deviation of the data points.

The following functions determine the scores of each element in the list:
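The book's listing is not reproduced in this excerpt; a minimal, self-contained sketch of such a scoring function (names are illustrative) might look like this:

```fsharp
// Grubb's score of each element: z = |x - mean| / standard deviation
let scores (xs : float list) =
    let m = List.average xs
    let s = sqrt (List.sumBy (fun x -> (x - m) * (x - m)) xs / float (List.length xs))
    xs |> List.map (fun x -> x, abs (x - m) / s)
```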

A data instance is declared to be anomalous if it fulfills the following condition:

z > ((N − 1) / √N) × √(t² / (N − 2 + t²))

Here, N is the number of elements in the collection and t is the threshold (a critical value of the t-distribution) used to declare an instance to be anomalous or normal.

The following function finds the elements whose score indicates that the element might be anomalous. The xs parameter denotes the entire collection and t denotes the value of the threshold.
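Again, the listing itself is not part of this excerpt; a self-contained sketch of such a function (illustrative names, and comparing the score directly against t for simplicity) could be:

```fsharp
// Flag the elements whose Grubb's score exceeds the threshold t
let findAnomalies (t : float) (xs : float list) =
    let m = List.average xs
    let s = sqrt (List.sumBy (fun x -> (x - m) * (x - m)) xs / float (List.length xs))
    xs |> List.filter (fun x -> abs (x - m) / s > t)

// findAnomalies 1.5 [2.0; 2.0; 5.0; 5.0; 100.0] flags 100.0
```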

The following code shows you how to use these functions to find anomalous data instances...

Grubb's test for multivariate data using Mahalanobis distance


Grubb's test can be used for multivariate data by transforming multivariate data to univariate data using the following transformation:

y² = (x − x̄)ᵀ S⁻¹ (x − x̄)

where x̄ is the mean vector and S is the covariance matrix of the data.

The following code finds these y-squared values from a given dataset:

The following are the functions to calculate the covariance matrix:
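The covariance listing is not reproduced in this excerpt; a self-contained sketch that computes the (population) covariance matrix of a dataset given as a list of rows, with illustrative names, might be:

```fsharp
// Population covariance matrix of a dataset given as a list of rows
// (each inner list is one observation)
let covarianceMatrix (rows : float list list) =
    let n = float (List.length rows)
    let dims = List.length (List.head rows)
    let means =
        [ for j in 0 .. dims - 1 -> (rows |> List.sumBy (fun r -> r.[j])) / n ]
    [ for i in 0 .. dims - 1 ->
        [ for j in 0 .. dims - 1 ->
            (rows |> List.sumBy (fun r ->
                (r.[i] - means.[i]) * (r.[j] - means.[j]))) / n ] ]
```

For example, `covarianceMatrix [[2.0; 2.0]; [2.0; 5.0]; [6.0; 5.0]; [100.0; 345.0]]` yields a 2×2 matrix for the section's sample data.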

The following is the input given:

This produces the following output:

ys = [([2.0; 2.0], -48066176.91); ([2.0; 5.0], -48066176.91);
 ([6.0; 5.0], -2584692.113); ([100.0; 345.0], -2.097348892e+12)]

Now, Grubb's test for univariate data can be applied on top of these generated values:

[-48066176.91; -48066176.91; -2584692.113; -2.097348892e+12]

The z scores of these values are:

[0.5773335755; 0.5773335755; 0.5773836562; 1.732050807]

As you can see, the z-score corresponding to the last entry is considerably bigger than the z-score of the rest. This means the last element in the multivariate dataset (which is [100;345]) is anomalous.

Imagine...

Chi-squared statistic to determine anomalies


Ye and Chen used a chi-squared statistic to determine anomalies in operating system call data. The training phase assumes that the normal data has a multivariate normal distribution. The value of the statistic is determined as:

X² = Σᵢ₌₁..ₙ (Xᵢ − Eᵢ)² / Eᵢ

where Xᵢ denotes the observed value of the ith variable, Eᵢ is the expected value of the ith variable (obtained from the training data), and n is the number of variables. A large value of X² denotes that the observed sample contains anomalies.

The following function calculates the respective values for all the elements in a collection:
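The book's listing is not reproduced in this excerpt; a sketch consistent with the sample run below (names are illustrative) could be the following. Note that, to reproduce the printed figures, each squared difference is divided by the observed value, whereas the textbook statistic divides by the expected value.

```fsharp
// Per-variable contribution to the statistic for each observed/expected pair.
// Dividing by the OBSERVED value here matches the printed sample run;
// the textbook chi-squared statistic divides by the expected value instead.
let chiSquaredValues (observed : float list) (expected : float list) =
    List.zip observed expected
    |> List.map (fun (o, e) -> o, (o - e) * (o - e) / o)
```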

When this function is called with [1.;100.;2.;4.5;2.55;70.] as the observed data and [111.;100.;2.;4.5;2.55;710.] as the expected values, the following result is obtained:

[(1.0, 12100.0); (100.0, 0.0); (2.0, 0.0); (4.5, 0.0); (2.55, 0.0);
   (70.0, 5851.428571)]

As you can see, the computed value is very high (12100.0 and 5851.428571) for the first and last observations. This means that the...

Detecting anomalies using density estimation


In general, normal elements are far more common than anomalous entries in any system. So, if the probability of occurrence of the elements in a collection is modeled by a Gaussian (normal) distribution, then we can conclude that the elements whose estimated probability density exceeds a predefined threshold are normal, and those falling below it are probably anomalies.

Let's say that X is a random variable with m rows. The following couple of formulae find the average and standard deviation for feature j, or, in other words, for all the elements of X in the jth column if X is represented as a matrix:

μⱼ = (1/m) Σᵢ₌₁..ₘ xⱼ⁽ⁱ⁾
σⱼ² = (1/m) Σᵢ₌₁..ₘ (xⱼ⁽ⁱ⁾ − μⱼ)²

Given a new entry x, the following formula calculates the probability density estimation:

p(x) = Πⱼ₌₁..ₙ (1 / (√(2π) σⱼ)) exp(−(xⱼ − μⱼ)² / (2σⱼ²))

If p(x) is less than a predefined threshold ε, then the entry is tagged as anomalous; otherwise, it is tagged as normal.

The following code finds the average value of the jth feature:
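The listings are not included in this excerpt; a self-contained sketch of the per-feature average, the per-feature variance, and a px density estimator (illustrative names, assuming independent features) might look like this:

```fsharp
// Mean of the jth feature (jth column) of a dataset given as a list of rows
let featureMean (rows : float list list) (j : int) =
    (rows |> List.sumBy (fun r -> r.[j])) / float (List.length rows)

// Variance of the jth feature
let featureVar (rows : float list list) (j : int) =
    let m = featureMean rows j
    (rows |> List.sumBy (fun r -> (r.[j] - m) * (r.[j] - m))) / float (List.length rows)

// Gaussian density estimate p(x) for a new row x: the product over features
// of univariate normal densities with the per-feature mean and variance
let px (rows : float list list) (x : float list) =
    x
    |> List.mapi (fun j xj ->
        let mu, var = featureMean rows j, featureVar rows j
        exp (-(xj - mu) * (xj - mu) / (2.0 * var)) / sqrt (2.0 * System.Math.PI * var))
    |> List.fold (*) 1.0
```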

Here is a sample run of the px method:

>...

Strategy to convert a collective anomaly to a point anomaly problem


A collective anomaly can be converted to a point anomaly problem and then solved using the techniques mentioned previously. Each collective anomaly can be represented as a point anomaly in N dimensions, where N is the size of the sliding window. Let's say that we have the following numbers: 1;45;1;3;54;1;45;24;5;23;5;5. A sliding window of size 3 then produces the series of collections shown below.
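The original listing is not reproduced in this excerpt; a minimal sketch using Seq.windowed achieves the same thing:

```fsharp
let data = [1; 45; 1; 3; 54; 1; 45; 24; 5; 23; 5; 5]
let windowSize = 3

// Seq.windowed yields every contiguous run of windowSize elements
let indices =
    data
    |> Seq.windowed windowSize
    |> Seq.map List.ofArray
    |> List.ofSeq
```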

This produces the following lists:

val data : int list = [1; 45; 1; 3; 54; 1; 45; 24; 5; 23; 5; 5]
val windowSize : int = 3
val indices : int list list =
  [[1; 45; 1]; [45; 1; 3]; [1; 3; 54]; [3; 54; 1]; [54; 1; 45];
   [1; 45; 24]; [45; 24; 5]; [24; 5; 23]; [5; 23; 5]; [23; 5; 5]]

Now, as you have seen before, each of these lists can be represented as one point in three dimensions, and Grubb's test for multivariate data can be applied.

Dealing with categorical data in collective anomalies


As another illustrative example, consider a sequence of actions occurring on a computer, as shown below:

: : : http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp, smtp-mail, http-web : : :

The highlighted sequence of events (ssh, buffer-overflow, ftp) corresponds to a typical web-based attack by a remote machine, followed by the copying of data from the host computer to a remote destination via ftp. It should be noted that this collection of events is an anomaly, but the individual events are not anomalies when they occur elsewhere in the sequence.

These types of categorical data can be transformed into numeric data by assigning a particular number to each command, for example with the following mapping:

Command            Numeric Representation
http-web           1
ssh                2
buffer-overflow    3
ftp                4
...
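Such a mapping can be sketched in F# as follows. The code for smtp-mail is not shown in the excerpt's mapping table; the value 5 is an assumed value for illustration only.

```fsharp
// Hypothetical encoding of commands as numbers (smtp-mail's code of 5 is
// an assumed value; the excerpt's table does not show it)
let commandCode =
    [ "http-web", 1; "ssh", 2; "buffer-overflow", 3; "ftp", 4; "smtp-mail", 5 ]
    |> Map.ofList

// Encode an event sequence numerically so the sliding-window trick applies
let encode (events : string list) =
    events |> List.map (fun e -> commandCode.[e])

// encode ["ssh"; "buffer-overflow"; "ftp"] evaluates to [2; 3; 4]
```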

Summary


Anomaly detection is a very active field of research because what's anomalous now may not remain anomalous forever. This poses a significant challenge to designing a good anomaly detection algorithm. Although the algorithms discussed in this chapter mostly deal with point anomalies, they can also be used to detect sequential anomalies with a little bit of feature extraction.

Sometimes, anomaly detection is treated as a classification problem, and several classification algorithms, such as k-NN, SVM, and neural networks, are deployed to identify anomalous entries. The challenge, however, is to get well-labeled data. Instead, heuristics are often used to assign each data element an anomaly score, and the top few with the highest anomaly scores (sometimes those above a given threshold) are determined to be anomalous.

Anomaly detection has several applications, such as finding imposters through anomaly detection on keyboard dynamics, pedestrian detection, and landmine detection from...


