You're reading from  Python Deep Learning

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781786464453
Edition: 1st Edition
Authors (4):
Valentino Zocca

Valentino Zocca has a PhD degree and graduated with a Laurea in mathematics from the University of Maryland, USA, and University of Rome, respectively, and spent a semester at the University of Warwick. He started working on high-tech projects of an advanced stereo 3D Earth visualization software with head tracking at Autometric, a company later bought by Boeing. There he developed many mathematical algorithms and predictive models, and using Hadoop he automated several satellite-imagery visualization programs. He has worked as an independent consultant at the U.S. Census Bureau, in the USA and in Italy. Currently, Valentino lives in New York and works as an independent consultant to a large financial company.

Gianmario Spacagna

Gianmario Spacagna is a senior data scientist at Pirelli, processing sensors and telemetry data for internet of things (IoT) and connected-vehicle applications. He works closely with tire mechanics, engineers, and business units to analyze and formulate hybrid, physics-driven, and data-driven automotive models. His main expertise is in building ML systems and end-to-end solutions for data products. He holds a master's degree in telematics from the Polytechnic of Turin, as well as one in software engineering of distributed systems from KTH, Stockholm. Prior to Pirelli, he worked in retail and business banking (Barclays), cyber security (Cisco), predictive marketing (AgilOne), and did some occasional freelancing.

Daniel Slater

Daniel Slater started programming at age 11, developing mods for the id Software game Quake. His obsession led him to become a developer working in the gaming industry on the hit computer game series Championship Manager. He then moved into finance, working on risk- and high-performance messaging systems. He now is a staff engineer working on big data at Skimlinks to understand online user behavior. He spends his spare time training AI to beat computer games. He talks at tech conferences about deep learning and reinforcement learning; and the name of his blog is Daniel Slater's blog. His work in this field has been cited by Google.

Peter Roelants

Peter Roelants holds a master's in computer science with a specialization in AI from KU Leuven. He works on applying deep learning to a variety of problems, such as spectral imaging, speech recognition, text understanding, and document information extraction. He currently works at Onfido as a team leader for the data extraction research team, focusing on data extraction from official documents.

Chapter 9. Anomaly Detection

In Chapter 4, Unsupervised Feature Learning, we saw the mechanisms of feature learning and in particular the use of auto-encoders as an unsupervised pre-training step for supervised learning tasks.

In this chapter, we are going to apply similar concepts, but for a different use case, anomaly detection.

A good anomaly detector depends largely on finding data representations that make deviations from the normal distribution easy to see. Deep auto-encoders work very well at learning high-level abstractions and non-linear relationships in the underlying data. We will show how deep learning is a great fit for anomaly detection.

In this chapter, we will start by explaining the differences and commonalities between outlier detection and anomaly detection. The reader will be guided through an imaginary fraud case study followed by examples showing the danger of having anomalies in real-world applications and the importance of automated and...

What is anomaly and outlier detection?


Anomaly detection, often related to outlier detection and novelty detection, is the identification of items, events, or observations that deviate considerably from an expected pattern observed in a homogeneous dataset.

Anomaly detection is about predicting the unknown.

Whenever we find a discordant observation in the data, we could call it an anomaly or an outlier. Although the two words are often used interchangeably, they actually refer to two different concepts, as Ravi Parikh describes in one of his blog posts (https://blog.heapanalytics.com/garbage-in-garbage-out-how-anomalies-can-wreck-your-data/):

"An outlier is a legitimate data point that's far away from the mean or median in a distribution. It may be unusual, like a 9.6-second 100-meter dash, but still within the realm of reality. An anomaly is an illegitimate data point that's generated by a different process than whatever generated the rest...

Real-world applications of anomaly detection


Anomalies can happen in any system. Technically, you can always find a never-seen-before event that could not be found in the system's historical data. In some contexts, detecting those observations can have a great impact, positive or negative.

In the field of law enforcement, anomaly detection could be used to reveal criminal activities (assuming you are in an area where the average person is honest enough that criminals stand out from the distribution).

In a network system, anomaly detection can help find external intrusions or suspicious user activity, for instance, an employee who is accidentally or intentionally leaking large amounts of data outside the company intranet, or a hacker opening connections on uncommon ports and/or protocols. In the specific case of Internet security, anomaly detection could be used to stop new malware from spreading by simply looking at spikes of visitors...

Anomaly detection using deep auto-encoders


The proposed approach using deep learning is semi-supervised and can be broadly explained in the following three steps:

  1. Identify a set of data that represents the normal distribution. In this context, the word "normal" refers to a set of points that we are confident mostly represent non-anomalous entities; it is not to be confused with the Gaussian normal distribution.

    The identification is generally historical: we select data for which we know that no anomalies were officially recognized. This is why this approach is not purely unsupervised. It relies on the assumption that the majority of observations are anomaly-free. We can use external information (even labels, if available) to achieve a higher quality of the selected subset.

  2. Learn what "normal" means from this training dataset. The trained model will provide a sort of metric in its mathematical definition; that is, a function mapping every point to a real number representing the distance from another point representing...
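To make the first two steps concrete, here is a minimal sketch in plain Python. It is a deliberately simplified stand-in, not the auto-encoder approach itself: instead of a deep network, "normal" is summarized by per-feature means and standard deviations, and the score of a point is its standardized Euclidean distance from that center. The data, the function names `fit_normal_profile` and `anomaly_score`, and the threshold rule are all hypothetical and chosen only for illustration.

```python
import math
import statistics


def fit_normal_profile(rows):
    """Learn per-feature mean and standard deviation from 'normal' rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, so we never divide by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def anomaly_score(point, profile):
    """Map a point to a real number: its standardized distance from the center."""
    means, stdevs = profile
    return math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(point, means, stdevs))
    )


# Step 1: hypothetical historical data that we assume to be anomaly-free.
normal_rows = [(10.0, 1.0), (12.0, 1.2), (11.0, 0.9), (9.0, 1.1), (13.0, 0.8)]

# Step 2: learn what "normal" means from this training set.
profile = fit_normal_profile(normal_rows)

# Assumption: use the largest score seen on the training data as the cut-off.
threshold = max(anomaly_score(row, profile) for row in normal_rows)

typical_score = anomaly_score((11.5, 1.0), profile)
unusual_score = anomaly_score((40.0, 5.0), profile)

print(typical_score <= threshold)  # point close to the training data
print(unusual_score > threshold)   # point far from the training data
```

With an auto-encoder, the same role would be played by the reconstruction error: the model is trained only on the "normal" subset, and the score of a point is how badly the network reconstructs it.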

H2O


Before we dive into the examples, let's spend some time justifying our decision to use H2O as our deep learning framework for anomaly detection.

H2O is not just a library or package to install. It is an open source, rich analytics platform that provides both machine learning algorithms and high-performance parallel computing abstractions.

H2O's core technology is built around a Java Virtual Machine optimized for in-memory processing of distributed data collections.

The platform is usable via a web-based UI or programmatically in many languages, such as Python, R, Java, and Scala, as well as through JSON over a REST API.

Data can be loaded from many common data sources, such as HDFS, S3, most of the popular RDBMSes, and a few other NoSQL databases.

After loading, data is represented in an H2OFrame, making it familiar to people used to working with R, Spark, and Python pandas data frames.

The backend can then be switched among different engines. It can run locally on your machine or it can be deployed in a...

Examples


The following examples are proofs of concept of how to apply auto-encoders to identify anomalies. Specific tuning and advanced design considerations are beyond the scope of this chapter. We will take some results from the literature for granted without going into too much theory, which has already been covered in previous chapters.

We recommend that the reader carefully read Chapter 4, Unsupervised Feature Learning, and the corresponding sections regarding auto-encoders.

We will use a Jupyter notebook for our examples.

Alternatively, we could have used H2O Flow (http://www.h2o.ai/product/flow/), which is a notebook-style UI for H2O pretty much like Jupyter, but we did not want to confuse the reader throughout the book.

We also assume that the reader has a basic idea of how the H2O framework, pandas, and related plotting libraries (matplotlib and seaborn) work.

In the code, we often convert an H2OFrame instance into a pandas.DataFrame so that we can use the standard plotting...

Summary


Anomaly detection is a very common problem that can be found in many applications.

At the start of this chapter, we described a few possible use cases and highlighted the major types and differences according to the context and application requirements.

We briefly covered some of the popular techniques for solving anomaly detection using shallow machine learning algorithms. The major differences can be found in the way features are generated. In shallow machine learning, this is generally a manual task, also called feature engineering. The advantage of using deep learning is that it can automatically learn smart data representations in an unsupervised fashion. Good data representations can substantially help the detection model to spot anomalies.

We have provided an overview of H2O and summarized its functionalities for deep learning, in particular the auto-encoders.

We have implemented a couple of proof-of-concept examples in order to learn how to apply auto-encoders for solving anomaly...
