You're reading from Python Deep Learning

Product type Book

Published in Apr 2017

Publisher Packt

ISBN-13 9781786464453

Pages 406 pages

Edition 1st Edition

Languages

Python

Concepts

Deep Learning

Authors (4):

Valentino Zocca

Gianmario Spacagna

Daniel Slater

Peter Roelants

Table of Contents (18) Chapters

Python Deep Learning

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Machine Learning – An Introduction

Neural Networks

Deep Learning Fundamentals

Unsupervised Feature Learning

Image Recognition

Recurrent Neural Networks and Language Models

Deep Learning for Board Games

Deep Learning for Computer Games

Anomaly Detection

Building a Production-Ready Intrusion Detection System

Index

Chapter 10. Building a Production-Ready Intrusion Detection System

In the previous chapter, we explained in detail what anomaly detection is and how it can be implemented using auto-encoders. We proposed a semi-supervised approach for novelty detection. We introduced H2O and showed a couple of examples (MNIST digit recognition and ECG pulse signals) implemented on top of the framework and running in local mode. Those examples used small datasets that had already been cleaned and prepared to serve as a proof of concept.

Real-world data and enterprise environments work very differently. In this chapter, we will leverage H2O and general common practices to build a scalable distributed system ready for deployment in production.

As an example, we will use an intrusion detection system whose goal is to detect intrusions and attacks in a network environment.

We will raise a few practical and technical issues that you are likely to face when building a data product for intrusion detection.

In particular, you...

What is a data product?

The final goal in data science is to solve problems by adopting data-intensive solutions. The focus is not only on answering questions but also on satisfying business requirements.

Just building data-driven solutions is not enough. Nowadays, any app or website is powered by data. Building a web platform for listing items on sale does consume data but is not necessarily a data product.

Mike Loukides gives an excellent definition:

A data application acquires its value from the data itself, and creates more data as a result; it's not just an application with data; it's a data product. Data science enables the creation of data products.
From "What is Data Science" (https://www.oreilly.com/ideas/what-is-data-science)

The fundamental requirement is that the system derives value from data, rather than just consuming it as it is, and generates knowledge (in the form of data or insights) as output. A data product is the automation that lets you extract information from raw data,...

Training

Training a network presupposes that its topology has already been designed. For design guidelines according to the type of input data and the expected use cases, we recommend the Auto-Encoder section in Chapter 4, Unsupervised Feature Learning.

Once we have defined the topology of the neural network, we are just at the starting point. The model now needs to be fitted during the training phase. We will see a few techniques for scaling and accelerating the learning of our training algorithm that are very suitable for production environments with large datasets.

Weights initialization

The final convergence of neural networks can be strongly influenced by the initial weights. Depending on which activation function we have selected, we would like to have a gradient with a steep slope in the first iterations so that the gradient descent algorithm can quickly jump into the optimum area.

For a hidden unit j in the first layer (directly connected to the input layer), the sum of...
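As a minimal sketch of why the initial range matters, the following draws weights from a uniform distribution whose width shrinks as the number of connected units grows. This keeps the weighted sum entering each unit small enough to avoid the early saturation mentioned in the chapter summary. The specific bound, sqrt(6 / (n_in + n_out)) (commonly known as Glorot or Xavier uniform initialization), is an assumption chosen for illustration, since the excerpt is cut off before it names a scheme. The function name glorot_uniform_init is hypothetical.

```python
import math
import random


def glorot_uniform_init(n_in, n_out, seed=None):
    """Return (weights, limit) for a layer with n_in inputs and n_out units.

    Weights are drawn from U(-limit, limit) with
    limit = sqrt(6 / (n_in + n_out)), so each weight has variance
    limit**2 / 3 = 2 / (n_in + n_out).
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [
        [rng.uniform(-limit, limit) for _ in range(n_out)]
        for _ in range(n_in)
    ]
    return weights, limit


# Illustrative layer sizes (assumed, not taken from the chapter).
weights, limit = glorot_uniform_init(n_in=100, n_out=50, seed=0)
print(len(weights), len(weights[0]), limit)
```

Wider layers get a narrower range, so the sum over many inputs does not push a saturating activation into its flat region, where gradients are close to zero.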

Testing

Before we discuss what testing means in data science, let's summarize a few concepts.

Firstly and in general, what is a model in science? We can cite the following definitions:

In science, a model is a representation of an idea, an object or even a process or a system that is used to describe and explain phenomena that cannot be experienced directly.
Scientific Modelling, Science Learning Hub, http://sciencelearn.org.nz/Contexts/The-Noisy-Reef/Science-Ideas-and-Concepts/Scientific-modelling

And this:

A scientific model is a conceptual, mathematical or physical representation of a real-world phenomenon. A model is generally constructed for an object or process when it is at least partially understood, but difficult to observe directly. Examples include sticks and balls representing molecules, mathematical models of planetary movements or conceptual principles like the ideal gas law. Because of the infinite variations actually found in nature, all but the simplest and most vague models...

Model validation

The goal of model validation is to evaluate whether the numerical results quantifying the hypothesized estimations/predictions of the trained model are acceptable descriptions of an independent dataset. An independent dataset is needed because any measure computed on the training set would be biased and optimistic, since the model has already seen those observations. If we don't have a different dataset for validation, we can hold one fold of the data out from training and use it as a benchmark. Another common technique is cross-fold validation, and its stratified version, where the whole historical dataset is split into multiple folds. For simplicity, we will discuss the hold-one-out method; the same criteria also apply to cross-fold validation.

The split into training and validation sets cannot be purely random. The validation set should represent the future hypothetical scenario in which we will use the model for scoring. It is important not to contaminate the validation set with information...
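One way to make the validation set resemble the future scoring scenario is to split by time rather than at random: train on everything before a cutoff and validate on everything from the cutoff onward. The sketch below assumes each record carries a timestamp. The record layout, the field names, and the function time_based_split are all hypothetical.

```python
from datetime import datetime


def time_based_split(records, cutoff, time_key="timestamp"):
    """Split records into (train, validation) around a cutoff time.

    Records strictly before the cutoff go to training; records at or
    after it go to validation, so the model never sees the "future".
    """
    train = [r for r in records if r[time_key] < cutoff]
    valid = [r for r in records if r[time_key] >= cutoff]
    return train, valid


# Hypothetical connection records, one per day for ten days.
records = [
    {"timestamp": datetime(2017, 1, day), "bytes": 100 * day}
    for day in range(1, 11)
]

train, valid = time_based_split(records, cutoff=datetime(2017, 1, 8))
print(len(train), len(valid))
```

With a random split, observations from the same period would land on both sides and the validation score would look better than what the model can achieve on genuinely new traffic.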

Hyper-parameters tuning

Following the design of our deep neural network according to the previous sections, we end up with a number of parameters to tune. Some of them have default or recommended values and do not require expensive fine-tuning. Others strongly depend on the underlying data, the specific application domain, and a set of other components. Thus, the only way to find the best values is to perform model selection, validating each candidate on the desired metric computed on the validation data fold.
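The model selection loop described above can be sketched as an exhaustive grid search: train one candidate per parameter combination, score each on the validation fold, and keep the best. This is a framework-independent illustration, not H2O's API. The functions grid_search, train_fn, and score_fn, the toy model, and the parameter names shrink and offset are all hypothetical. A lower score is assumed to be better.

```python
from itertools import product


def grid_search(train_fn, score_fn, grid, train_data, valid_data):
    """Return (best_params, best_score) over all combinations in grid.

    grid maps each parameter name to a list of candidate values.
    score_fn returns a validation metric where lower is better.
    """
    best_params, best_score = None, float("inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train_fn(train_data, **params)
        score = score_fn(model, valid_data)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for a real learner: predicts shrink * mean(train) + offset.
def train_fn(data, shrink, offset):
    return shrink * (sum(data) / len(data)) + offset


# Mean squared error of the constant prediction on the validation fold.
def score_fn(prediction, data):
    return sum((x - prediction) ** 2 for x in data) / len(data)


best_params, best_score = grid_search(
    train_fn,
    score_fn,
    grid={"shrink": [0.0, 0.5, 1.0], "offset": [0.0, 0.25]},
    train_data=[1.0, 2.0, 3.0],
    valid_data=[1.25, 1.25],
)
print(best_params, best_score)
```

The cost grows with the product of the candidate list sizes, which is why parameters with reliable defaults are usually left out of the grid.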

Now we will list a table of parameters that we might want to consider tuning. Please consider that each library or framework may have additional parameters and a custom way of setting them. This table is derived from the available tuning options in H2O. It summarizes the common parameters, but not all of them, when building a deep auto-encoder network in production:

End-to-end evaluation

From a business point of view, what really matters is the final end-to-end performance. None of your stakeholders will be interested in your training error, parameter tuning, model selection, and so on. What matters are the KPIs computed on top of the final model. Evaluation can be seen as the ultimate verdict.

Also, as we anticipated, evaluating a product cannot be done with a single metric. Generally, it is good practice to build an internal dashboard that reports, or measures in real time, a set of performance indicators of our product in the form of aggregated numbers or easy-to-interpret charts. At a single glance, we would like to understand the whole picture and translate it into the value we are generating for the business.
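As an illustration of reporting several indicators at once, the sketch below computes precision, recall, and false positive rate for a detector from labeled outcomes. The choice of these three indicators is an assumption made for the example (the excerpt does not list the KPIs used in the chapter), and the function detection_kpis and the sample labels are hypothetical.

```python
def detection_kpis(y_true, y_pred):
    """Compute a few indicators from binary labels (1 = attack, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        # Share of raised alerts that were real attacks.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real attacks that raised an alert.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of normal traffic that was wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical outcomes for ten connections.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

kpis = detection_kpis(y_true, y_pred)
print(kpis)
```

Looking at the indicators together shows trade-offs that a single number hides: a detector can reach high recall simply by flagging everything, which the false positive rate exposes immediately.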

The evaluation phase can, and generally does, include the same methodology as the model validation. We have seen in previous sections a few techniques for validating in case of labeled and unlabeled data...

Deployment

At this stage, we should have done almost all of the analysis and development needed for building an anomaly detector, or in general a data product using deep learning.

We are left with only the final, but no less important, step: deployment.

Deployment is generally very specific to the use case and the enterprise infrastructure. In this section, we will cover some common approaches used in general data science production systems.

POJO model export

In the Testing section, we summarized all the different entities in a machine learning pipeline. In particular, we have seen the definitions of, and the differences between, a model, a fitted model, and the learning algorithm. After we have trained, validated, and selected the final model, we have a final fitted version of it ready to be used. During the testing phase (except in A/B testing), we have scored only historical data that was generally already available on the machines where we trained the model.

In enterprise architectures, it is common to have...

Summary

In this chapter, we went through a long journey of optimizations, tweaks, testing strategies, and engineering practices to turn our neural network into an intrusion detection data product.

In particular, we defined a data product as a system that extracts value from raw data and returns actionable knowledge as output.

We saw a few optimizations that make training a deep neural network faster, more scalable, and more robust. We addressed the problem of early saturation via weights initialization, and scalability using both a parallel multi-threaded version of SGD and a distributed implementation in Map/Reduce. We saw how the H2O framework can leverage Apache Spark as the backend for computation via Sparkling Water.

We remarked on the importance of testing and the difference between model validation and full end-to-end evaluation. Model validation is used to reject or accept a given model, or to select the best performing one. Likewise, model validation metrics can be used for hyper-parameter tuning...

Authors (4)

Valentino Zocca

Valentino Zocca has a PhD degree and graduated with a Laurea in mathematics from the University of Maryland, USA, and University of Rome, respectively, and spent a semester at the University of Warwick. He started working on high-tech projects of an advanced stereo 3D Earth visualization software with head tracking at Autometric, a company later bought by Boeing. There he developed many mathematical algorithms and predictive models, and using Hadoop he automated several satellite-imagery visualization programs. He has worked as an independent consultant at the U.S. Census Bureau, in the USA and in Italy. Currently, Valentino lives in New York and works as an independent consultant to a large financial company.

Gianmario Spacagna

Gianmario Spacagna is a senior data scientist at Pirelli, processing sensors and telemetry data for internet of things (IoT) and connected-vehicle applications. He works closely with tire mechanics, engineers, and business units to analyze and formulate hybrid, physics-driven, and data-driven automotive models. His main expertise is in building ML systems and end-to-end solutions for data products. He holds a master's degree in telematics from the Polytechnic of Turin, as well as one in software engineering of distributed systems from KTH, Stockholm. Prior to Pirelli, he worked in retail and business banking (Barclays), cyber security (Cisco), predictive marketing (AgilOne), and did some occasional freelancing.

Daniel Slater

Daniel Slater started programming at age 11, developing mods for the id Software game Quake. His obsession led him to become a developer working in the gaming industry on the hit computer game series Championship Manager. He then moved into finance, working on risk- and high-performance messaging systems. He now is a staff engineer working on big data at Skimlinks to understand online user behavior. He spends his spare time training AI to beat computer games. He talks at tech conferences about deep learning and reinforcement learning; and the name of his blog is Daniel Slater's blog. His work in this field has been cited by Google.

Peter Roelants

Peter Roelants holds a master's in computer science with a specialization in AI from KU Leuven. He works on applying deep learning to a variety of problems, such as spectral imaging, speech recognition, text understanding, and document information extraction. He currently works at Onfido as a team leader for the data extraction research team, focusing on data extraction from official documents.
