Chapter 11: Machine Learning Anomaly Detection

For our final application chapter, we will be revisiting anomaly detection on login attempts. Let's imagine we work for a company that launched its web application at the beginning of 2018. This web application has been collecting log events for all login attempts since it launched. We know the IP address that the attempt was made from, the result of the attempt, when it was made, and which username was entered. What we don't know is whether the attempt was made by one of our valid users or a nefarious party.

Our company has been expanding and, since data breaches seem to be in the news every day, has created an information security department to monitor the traffic. The CEO saw our rule-based approach to identifying hackers from Chapter 8, Rule-Based Anomaly Detection, and was intrigued by our initiative, but wants us to move beyond using rules and thresholds for such a vital task. We have been tasked with developing a machine...

Chapter materials

The materials for this chapter can be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_11. In this chapter, we will be revisiting attempted login data; however, the simulate.py script has been updated to allow additional command-line arguments. We won't be running the simulation this time, but be sure to take a look at the script and check out the process that was followed to generate the data files and create the database for this chapter in the 0-simulating_the_data.ipynb notebook. The user_data/ directory contains the files used for this simulation, but we won't be using them directly in this chapter.

The simulated log data we will be using for this chapter can be found in the logs/ directory. The logs_2018.csv and hackers_2018.csv files are logs of login attempts and a record of hacker activity from all 2018 simulations, respectively. Files with the hackers prefix are treated as the labeled data we...

Exploring the simulated login attempts data

We don't have labeled data yet, but we can still examine the data to see whether there is something that stands out. This data is different from the data in Chapter 8, Rule-Based Anomaly Detection. The hackers are smarter in this simulation—they don't always try as many users or stick with the same IP address every time. Let's see whether we can come up with some features that will help with anomaly detection by performing some EDA in the 1-EDA_unlabeled_data.ipynb notebook.

As usual, we begin with our imports. These will be the same for all notebooks, so they are reproduced in this section only:

>>> %matplotlib inline
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> import seaborn as sns

Next, we read in the 2018 logs from the logs table in the SQLite database:

>>> import sqlite3
>>> with sqlite3.connect...
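
A minimal sketch of such a read, assuming the logs table has a datetime column we can parse and index by (the exact query in the notebook may differ), could look like this:

>>> # illustrative sketch only -- the column name and date filter are assumptions
>>> with sqlite3.connect('logs/logs.db') as conn:
...     logs_2018 = pd.read_sql(
...         'SELECT * FROM logs WHERE datetime BETWEEN ? AND ?;',
...         conn, params=('2018-01-01', '2019-01-01'),
...         parse_dates=['datetime'], index_col='datetime'
...     )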

Utilizing unsupervised methods of anomaly detection

If the hackers are conspicuous and distinct from our valid users, unsupervised methods may prove pretty effective. This is a good place to start before we have labeled data, or if the labeled data is difficult to gather or not guaranteed to be representative of the full spectrum we are looking to flag. Note that, in most cases, we won't have labeled data, so it is crucial that we are familiar with some unsupervised methods.

In our initial EDA, we identified the number of usernames with a failed login attempt in a given minute as a feature for anomaly detection. We will now test out some unsupervised anomaly detection algorithms, using this feature as the jumping-off point. Scikit-learn provides a few such algorithms. In this section, we will look at isolation forest and local outlier factor (LOF); a third method, using a one-class support vector machine (SVM), is in the Exercises section.
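
As a quick orientation, here is a rough sketch of how these two estimators are typically set up in scikit-learn (the feature matrix X is a placeholder, and the scaling step is an assumption rather than the chapter's exact code). Both return 1 for inliers and -1 for outliers:

>>> # rough sketch -- X is a placeholder for the minutely feature matrix
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.neighbors import LocalOutlierFactor
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> iso_forest = make_pipeline(
...     StandardScaler(), IsolationForest(random_state=0)
... ).fit(X)
>>> iso_preds = iso_forest.predict(X)  # 1 = inlier, -1 = outlier
>>> lof_preds = make_pipeline(
...     StandardScaler(), LocalOutlierFactor()
... ).fit_predict(X)  # LOF has no separate predict() unless novelty=True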

Before we can try out these methods,...

Implementing supervised anomaly detection

The SOC (security operations center) has finished labeling the 2018 data, so we should revisit our EDA to make sure our plan of looking at the number of usernames with failures at a minute resolution does separate the data. This EDA is in the 3-EDA_labeled_data.ipynb notebook. After some data wrangling, we are able to create the following scatter plot, which shows that this strategy does indeed appear to separate the suspicious activity:

Figure 11.12 – Confirming that our features can help form a decision boundary
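
As a hypothetical sketch of how such a plot could be built (labeled_data and its column names are placeholders, not necessarily those used in the notebook):

>>> # hypothetical sketch -- the dataframe and column names are assumptions
>>> sns.scatterplot(
...     data=labeled_data, x='usernames_with_failures', y='failures',
...     hue='flag', alpha=0.25
... )
>>> plt.title('Can we separate the suspicious activity?')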

In the 4-supervised_anomaly_detection.ipynb notebook, we will create some supervised models. This time we need to read in all the labeled data for 2018. Note that the code for reading in the logs is omitted since it is the same as in the previous section:

>>> with sqlite3.connect('logs/logs.db') as conn:
...     hackers_2018 = pd.read_sql(
...      ...
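
For reference, a supervised baseline could be set up along the following lines; this is a hedged sketch in which X, y, and the choice of logistic regression are illustrative placeholders rather than the chapter's exact models:

>>> # illustrative baseline only -- X and y are placeholder features/labels
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.25, random_state=0, stratify=y
... )
>>> lr_pipeline = make_pipeline(
...     StandardScaler(), LogisticRegression(random_state=0)
... ).fit(X_train, y_train)
>>> lr_pipeline.score(X_test, y_test)  # accuracy on the held-out data

Since attacks are rare, the classes are heavily imbalanced, so accuracy alone can be misleading; precision and recall on the attacker class are more informative metrics here.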

Incorporating a feedback loop with online learning

There are some big issues with the models we have built so far. Unlike the data we worked with in Chapter 9, Getting Started with Machine Learning in Python, and Chapter 10, Making Better Predictions – Optimizing Models, we wouldn't expect attacker behavior to remain static over time. There is also a limit to how much data we can hold in memory, which constrains how much data we can train our models on. Therefore, we will now build an online learning model to flag anomalies in usernames with failures per minute. An online learning model is constantly being updated (in near real time via streaming, or in batches). This allows us to learn from new data as it arrives and then discard it (to free up memory).
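
One common way to implement this batch-update pattern in scikit-learn is with partial_fit(); the sketch below assumes the data arrives in batches (monthly_batches is a hypothetical iterator, and SGDClassifier is just one estimator that supports incremental learning, not necessarily the model used in the notebook):

>>> # sketch of incremental (online) learning -- monthly_batches is hypothetical
>>> from sklearn.linear_model import SGDClassifier
>>> online_model = SGDClassifier(random_state=0)
>>> for X_batch, y_batch in monthly_batches:
...     # all possible classes must be supplied on the first call
...     online_model.partial_fit(X_batch, y_batch, classes=[False, True])
...     # once the model has been updated, the batch can be discarded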

In addition, the model can evolve over time and adapt to changes in the underlying distribution of the data. We will also be providing our model with feedback as it learns so that we are able to make sure it stays...

Summary

In practice, detecting attackers isn't easy. Real-life hackers are much savvier than the ones in this simulation. Attacks are also much less frequent, creating a huge class imbalance. Building machine learning models that will catch everything just isn't possible. That is why it is so vital that we work with those who have domain knowledge; they can help us squeeze some extra performance out of our models by really understanding the data and its peculiarities. No matter how experienced we become with machine learning, we should never turn down help from someone who often works with the data in question.

Our initial attempts at anomaly detection were unsupervised while we waited for the labeled data from our subject matter experts. We tried LOF and isolation forest using scikit-learn. Once we received the labeled data and performance requirements from our stakeholders, we determined that the isolation forest model was better for our data.

However, we didn't...

Exercises

Complete the following exercises for some practice with the machine learning workflow and exposure to some additional anomaly detection strategies:

  1. A one-class SVM is another model that can be used for unsupervised outlier detection. Build a one-class SVM with the default parameters, using a pipeline with a StandardScaler object followed by a OneClassSVM object. Train the model on the January 2018 data, just as we did for the isolation forest. Make predictions on that same data. Count the number of inliers and outliers this model identifies (a skeleton for the pipeline setup is sketched after this list).
  2. Using the 2018 minutely data, build a k-means model with two clusters after standardizing the data with a StandardScaler object. With the labeled data in the attacks table in the SQLite database (logs/logs.db), see whether this model gets a good Fowlkes-Mallows score (use the fowlkes_mallows_score() function in sklearn.metrics).
  3. Evaluate the performance of a random forest classifier for supervised anomaly detection. Set...
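
For exercise 1, the pipeline setup might start out as follows; this is a skeleton only, where jan_2018 is a placeholder for the January 2018 feature matrix and counting the inliers and outliers is left as part of the exercise:

>>> # skeleton for exercise 1 -- jan_2018 is a placeholder feature matrix
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import OneClassSVM
>>> one_class_svm = make_pipeline(
...     StandardScaler(), OneClassSVM()
... ).fit(jan_2018)
>>> preds = one_class_svm.predict(jan_2018)  # 1 = inlier, -1 = outlier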

Further reading

Check out the following resources for more information on the topics covered in this chapter:
