You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published: Apr 2021
Publisher: Packt
ISBN-13: 9781800563452
Pages: 788
Edition: 2nd Edition
Languages: Python
Concepts: Databases
Author: Stefanie Molin


Chapter 8: Rule-Based Anomaly Detection

It's time to catch some hackers trying to gain access to a website using a brute-force attack, which means trying to log in with many username-password combinations until one works. This type of attack is very noisy, so it gives us plenty of data points for anomaly detection: the process of looking for data generated by a process other than the one we consider typical. The hackers will be simulated and won't be as crafty as they can be in real life, but the scenario will still give us good exposure to anomaly detection.

We will be creating a package that will handle the simulation of the login attempts in order to generate the data for this chapter. Knowing how to simulate is an essential skill to have in our toolbox. Sometimes, it's difficult to solve a problem with an exact mathematical solution; however, it might be easy to define how small components of the system work. In these cases, we can model the small...

Chapter materials

We will be building a simulation package to generate the data for this chapter; it is on GitHub at https://github.com/stefmolin/login-attempt-simulator/tree/2nd_edition. This package was installed from GitHub when we set up our environment back in Chapter 1, Introduction to Data Analysis; however, you can follow the instructions in Chapter 7, Financial Analysis – Bitcoin and the Stock Market, to install a version of the package that you can edit.

The repository for this chapter, which can be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_08, has the notebook we will use for our actual analysis (anomaly_detection.ipynb), the data files we will be working with in the logs/ folder, the data used for the simulation in the user_data/ folder, and the simulate.py file, which contains a Python script that we can run on the command line to simulate the data for the chapter.

Simulating login attempts

Since we can't easily find login attempt data from a breach (it's not typically shared due to its sensitive nature), we will be simulating it. Simulation requires a strong understanding of statistical modeling, estimating probabilities of certain events, and identifying appropriate assumptions to simplify where necessary. In order to run the simulation, we will build a Python package (login_attempt_simulator) to simulate a login process requiring a correct username and password (without any extra authentication measures, such as two-factor authentication) and a script (simulate.py) that can be run on the command line, both of which we will discuss in this section.
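To make the idea concrete, here is a minimal sketch of simulating a username-and-password login process. It is not the login_attempt_simulator package's code: the function names, the tiny user base, the password guesses, and the probabilities are all assumptions chosen purely for illustration.

```python
import random

# Hypothetical user base: username -> password (illustrative only)
USER_BASE = {'alice': 'pw1', 'bob': 'pw2', 'carol': 'pw3'}


def attempt_login(username, password):
    """Return True only if the username exists and the password matches."""
    return USER_BASE.get(username) == password


def simulate_valid_user(rng, username, typo_probability=0.1, max_tries=3):
    """A valid user knows their password but occasionally mistypes it."""
    results = []
    for _ in range(max_tries):
        mistyped = rng.random() < typo_probability
        password = 'typo' if mistyped else USER_BASE[username]
        success = attempt_login(username, password)
        results.append(success)
        if success:
            break  # stop retrying once logged in
    return results


def simulate_attacker(rng, guesses=50):
    """An attacker tries many usernames with guessed passwords."""
    candidates = list(USER_BASE) + ['admin', 'root', 'test']
    results = []
    for _ in range(guesses):
        username = rng.choice(candidates)
        password = rng.choice(['123456', 'password', 'pw1'])
        results.append(attempt_login(username, password))
    return results


rng = random.Random(0)  # seeded so the run is reproducible
valid_results = simulate_valid_user(rng, 'alice')
attack_results = simulate_attacker(rng)
print(f'valid user: {len(valid_results)} attempt(s), '
      f'success={valid_results[-1]}')
print(f'attacker: {len(attack_results)} attempts, '
      f'{sum(attack_results)} succeeded')
```

Even in this toy version, the two behaviors leave different footprints: the valid user makes a handful of attempts with one username, while the attacker makes many attempts across many usernames.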

Assumptions

Before we jump into the code that handles the simulation, we need to understand the assumptions. It is impossible to control for every possible variable when we make a simulation, so we must identify some simplifying assumptions to get started.

The simulator makes the...

Exploratory data analysis

In this scenario, we have the benefit of access to labeled data (logs/attacks.csv) and will use it to investigate how to distinguish between valid users and attackers. However, this is a luxury that we often don't have, especially once we leave the research phase and enter the application phase. In Chapter 11, Machine Learning Anomaly Detection, we will revisit this scenario, but begin without the labeled data for more of a challenge. As usual, we start with our imports and reading in the data:

>>> %matplotlib inline
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> import seaborn as sns
>>> log = pd.read_csv(
...     'logs/log.csv', index_col='datetime', parse_dates=True
... )
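
As a quick illustration of what index_col='datetime' and parse_dates=True do, the sketch below reads a tiny in-memory CSV rather than the chapter's log file. The rows, as well as the username and success columns, are made up for the example.

```python
import io

import pandas as pd

# Made-up rows standing in for a login log (not the chapter's data)
csv_text = """datetime,source_ip,username,success
2018-11-01 00:05:32,192.0.2.10,alice,True
2018-11-01 00:07:10,192.0.2.10,alice,True
2018-11-01 01:15:00,198.51.100.7,admin,False
"""

toy_log = pd.read_csv(
    io.StringIO(csv_text), index_col='datetime', parse_dates=True
)

# The index is parsed into a DatetimeIndex, so we can select rows
# with a partial date string (here, everything in the 00:00 hour)
print(type(toy_log.index).__name__)
print(toy_log.loc['2018-11-01 00'])
```

Having the timestamps in the index is what makes time-based selection and grouping convenient later on.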

The login attempts dataframe (log) contains the date and time of each attempt in the datetime column, the IP address it came from (source_ip...

Implementing rule-based anomaly detection

It's time to catch those hackers. After the EDA in the previous section, we have an idea of how we might go about this. In practice, this is much more difficult to do, as it involves many more dimensions, but we have simplified it here. We want to find the IP addresses with an excessive number of attempts accompanied by low success rates, as well as those attempting to log in with more unique usernames than we would deem normal; these are our anomalies. To do this, we will employ threshold-based rules as our first foray into anomaly detection; then, in Chapter 11, Machine Learning Anomaly Detection, we will explore a few machine learning techniques as we revisit this scenario.
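
A threshold-based rule of this kind can be sketched as follows. The per-IP summary table, its column names, and the cutoff values are all assumptions for illustration; they are not the thresholds used in the chapter.

```python
import pandas as pd

# Hypothetical summary per IP address (made-up numbers)
summary = pd.DataFrame({
    'source_ip': ['192.0.2.10', '198.51.100.7', '203.0.113.5'],
    'attempts': [3, 120, 40],
    'successes': [3, 4, 35],
    'usernames': [1, 45, 30],
}).set_index('source_ip')

summary['success_rate'] = summary['successes'] / summary['attempts']

# Illustrative cutoffs; in practice these would be derived from the data
MAX_ATTEMPTS = 50
MIN_SUCCESS_RATE = 0.5
MAX_USERNAMES = 10

# Rule 1: many attempts combined with a low success rate
too_many_failed_attempts = (
    (summary['attempts'] > MAX_ATTEMPTS)
    & (summary['success_rate'] < MIN_SUCCESS_RATE)
)
# Rule 2: more distinct usernames than we consider normal
too_many_usernames = summary['usernames'] > MAX_USERNAMES

# An IP address is flagged if either rule fires
flagged = summary[too_many_failed_attempts | too_many_usernames]
print(flagged.index.tolist())
```

Combining the rules with a logical OR means a single suspicious signal is enough to flag an IP address; requiring both (AND) would trade fewer false alarms for more missed attackers.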

Since we are interested in flagging IP addresses that are suspicious, we are going to arrange the data so that we have hourly aggregated data per IP address (if there was activity for that hour):

>>> hourly_ip_logs = log.assign(
...     failures...
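
One way to arrange a log into hourly aggregates per IP address is sketched below on a toy dataframe. This is not necessarily the chapter's code: the toy rows, the username and success columns, and the names of the output columns are assumptions.

```python
import pandas as pd

# Toy login log indexed by datetime (made-up rows)
toy_log = pd.DataFrame(
    {
        'source_ip': ['192.0.2.10', '192.0.2.10', '198.51.100.7',
                      '198.51.100.7', '198.51.100.7', '192.0.2.10'],
        'username': ['alice', 'alice', 'admin', 'root', 'test', 'alice'],
        'success': [False, True, False, False, False, True],
    },
    index=pd.to_datetime([
        '2018-11-01 00:05', '2018-11-01 00:06', '2018-11-01 00:30',
        '2018-11-01 00:31', '2018-11-01 00:32', '2018-11-01 02:10',
    ]).rename('datetime'),
)

hourly = (
    toy_log
    .assign(failure=lambda df: ~df['success'])
    # Group by IP address and by the hour each attempt falls in;
    # pd.Grouper with no key bins on the datetime index
    .groupby(['source_ip', pd.Grouper(freq='1h')])
    .agg(
        usernames=('username', 'nunique'),
        attempts=('username', 'count'),
        successes=('success', 'sum'),
        failures=('failure', 'sum'),
    )
    .assign(failure_rate=lambda df: df['failures'] / df['attempts'])
)
print(hourly)
```

Only the hours in which an IP address was active appear in the result, which matches the "if there was activity for that hour" arrangement described above.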

Summary

In our second application chapter, we learned how to simulate events in Python and got additional exposure to writing packages. We also saw how to write Python scripts that can be run from the command line, which we used to run our simulation of the login attempt data. Then, we performed some EDA on the simulated data to see whether we could figure out what would make hacker activity easy to spot.

This led us to zero in on the number of distinct usernames attempting to authenticate per IP address per hour, as well as the number of attempts and failure rates. Using these metrics, we were able to create a scatter plot, which appeared to show two distinct groups of points, along with some other points connecting the two groups; naturally, these represented the groups of valid users and the nefarious ones, with some of the hackers not being as obvious as others.

Finally, we set about creating rules that would flag the hacker IP addresses for their suspicious activity. First...

Exercises

Complete the following exercises to practice the concepts covered in this chapter:

1. Run the simulation for December 2018 into new log files without making the user base again. Be sure to run python3 simulate.py -h to review the command-line arguments. Set the seed to 27. This data will be used for the remaining exercises.
2. Find the number of unique usernames, attempts, successes, and failures, as well as the success/failure rates per IP address, using the data simulated from exercise 1.
3. Create two subplots with failures versus attempts on the left, and failure rate versus distinct usernames on the right. Draw decision boundaries for the resulting plots. Be sure to color each data point by whether or not it is a hacker IP address.
4. Build a rule-based criterion using the percentage difference from the median that flags an IP address if the failures and attempts are both five times their respective medians, or if the distinct usernames count is five times its...