Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published: Apr 2021
Publisher: Packt
ISBN-13: 9781800563452
Pages: 788
Edition: 2nd Edition
Author: Stefanie Molin

Table of Contents

Preface
Section 1: Getting Started with Pandas
Chapter 1: Introduction to Data Analysis
Chapter 2: Working with Pandas DataFrames
Section 2: Using Pandas for Data Analysis
Chapter 3: Data Wrangling with Pandas
Chapter 4: Aggregating Pandas DataFrames
Chapter 5: Visualizing Data with Pandas and Matplotlib
Chapter 6: Plotting with Seaborn and Customization Techniques
Section 3: Applications – Real-World Analyses Using Pandas
Chapter 7: Financial Analysis – Bitcoin and the Stock Market
Chapter 8: Rule-Based Anomaly Detection
Section 4: Introduction to Machine Learning with Scikit-Learn
Chapter 9: Getting Started with Machine Learning in Python
Chapter 10: Making Better Predictions – Optimizing Models
Chapter 11: Machine Learning Anomaly Detection
Section 5: Additional Resources
Chapter 12: The Road Ahead
Solutions
Other Books You May Enjoy
Appendix

Chapter 8: Rule-Based Anomaly Detection

It's time to catch some hackers trying to gain access to a website using a brute-force attack: trying to log in with many username-password combinations until one works. This type of attack is very noisy, so it gives us plenty of data points for anomaly detection, which is the process of looking for data generated from a process other than the one we deem to be typical activity. The hackers will be simulated and won't be as crafty as they can be in real life, but the exercise will still give us great exposure to anomaly detection.

We will be creating a package that will handle the simulation of the login attempts in order to generate the data for this chapter. Knowing how to simulate is an essential skill to have in our toolbox. Sometimes, it's difficult to solve a problem with an exact mathematical solution; however, it might be easy to define how small components of the system work. In these cases, we can model the small...

Chapter materials

We will be building a simulation package to generate the data for this chapter; it is on GitHub at https://github.com/stefmolin/login-attempt-simulator/tree/2nd_edition. This package was installed from GitHub when we set up our environment back in Chapter 1, Introduction to Data Analysis; however, you can follow the instructions in Chapter 7, Financial Analysis – Bitcoin and the Stock Market, to install a version of the package that you can edit.

The repository for this chapter, which can be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_08, has the notebook we will use for our actual analysis (anomaly_detection.ipynb), the data files we will be working with in the logs/ folder, the data used for the simulation in the user_data/ folder, and the simulate.py file, which contains a Python script that we can run on the command line to simulate the data for the chapter.

Simulating login attempts

Since we can't easily find login attempt data from a breach (it's not typically shared due to its sensitive nature), we will be simulating it. Simulation requires a strong understanding of statistical modeling, estimating probabilities of certain events, and identifying appropriate assumptions to simplify where necessary. In order to run the simulation, we will build a Python package (login_attempt_simulator) to simulate a login process requiring a correct username and password (without any extra authentication measures, such as two-factor authentication) and a script (simulate.py) that can be run on the command line, both of which we will discuss in this section.
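
To make the idea of simulating a login process concrete, here is a minimal sketch that uses only the standard library. It is not the login_attempt_simulator package or its API; the simulate_logins() function, the usernames, and the success probabilities are all hypothetical values chosen for illustration:

```python
import random


def simulate_logins(username_pool, n_attempts, p_success, seed=None):
    """Hypothetical helper: simulate login attempts coming from one source.

    Each attempt picks a username at random from username_pool and
    succeeds with probability p_success.
    """
    rng = random.Random(seed)  # seeded generator so runs are reproducible
    return [
        {
            'username': rng.choice(username_pool),
            'success': rng.random() < p_success,
        }
        for _ in range(n_attempts)
    ]


# Assumption: a valid user knows their own credentials and rarely mistypes them
valid_user = simulate_logins(['alice'], n_attempts=5, p_success=0.95, seed=27)

# Assumption: an attacker cycles through many usernames and almost always fails
attacker = simulate_logins(
    ['alice', 'bob', 'admin', 'root'], n_attempts=50, p_success=0.01, seed=27
)
```

Even this toy version shows why assumptions matter: the two kinds of actor differ only in the parameters we choose for them, so the realism of the simulated data depends entirely on how well those parameters reflect actual behavior.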

Assumptions

Before we jump into the code that handles the simulation, we need to understand the assumptions. It is impossible to control for every possible variable when we make a simulation, so we must identify some simplifying assumptions to get started.

The simulator makes the...

Exploratory data analysis

In this scenario, we have the benefit of access to labeled data (logs/attacks.csv) and will use it to investigate how to distinguish between valid users and attackers. However, this is a luxury that we often don't have, especially once we leave the research phase and enter the application phase. In Chapter 11, Machine Learning Anomaly Detection, we will revisit this scenario, but begin without the labeled data for more of a challenge. As usual, we start with our imports and reading in the data:

>>> %matplotlib inline
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> import seaborn as sns
>>> log = pd.read_csv(
...     'logs/log.csv', index_col='datetime', parse_dates=True
... )

The login attempts dataframe (log) contains the date and time of each attempt in the datetime column, the IP address it came from (source_ip...

Implementing rule-based anomaly detection

It's time to catch those hackers. After the EDA in the previous section, we have an idea of how we might go about this. In practice, this is much more difficult to do, as it involves many more dimensions, but we have simplified it here. We want to find the IP addresses with excessive amounts of attempts accompanied by low success rates, and those attempting to log in with more unique usernames than we would deem normal (anomalies). To do this, we will employ threshold-based rules as our first foray into anomaly detection; then, in Chapter 11, Machine Learning Anomaly Detection, we will explore a few machine learning techniques as we revisit this scenario.
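
As a rough illustration of what a threshold-based rule can look like, the following sketch flags IP addresses in a small hand-made summary table. The data, the column names other than source_ip, and the three cutoff values are assumptions made up for this example, not values taken from the chapter:

```python
import pandas as pd

# Hypothetical per-IP summary for a single hour (made-up values)
summary = pd.DataFrame({
    'source_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4'],
    'attempts': [3, 120, 5, 40],
    'success': [3, 2, 4, 1],
    'usernames': [1, 60, 2, 35],  # distinct usernames tried
})
summary['success_rate'] = summary['success'] / summary['attempts']

# Assumed cutoffs; in practice these would come from studying the data
MAX_ATTEMPTS = 30
MIN_SUCCESS_RATE = 0.5
MAX_USERNAMES = 10

# Rule 1: excessive attempts accompanied by a low success rate
too_many_failures = (
    (summary['attempts'] > MAX_ATTEMPTS)
    & (summary['success_rate'] < MIN_SUCCESS_RATE)
)
# Rule 2: more distinct usernames than we would deem normal
too_many_usernames = summary['usernames'] > MAX_USERNAMES

flagged = summary.loc[
    too_many_failures | too_many_usernames, 'source_ip'
].tolist()
print(flagged)  # ['10.0.0.2', '10.0.0.4']
```

Keeping each rule as its own Boolean mask makes it easy to see which condition caused an IP address to be flagged and to adjust one cutoff without touching the other.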

Since we are interested in flagging IP addresses that are suspicious, we are going to arrange the data so that we have hourly aggregated data per IP address (if there was activity for that hour):

>>> hourly_ip_logs = log.assign(
...     failures...
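
The listing above is cut off, so the following is not a reconstruction of it. It is a separate, self-contained sketch of one way to build hourly aggregates per IP address, run on a toy log. The toy data, the username and success columns, and the choice of pd.Grouper with named aggregation are all assumptions made for illustration:

```python
import pandas as pd

# Toy log indexed by datetime (made-up rows; column names are assumptions)
toy_log = pd.DataFrame(
    {
        'source_ip': ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.2', '10.0.0.2'],
        'username': ['alice', 'alice', 'admin', 'root', 'bob'],
        'success': [True, True, False, False, False],
    },
    index=pd.DatetimeIndex(
        [
            '2018-11-01 00:05', '2018-11-01 01:10',
            '2018-11-01 00:15', '2018-11-01 00:20', '2018-11-01 00:45',
        ],
        name='datetime',
    ),
)

hourly = (
    toy_log.assign(
        success=lambda df: df['success'].astype(int),
        failures=lambda df: 1 - df['success'],
    )
    # Group by IP address and by the hour each attempt falls in; only
    # (IP, hour) pairs with activity produce a row
    .groupby(['source_ip', pd.Grouper(freq='h')])
    .agg(
        usernames=('username', 'nunique'),
        success=('success', 'sum'),
        failures=('failures', 'sum'),
    )
    .assign(
        attempts=lambda df: df['success'] + df['failures'],
        success_rate=lambda df: df['success'] / df['attempts'],
    )
    .reset_index()
)
print(hourly)
```

With this toy input, the result has three rows: one for each hour in which 10.0.0.1 was active and a single row for 10.0.0.2, whose three failed attempts across three distinct usernames all fall within the same hour.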

Summary

In our second application chapter, we learned how to simulate events in Python and got additional exposure to writing packages. We also saw how to write Python scripts that can be run from the command line, which we used to run our simulation of the login attempt data. Then, we performed some EDA on the simulated data to see whether we could figure out what would make hacker activity easy to spot.

This led us to zero in on the number of distinct usernames attempting to authenticate per IP address per hour, as well as the number of attempts and failure rates. Using these metrics, we were able to create a scatter plot, which appeared to show two distinct groups of points, along with some other points connecting the two groups; naturally, these represented the groups of valid users and the nefarious ones, with some of the hackers not being as obvious as others.

Finally, we set about creating rules that would flag the hacker IP addresses for their suspicious activity. First...

Exercises

Complete the following exercises to practice the concepts covered in this chapter:

  1. Run the simulation for December 2018 into new log files without making the user base again. Be sure to run python3 simulate.py -h to review the command-line arguments. Set the seed to 27. This data will be used for the remaining exercises.
  2. Find the number of unique usernames, attempts, successes, and failures, as well as the success/failure rates per IP address, using the data simulated from exercise 1.
  3. Create two subplots with failures versus attempts on the left, and failure rate versus distinct usernames on the right. Draw decision boundaries for the resulting plots. Be sure to color each data point by whether or not it is a hacker IP address.
  4. Build rule-based criteria using the percentage difference from the median that flag an IP address if the failures and attempts are both five times their respective medians, or if the distinct usernames count is five times its...

Further reading

Check out the following resources for more information on the topics covered in this chapter:
