10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type: Book
Published: May 2023
Publisher: Packt
ISBN-13: 9781804619476
Pages: 330
Edition: 1st
Author: Rajvardhan Oak

Table of Contents (15 chapters)

Preface
Chapter 1: On Cybersecurity and Machine Learning
Chapter 2: Detecting Suspicious Activity
Chapter 3: Malware Detection Using Transformers and BERT
Chapter 4: Detecting Fake Reviews
Chapter 5: Detecting Deepfakes
Chapter 6: Detecting Machine-Generated Text
Chapter 7: Attributing Authorship and How to Evade It
Chapter 8: Detecting Fake News with Graph Neural Networks
Chapter 9: Attacking Models with Adversarial Machine Learning
Chapter 10: Protecting User Privacy with Differential Privacy
Chapter 11: Protecting User Privacy with Federated Machine Learning
Chapter 12: Breaking into the Sec-ML Industry
Index
Other Books You May Enjoy

Basics of anomaly detection

In this section, we will look at anomaly detection, which forms the foundation for detecting intrusions and suspicious activity.

What is anomaly detection?

The word anomaly means something that deviates from what is standard, normal, or expected. Anomalies are events or data points that do not fit in with the rest of the data; they represent deviations from the expected trend and are, by nature, rare and therefore few in number.

For example, consider a bot or fraud detection model used in a social media website such as Twitter. If we examine the number of follow requests sent to a user per day, we can get a general sense of the trend and plot this data. Let’s say that we plotted this data for a month, and ended up with the following trend:

Figure 2.1 – Trend for the number of follow requests over a month

What do you notice? The user seems to have roughly 30-40 follow requests per day. On the 8th and 18th days, however, we see a spike that clearly stands out from the daily trend. These two days are anomalies.
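One simple way to flag such spikes programmatically is a z-score rule: mark any day whose count lies more than a few standard deviations above the mean. The sketch below uses synthetic counts invented to mimic the trend described above (the real figure's data is not available here):

```python
import numpy as np

# Synthetic daily follow-request counts for a 30-day month:
# roughly 30-40 per day, with artificial spikes on days 8 and 18
rng = np.random.default_rng(0)
counts = rng.integers(30, 41, size=30).astype(float)
counts[7] = 120.0   # day 8
counts[17] = 150.0  # day 18

# Flag days whose z-score exceeds a threshold (2.5 here, a common choice)
mean, std = counts.mean(), counts.std()
z_scores = (counts - mean) / std
anomalous_days = np.where(z_scores > 2.5)[0] + 1  # convert to 1-indexed days

print(anomalous_days)
```

With these synthetic numbers, only days 8 and 18 exceed the threshold, matching the two spikes visible in the trend plot.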

Anomalies can also be observed visually in a two-dimensional space: if we plot all the points in the dataset, the anomalies should stand out as being different from the others. For instance, continuing with the same example, say we have a number of features for each user, such as the number of messages sent, likes, and retweets. Using all of these features together, we can construct an n-dimensional feature vector for a user. By applying a dimensionality reduction algorithm such as principal component analysis (PCA), which, at a high level, projects data into a lower-dimensional space while retaining most of its structure, we can reduce each vector to two dimensions and plot the data. Say we get a plot as follows, where each point represents a user and the axes represent the principal components of the original data. The points colored in red clearly stand out from the rest of the data; these are outliers:
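In practice you would typically reach for scikit-learn's PCA, but the projection itself is only a few lines of NumPy. The sketch below reduces synthetic 5-dimensional user features to two dimensions; all the numbers (feature counts, the number of users, the outlier magnitudes) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 5-dimensional feature vectors (messages sent, likes,
# retweets, ...) for 100 "normal" users plus 3 outliers with
# heavily inflated activity
normal = rng.normal(loc=50, scale=5, size=(100, 5))
outliers = rng.normal(loc=200, scale=5, size=(3, 5))
X = np.vstack([normal, outliers])

# PCA via SVD: center the data, then project onto the top 2 right
# singular vectors (the first two principal components)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)  # (103, 2)
```

Plotting `X_2d` with a scatter plot would show the three outliers far from the main cluster along the first principal component, which is exactly the visual separation Figure 2.2 illustrates.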

Figure 2.2 – A 2D representation of data with anomalies in red

Note that anomalies do not necessarily represent a malicious event—they simply indicate that the trend deviates from what was normally expected. For example, a user suddenly receiving increased amounts of friend requests is anomalous, but this may have been because they posted some very engaging content. Anomalies, when flagged, must be investigated to determine whether they are malicious or benign.

Anomaly detection is considered an important problem in the field of cybersecurity. Unusual or abnormal events can often indicate security breaches or attacks. Furthermore, anomaly detection does not need labeled data, which is hard to come by in security problems.

Introducing the NSL-KDD dataset

Now that we have introduced anomaly detection in sufficient detail, we will look at a real-world dataset that will help us observe and detect anomalies in action.

The data

Before we jump into any algorithms for anomaly detection, let us talk about the dataset we will be using in this chapter. The dataset popularly used for anomaly and intrusion detection tasks is the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) dataset. It is a refined version of the KDD Cup 1999 dataset, which was created for a competition held at the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99). The task in the competition was to develop a network intrusion detector: a predictive model that can distinguish between bad connections, called intrusions or attacks, and benign normal connections. The dataset contains a standard set of connection records to be audited, including a wide variety of intrusions simulated in a military network environment.

Exploratory data analysis (EDA)

This activity consists of a few steps, which we will look at in the next subsections.

Downloading the data

The actual NSL-KDD dataset is fairly large (nearly 4 million records). We will be using a smaller version of the data that is a 10% subset randomly sampled from the whole data. This will make our analysis feasible. You can, of course, experiment by downloading the full data and rerunning our experiments.

First, we import the necessary Python libraries:

import pandas as pd       # DataFrame construction and manipulation
import numpy as np        # numerical operations
import os                 # shell commands for download and decompression
from requests import get  # HTTP download of the names file

Then, we set the URLs for the training and test data, as well as the URL of a names file that holds the header (the names of the features) for the data:

train_data_page = "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz"
test_data_page = "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.testdata.unlabeled_10_percent.gz"
labels ="http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names"
datadir = "data"

Next, we download the data and column names using the wget command through Python. As these files are gzip-compressed, we have to first extract the contents using the gunzip command. The following Python code snippet does that for us:

# Download and decompress the training data
print("Downloading Training Data")
os.system("wget " + train_data_page)
training_file_name = train_data_page.split("/")[-1].replace(".gz", "")
os.system("gunzip " + training_file_name)  # gunzip locates the .gz file for us
# The with block closes the file automatically; no explicit close() is needed
with open(training_file_name, "r") as ff:
  lines = [i.strip().split(",") for i in ff.readlines()]
# Download the column names (the file lists "name: type" pairs)
print("Downloading Training Labels")
response = get(labels)
labels = response.text
labels = [i.split(",")[0].split(":") for i in labels.split("\n")]
labels = [i for i in labels if i[0] != '']
final_labels = labels[1:]  # skip the first line, which lists the attack types

Finally, we construct a DataFrame from the downloaded streams:

data = pd.DataFrame(lines)
labels = final_labels
# The last column holds the attack label; name it 'target'
data.columns = [i[0] for i in labels] + ['target']
# Cast continuous features from strings to floats
for i in range(len(labels)):
  if labels[i][1] == ' continuous.':
    data.iloc[:, i] = data.iloc[:, i].astype(float)

This completes our step of downloading the data and creating a DataFrame from it. A DataFrame is a tabular data structure that will allow us to manipulate, slice and dice, and filter the data as needed.
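To see what this construction produces without downloading anything, the same logic can be exercised on a tiny hand-made sample. The rows and the (name, type) pairs below are fabricated for illustration and follow the KDD format only loosely:

```python
import pandas as pd

# Fabricated raw rows in the same comma-separated shape as the download
lines = [
    ["0", "tcp", "http", "181", "normal."],
    ["0", "tcp", "http", "239", "normal."],
    ["0", "icmp", "ecr_i", "1032", "smurf."],
]
# Fabricated (name, type) pairs mirroring the parsed kddcup.names file
final_labels = [
    ["duration", " continuous."],
    ["protocol_type", " symbolic."],
    ["service", " symbolic."],
    ["src_bytes", " continuous."],
]

data = pd.DataFrame(lines)
data.columns = [i[0] for i in final_labels] + ['target']
# Cast the continuous columns from strings to floats, as in the pipeline
for i in range(len(final_labels)):
    if final_labels[i][1] == ' continuous.':
        data.iloc[:, i] = data.iloc[:, i].astype(float)

print(data['target'].value_counts())
```

After the cast, the duration and src_bytes columns hold floats while the symbolic columns remain strings, and value_counts() gives the per-attack distribution just as it will on the real data.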

Understanding the data

Once the data is downloaded, you can have a look at the DataFrame simply by printing the top five rows:

data.head()

This should give you an output just like this:

Figure 2.3 – Top five rows from the NSL-KDD dataset

As you can see, the top five rows of the data are displayed. The dataset has 42 columns. The last column, named target, identifies the kind of network attack for every row in the data. To examine the distribution of network attacks (that is, how many examples of each kind of attack are present), we can run the following statement:

data['target'].value_counts()

This will list all network attacks and the count (number of rows) for each attack, as follows:

Figure 2.4 – Distribution of data by label (attack type)

We can see that a variety of attack types are present in the data, with the smurf and neptune types accounting for the largest share. Next, we will look at how to model this data using statistical algorithms.
