Machine learning algorithms have helped solve real-world problems as diverse as disease prediction and online shopping. However, many problems we would like to address with machine learning involve imbalanced datasets. In this chapter, we will discuss and define imbalanced datasets, explaining how they differ from other types of datasets. The ubiquity of imbalanced data will be demonstrated with examples of common problems and scenarios. We will also go through the basics of machine learning, covering essentials such as loss functions, regularization, and feature engineering, and learn about common evaluation metrics, particularly those that are very helpful for imbalanced datasets. We will then introduce the imbalanced-learn library.
In particular, we will learn about the following topics:

- Introduction to imbalanced datasets
- Machine learning basics: loss functions, evaluation metrics, and model fit
- When data imbalance is a problem and when it can be safely ignored
- Getting started with the imbalanced-learn library

In this chapter, we will utilize common libraries such as numpy and scikit-learn and introduce the imbalanced-learn library. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/main/chapter01. You can fire up the GitHub notebook using Google Colab by clicking on the Open in Colab icon at the top of this chapter's notebook or by launching it from https://colab.research.google.com using the GitHub URL of the notebook.
Machine learning algorithms learn from collections of examples that we call datasets. These datasets contain multiple data samples or points, which we may refer to as examples, samples, or instances interchangeably throughout this book.
A dataset can be said to have a balanced distribution when all the target classes have a similar number of examples, as shown in Figure 1.1:
Figure 1.1 – Balanced distribution with an almost equal number of examples for each class
Imbalanced datasets, or skewed datasets, are those in which some target classes (also called labels) greatly outnumber the rest of the classes (Figure 1.2). Though imbalance is generally discussed in the context of classification problems (for example, fraud detection), it inevitably occurs in regression problems (for example, house price prediction) too:
Figure 1.2 – An imbalanced dataset with five classes and a varying number of samples
We label the class with more instances as the “majority” or “negative” class and the one with fewer instances as the “minority” or “positive” class. Most of the time, our main interest lies in the minority class, which is why we often refer to the minority class as the “positive” class and to the majority class as the “negative” class:
Figure 1.3 – A visual guide to common terminology used in imbalanced classification
This can be scaled to more than two classes, and such classification problems are called multi-class classification. In the first half of this book, we will focus our attention only on binary classification to keep the material easier to grasp. It’s relatively easy to extend the concepts to multi-class classification.
Let’s look at a few examples of imbalanced datasets:
In this book, we focus on the class imbalance problem in general and look at various solutions for cases where class imbalance is hurting the performance of our model. A typical symptom is that models perform quite poorly on the minority classes, for which the model has seen very few examples during training.
Let’s do a quick overview of machine learning and its related fields:
In supervised learning (which is the focus of this book), there are two main types of problems: classification and regression. Classification problems involve categorizing data into predefined classes or labels, such as “fraud” or “non-fraud” and “spam” or “non-spam.” On the other hand, regression problems aim to predict a continuous variable, such as the price of a house.
While data imbalance can also affect regression problems, this book will concentrate solely on classification problems. This focus is due to several factors, such as the limited scope of this book and the well-established techniques available for classification. In some cases, you might even be able to reframe a regression problem as a classification problem, making the methods discussed in this book still relevant.
When it comes to various kinds of models that are popular for classification problems, we have quite a few categories of classical supervised machine learning models:
Figure 1.4 displays the decision boundaries of various classifiers we have reviewed so far. It shows that logistic regression has a linear decision boundary, while tree-based models such as decision trees, random forests, and XGBoost work by dividing examples into axis-parallel rectangles to form their decision boundary. SVM, on the other hand, transforms the data to a different space so that it can plot its non-linear decision boundary. Neural networks have a non-linear decision boundary:
Figure 1.4 – The decision boundaries of popular machine learning algorithms on an imbalanced dataset
Next, we’ll delve into the principles underlying the process of model training.
In the training phase of a machine learning model, we provide a dataset consisting of examples, each with input features and a corresponding label, to the model. Let X represent the list of features used for training, and y be the list of labels in the training dataset. The goal of the model is to learn a function, f, such that f(X) ≈ y.
The model has adjustable parameters, denoted as θ, which are fine-tuned during the training process. The error function, commonly referred to as the loss function, is defined as L(y, f(X; θ)). This error function needs to be minimized by a learning algorithm, which finds the optimal setting of these parameters, θ* = argmin over θ of L(y, f(X; θ)).
In classification problems, our typical loss function is the cross-entropy loss (also called the log loss):

L(y, p) = -[y · log(p) + (1 - y) · log(1 - p)]

Here, p is the predicted probability from the model when y = 1, that is, the predicted probability of the positive class.
When the model’s prediction closely agrees with the target label, the loss function will approach zero. However, when the prediction deviates significantly from the target, the loss can become arbitrarily large, indicating a poor model fit.
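As a small, hedged illustration of this behavior using scikit-learn's log_loss (the probabilities below are invented for illustration):

from sklearn.metrics import log_loss

# A single positive example (y = 1) scored by two hypothetical models:
# one confidently correct, one confidently wrong.
# Probabilities are listed as [P(class 0), P(class 1)].
print(log_loss([1], [[0.01, 0.99]], labels=[0, 1]))  # ~0.01: prediction agrees with the label
print(log_loss([1], [[0.99, 0.01]], labels=[0, 1]))  # ~4.61: prediction far from the label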
As training progresses, the training loss keeps going down (Figure 1.5):
Figure 1.5 – Rate of change of the loss function as training progresses
This brings us to the concept of the fit of a model:
Figure 1.6 – Underfit, right fit, and overfit models for classification task
Next, let’s briefly try to learn about two important concepts in machine learning:
Typically, we train our model on the training set and evaluate it on an independent, unseen dataset called the test set. We do this to obtain a fair evaluation of the model. If we instead train the model on the full dataset and evaluate it on that same dataset, we cannot tell how well the model would perform on unseen data, and the model will likely be overfitted.
We may encounter three kinds of datasets in machine learning: the training set, the validation set, and the test set.
When working with small example datasets, it’s common to allocate 80% of the data for the training set, 10% for the validation set, and 10% for the test set. However, the specific ratio between training and test sets is not as important as ensuring that the test set is large enough to provide statistically meaningful evaluation results. In the context of big data, a split of 98%, 1%, and 1% for training, validation, and test sets, respectively, could be appropriate.
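As a rough sketch of such a split, assuming a feature matrix X and labels y are already loaded (the two-step use of train_test_split is just one way to get an 80/10/10 division):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data first, then split that portion in half,
# giving roughly 80% train, 10% validation, and 10% test overall.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)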
Often, people don’t have a dedicated validation set for hyperparameter tuning and refer to the test set as an evaluation set. This can happen when the hyperparameter tuning is not performed as a part of the regular training cycle and is a one-off activity.
Cross-validation can be a confusing term at first. Breaking it down, cross + validation suggests validation performed across something; that something, in our case, is the held-out evaluation set: instead of relying on a single fixed split, different portions of the data take turns serving as the evaluation set.
Let’s see what cross-validation is:
Let’s look at the different types of cross-validation:
k-fold cross-validation is mainly used when you have limited data points, say 100 points. Using 5 or 10 folds is the most common choice when doing cross-validation, as shown in the sketch below.
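Below is a minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score; the logistic regression model and the toy dataset are placeholders, not part of the chapter's running example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small toy dataset of 100 points, the regime where cross-validation shines.
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds takes a turn as the held-out evaluation set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())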
Let’s look at the common evaluation metrics in machine learning, with a special focus on the ones relevant to problems with imbalanced data.
Several machine learning and deep learning metrics are used for evaluating the performance of classification models. Let’s look at some of the metrics that are particularly helpful for evaluating the performance of our model on the test set.
Given a model that tries to classify an example as belonging to the positive or negative class, there are four possibilities:
Table 1.1 shows in what ways the model can get “confused” when making predictions, aptly called the confusion matrix. The confusion matrix forms the basis of many common metrics in machine learning:
|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

Table 1.1 – Confusion matrix
Let’s look at some of the most common metrics in machine learning:
Accuracy is the fraction of all predictions the model gets right, across both classes. This is available in the sklearn library as sklearn.metrics.accuracy_score.
Precision measures how often the model is correct when it predicts the positive class, that is, TP / (TP + FP). You can find this functionality in the sklearn library under the name sklearn.metrics.precision_score.
Recall measures the model’s ability to correctly detect all the positive instances, that is, TP / (TP + FN). Recall can be considered to be the accuracy of the positive class in binary classification. You can find this functionality in the sklearn library under the name sklearn.metrics.recall_score.
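Here is a small sketch of these three metrics on made-up labels and predictions (the numbers are invented purely for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 0.7   -> 7 of the 10 predictions are correct
print(precision_score(y_true, y_pred))  # 0.667 -> 2 of the 3 predicted positives are real
print(recall_score(y_true, y_pred))     # 0.5   -> 2 of the 4 actual positives were found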
Table 1.2 summarizes the differences between precision and recall:
|                                 | Precision                                                                                                        | Recall                                                                                                        |
|---------------------------------|------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Definition                      | Precision is a measure of trustworthiness                                                                         | Recall is a measure of completeness                                                                              |
| Question to ask                 | When the model says something is positive, how often is it right?                                                 | Out of all the positive instances, how many did the model correctly identify?                                    |
| Example (using an email filter) | Precision measures how many of the emails the model flags as spam are actually spam, as a percentage of all the flagged emails | Recall measures how many of the actual spam emails the model catches, as a percentage of all the spam emails in the dataset |
| Formula                         | Precision = TP / (TP + FP)                                                                                         | Recall = TP / (TP + FN)                                                                                           |

Table 1.2 – Precision versus recall
Why can accuracy be a bad metric for imbalanced datasets?
Let’s assume we have an imbalanced dataset with 1,000 examples, with 100 labels belonging to class 1 (the minority class) and 900 belonging to class 0 (the majority class).
Let’s say we have a model that always predicts 0 for all examples. The model’s accuracy for the minority class is 0/100 = 0%, even though its overall accuracy is a seemingly impressive 900/1,000 = 90%.
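We can verify this with a few lines of code; this is a small sketch mirroring the hypothetical 1,000-example dataset above:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 900 + [1] * 100)  # 900 majority examples, 100 minority examples
y_pred = np.zeros(1000, dtype=int)        # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))  # 0.9 -> looks impressive
print(recall_score(y_true, y_pred))    # 0.0 -> not a single minority example is caught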
Figure 1.7 – A comic showing accuracy may not always be the right metric
This brings us to the precision-recall trade-off in machine learning. Usually, precision and recall are inversely correlated – that is, when recall increases, precision most often decreases. Why? Note that recall = TP / (TP + FN), so for recall to increase, FN should decrease. This means the model needs to classify more items as positive. However, if the model classifies more items as positive, some of these will likely be incorrect classifications, leading to an increase in the number of false positives (FPs). As the number of FPs increases, precision, defined as TP / (TP + FP), will decrease. With similar logic, you can argue that when recall decreases, precision often increases.
Next, let’s try to understand some of the precision and recall-based metrics that can help measure the performance of models trained on imbalanced data:
The F1 score is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). This is available in the sklearn library as sklearn.metrics.f1_score.
The F-beta score generalizes the F1 score. The formula for the F-beta score is as follows:

F_beta = (1 + beta²) · (precision · recall) / (beta² · precision + recall)

Here, beta is a positive parameter that determines the weight given to precision in the calculation of the score. When beta is set to 1, the F1 score is obtained, which is the harmonic mean of precision and recall. The F-beta score is a useful metric for imbalanced datasets, where one class may be more important than the other. By adjusting the beta parameter, we can control the relative importance of precision and recall for a particular class. For example, if we want to prioritize precision over recall for the minority class, we can set beta < 1. To see why that’s the case, set beta = 0 in the formula, which implies F_0 = precision.
Conversely, if we want to prioritize recall over precision for the minority class, we can set beta > 1 (we can set β = ∞ in the formula to see it reduce to recall).
In practice, the choice of beta parameter depends on the specific problem and the desired trade-off between precision and recall. In general, higher values of beta result in more emphasis on recall, while lower values of beta result in more emphasis on precision. This is available in the sklearn library as sklearn.metrics.fbeta_score.
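As a quick, hedged sketch of how beta shifts the score between precision and recall, reusing the made-up predictions from before (precision is 2/3 and recall is 1/2 there):

from sklearn.metrics import fbeta_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.625 -> pulled toward precision
print(fbeta_score(y_true, y_pred, beta=1.0))  # ~0.571 -> the usual F1 score
print(fbeta_score(y_true, y_pred, beta=2.0))  # ~0.526 -> pulled toward recall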
Balanced accuracy is the average of the recall obtained on each class (in the binary case, the average of sensitivity and specificity). This is available in the sklearn library as sklearn.metrics.balanced_accuracy_score. Per-class precision, recall, and F-score can be obtained together using the sklearn.metrics.precision_recall_fscore_support and imblearn.metrics.classification_report_imbalanced APIs.
In imbalanced-learn, geometric_mean_score() is defined by the geometric mean of “accuracy on positive class examples” (recall or sensitivity or TPR) and “accuracy on negative class examples” (specificity or TNR). So, even if one class is heavily outnumbered by the other class, the metric will still be representative of the model’s overall performance. It is also reported by imblearn.metrics.classification_report_imbalanced.
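As a hedged illustration, we can compute both metrics for the hypothetical always-predict-0 model from the earlier 900/100 example:

import numpy as np
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import geometric_mean_score

y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000, dtype=int)  # always predict the majority class

print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -> average of 100% and 0% per-class recall
print(geometric_mean_score(y_true, y_pred))     # 0.0 -> sqrt(1.0 * 0.0) exposes the useless model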
Table 1.3 shows the associated metrics and their formulas as an extension of the confusion matrix:
|                   | Predicted Positive         | Predicted Negative  |                                                      |
|-------------------|----------------------------|---------------------|------------------------------------------------------|
| Actually Positive | True positive (TP)         | False negative (FN) | Recall = Sensitivity = True positive rate (TPR) = TP / (TP + FN) |
| Actually Negative | False positive (FP)        | True negative (TN)  | FPR = FP / (FP + TN)                                 |
|                   | Precision = TP / (TP + FP) |                     |                                                      |

Table 1.3 – Confusion matrix with various metrics and their definitions
Receiver Operating Characteristics, commonly known as ROC curves, are plots that display the TPR on the y-axis against the FPR on the x-axis for various threshold values:
Figure 1.8 – The ROC curve as a plot of TPR versus FPR (the dotted line shows a model with no skill)
Some properties of the ROC curve are as follows:
- The area under the ROC curve (AUC-ROC) can be interpreted as the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative example: P(score(x+) > score(x−)). Here, x+ denotes the positive (minority) class, and x− denotes the negative (majority) class.
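Here is a minimal sketch of computing the ROC curve and AUC-ROC with scikit-learn; the labels and scores are invented purely for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.7]  # predicted probability of the positive class

# TPR and FPR at each threshold; the AUC summarizes the whole curve in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))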
Now, let’s look at some of the problems in using ROC for imbalanced datasets:
Similar to ROC curves, Precision-Recall (PR) curves plot a pair of metrics for different threshold values. But unlike ROC curves, which plot TPR and FPR, PR curves plot precision and recall. To demonstrate the difference between the two curves, let’s say we compare the performance of two models – Model 1 and Model 2 – on a particular handcrafted imbalanced dataset:
This discrepancy between the ROC and PR curves also underscores the importance of using multiple metrics for model evaluation, particularly when dealing with imbalanced data:
Figure 1.9 – The PR curve can show obvious differences between models compared to the ROC curve
Let’s try to understand these observations in detail. While the ROC curve shows very little difference between the performance of the two models, the PR curve shows a much bigger gap. The reason for this is that the ROC curve uses FPR, which is FP/(FP+TN). Usually, TN is really high for an imbalanced dataset, and hence even if FP changes by a decent amount, FPR’s overall value is overshadowed by TN. Hence, ROC doesn’t change by a whole lot.
The conclusion of which classifier is superior can change with the distribution of classes in the test set. In the case of skewed datasets, the PR curve can more clearly show that the model did not work well compared to the ROC curve, as shown in the preceding figure.
The average precision is a single number that’s used to summarize a PR curve, and the corresponding API in sklearn is sklearn.metrics.average_precision_score.
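Analogously, here is a minimal sketch of the PR curve and average precision on the same invented scores:

from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.7]

# Precision and recall at each threshold; average precision summarizes the curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))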
The primary distinction between the ROC curve and the PR curve lies in the fact that while ROC assesses how well the model can “calculate” both positive and negative classes, PR solely focuses on the positive class. Therefore, when dealing with a balanced dataset scenario and you are concerned with both the positive and negative classes, ROC AUC works exceptionally well. In contrast, when dealing with an imbalanced situation, PR AUC is more suitable. However, it’s important to keep in mind that PR AUC only evaluates the model’s ability to “calculate” the positive class. Because PR curves are more sensitive to the positive (minority) class, we will be using PR curves throughout the first half of this book.
We can reimagine the PR curve with precision on the x-axis and TPR, also known as recall, on the y-axis. The key difference between the two curves is that while the ROC curve uses FPR, the PR curve uses precision.
As discussed earlier, FPR tends to be very low when dealing with imbalanced datasets. This aspect of having low FPR values is crucial in certain applications such as fraud detection, where the capacity for manual investigations is inherently limited. Consequently, this perspective can alter the perceived performance of classifiers. As shown in Figure 1.9, it’s also possible that the performances of the two models seem reversed when compared using average precision (0.69 versus 0.90) instead of AUC-ROC (0.97 and 0.95).
Let’s summarize this:
As TPR equals recall, the two plots only differ in what recall is compared to – either precision or FPR. Additionally, the plots are rotated by 90 degrees relative to each other:
|                  | AUC-ROC                             | AUC-PR                              |
|------------------|--------------------------------------|--------------------------------------|
| General formula  | AUC(TPR, FPR)                        | AUC(Precision, Recall)               |
| Expanded formula | AUC(TP / (TP + FN), FP / (FP + TN))  | AUC(TP / (TP + FP), TP / (TP + FN))  |
| Equivalence      | AUC(Recall, FPR)                     | AUC(Precision, Recall)               |

Table 1.4 – Comparing the ROC and PR curves
In the next few sections, we’ll explore the circumstances that lead to imbalances in datasets, the challenges these imbalances can pose, and the situations where data imbalance might not be a concern.
In certain instances, directly using data for machine learning without worrying about data imbalance can yield usable results suitable for a given business scenario. Yet, there are situations where a more dedicated effort is needed to manage the effects of imbalanced data.
Broad statements claiming that you must always or never adjust for imbalanced classes tend to be misleading. The truth is that the need to address class imbalance is contingent on the specific characteristics of the data, the problem at hand, and the definition of an acceptable solution. Therefore, the approach to dealing with class imbalance should be tailored according to these factors.
In this section, we’ll explore various situations and causes leading to an imbalance in datasets, such as rare event occurrences or skewed data collection processes:
Let’s delve into the difficulties posed by imbalanced data on model predictions and their impact on model performance:
Figure 1.10 – Change in decision boundary with a different distribution of minority class examples – the crosses denote the majority class, and the circles denote the minority class
Next, let’s try to see when we shouldn’t do anything about data imbalance.
Class imbalance may not always negatively impact performance, and using imbalance-specific methods can sometimes worsen results [5]. Therefore, it’s crucial to accurately assess whether a task is genuinely affected by class imbalance before applying any specialized techniques. One such strategy can be as simple as setting up a baseline model without worrying about class imbalance and observing the model’s performance on various classes using various performance metrics.
Let’s explore scenarios where data imbalance may not be a concern and no corrective measures may be needed:
In the next section, we will become familiar with a library that can be very useful when dealing with imbalanced data. We will train a model on an imbalanced toy dataset and look at some metrics to evaluate the performance of the trained model.
imbalanced-learn (imported as imblearn) is a Python package that offers several techniques to deal with data imbalance. In the first half of this book, we will rely heavily on this library. Let’s install the imbalanced-learn library:
pip3 install imbalanced-learn==0.11.0
We can use imbalanced-learn to create a synthetic dataset for our analysis:
from sklearn.datasets import make_classification
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def make_data(sep):
    # weights=[0.995] makes class 0 the majority (~99.5% of samples)
    X, y = make_classification(n_samples=50000, n_features=2,
                               n_redundant=0, n_clusters_per_class=1,
                               weights=[0.995], class_sep=sep,
                               random_state=1)
    X = pd.DataFrame(X, columns=['feature_1', 'feature_2'])
    y = pd.Series(y)
    return X, y
Let’s analyze the generated dataset:
from collections import Counter

sep = 2
X, y = make_data(sep=sep)
print(y.value_counts())

sns.scatterplot(data=X, x="feature_1", y="feature_2", hue=y)
plt.title('Separation: {}'.format(sep))
plt.show()
Here’s the output:
0    49498
1      502
Figure 1.11 – A two-class dataset with two features
Let’s split this dataset into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
print('train data: ', Counter(y_train))
print('test data: ', Counter(y_test))
Here’s the output:
train data:  Counter({0: 39598, 1: 402})
test data:  Counter({0: 9900, 1: 100})
Note the usage of stratify in the train_test_split API of sklearn. Specifying stratify=y ensures we maintain the same ratio of majority and minority classes in both the training set and the test set. Let’s understand stratification in more detail.
Stratified sampling is a way to split the dataset into various subgroups (called “strata”) based on certain characteristics they share. It can be highly valuable when dealing with imbalanced datasets because it ensures that the train and test datasets have the same proportions of class labels as the original dataset.
In an imbalanced dataset, the minority class constitutes a small fraction of the total data. If we perform a simple random split without any stratification, there’s a risk that the minority class may not be adequately represented in the training set or could be entirely left out from the test set, which may lead to poor performance and unreliable evaluation metrics.
With stratified sampling, the proportion of each class in the overall dataset is preserved in both training and test sets, ensuring representative sampling and a better chance for the model to learn from the minority class. This leads to a more robust model and a more reliable evaluation of the model’s performance.
The scikit-learn APIs for stratification
The scikit-learn APIs, such as RepeatedStratifiedKFold and StratifiedKFold, employ the concept of stratification to evaluate model performance through cross-validation, especially when working with imbalanced datasets.
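As a brief sketch (reusing the X, y, and Counter from the code above), we can check that each stratified fold keeps roughly the original class ratio:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each fold's held-out part preserves roughly the same 0/1 ratio as the full dataset.
    print(fold, Counter(y.iloc[val_idx]))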
Now, let’s train a logistic regression model on training data:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0, max_iter=2000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
Let’s get the report metrics from the sklearn library:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      9900
           1       0.94      0.17      0.29       100

    accuracy                           0.99     10000
   macro avg       0.97      0.58      0.64     10000
weighted avg       0.99      0.99      0.99     10000
Let’s get the report metrics from imblearn:
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))
This outputs a lot more columns:
Figure 1.12 – Output of the classification report from imbalanced-learn
Do you notice the extra metrics here compared to the API of sklearn? We got three additional metrics: spe for specificity, geo for geometric mean, and iba for index balanced accuracy.
The imblearn.metrics module has several such functions that can be helpful for imbalanced datasets. Apart from classification_report_imbalanced(), it offers APIs such as sensitivity_specificity_support(), geometric_mean_score(), sensitivity_score(), and specificity_score().
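For example, a few of these can be applied directly to the y_test and y_pred from the model above (a small sketch):

from imblearn.metrics import geometric_mean_score, sensitivity_score, specificity_score

print(sensitivity_score(y_test, y_pred))     # recall on the positive (minority) class
print(specificity_score(y_test, y_pred))     # recall on the negative (majority) class
print(geometric_mean_score(y_test, y_pred))  # geometric mean of the two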
Usually, the first step in any machine learning pipeline should be to split the data into train/test/validation sets. We should avoid applying any techniques to handle the imbalance until after the data has been split. We should begin by splitting the data into training, testing, and validation sets and then proceed with any necessary adjustments to the training data. Applying techniques such as oversampling (see Chapter 2, Oversampling Methods) before splitting the data can result in data leakage, overfitting, and over-optimism [6].
We should ensure that the validation data closely resembles the test data. Both validation data and test data should represent real-world scenarios on which the model will be used for prediction. Avoid applying any sampling techniques or modifications to the validation set. The only requirement is to include a sufficient number of samples from all classes.
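As a hedged sketch of this order of operations, using imbalanced-learn's RandomOverSampler purely as a stand-in for any resampling technique (oversampling itself is the topic of Chapter 2):

from imblearn.over_sampling import RandomOverSampler

# Split first (already done above), then resample only the training portion.
# The validation and test sets are left untouched.
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
print(Counter(y_train_res))  # the two classes are now balanced in the training data only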
Let’s briefly switch to discussing unsupervised learning algorithms. Anomaly detection, or outlier detection, is a class of techniques that can be used for dealing with imbalanced data problems. Anomalies or outliers are data points that deviate significantly from the rest of the data. These anomalies often correspond to the minority class in an imbalanced dataset, making unsupervised methods potentially useful.
The term that’s often used for these kinds of problems is one-class classification. This technique is particularly beneficial when the positive (minority) cases are sparse or when gathering them before the training is not feasible. The model is trained exclusively on what is considered the “normal” or majority class. It then classifies new instances as “normal” or “anomalous,” effectively identifying what could be the minority class. This can be especially useful for binary imbalanced classification problems, where the majority class is deemed “normal,” and the minority class is considered an anomaly.
However, it does have a drawback: outliers or positive cases during training are discarded [7], which could lead to the potential loss of valuable information.
In summary, while unsupervised methods such as one-class classification offer an alternative for managing class imbalance, our discussion in this book will remain centered on supervised learning algorithms. Nevertheless, we recommend that you explore and experiment with such solutions when you find them appropriate.
Let’s summarize what we’ve learned so far. Imbalanced data is a common problem in machine learning, where there are significantly more instances of one class than another. Imbalanced datasets can arise from various situations, including rare event occurrences, high data collection costs, noisy labels, labeling errors, sampling bias, and data cleaning. This can be a challenge for machine learning models as they may be biased toward the majority class.
Several techniques can be used to deal with imbalanced data, such as oversampling, undersampling, and cost-sensitive learning. The best technique to use depends on the specific problem and the data.
In some cases, data imbalance may not be a concern. When the dataset is sufficiently large, the impact of data imbalance on the model’s performance may be reduced. However, it is still advisable to compare the baseline model’s performance with the performance of models that have been built using techniques that address data imbalance, such as threshold adjustment, data-based techniques (oversampling and undersampling), and algorithm-based techniques.
Traditional performance metrics such as accuracy can fail in imbalanced datasets. Some more useful metrics for imbalanced datasets are the ROC curve, the PR curve, precision, recall, and F1 score. While ROC curves are suitable for balanced datasets, PR curves are more suitable for imbalanced datasets when one class is more important than the other.
The imbalanced-learn library is a Python package that offers several techniques to deal with data imbalance.
There are some general rules to follow: split the data into train/test/validation sets before applying any techniques to handle the imbalance, ensure that the validation data closely resembles the test data and that the test data represents the data on which the model will make final predictions, and avoid applying any sampling techniques or modifications to the validation and test sets.
One-class classification, or anomaly detection, is another technique that can be used for dealing with imbalanced data problems in an unsupervised manner. In this book, we will focus our discussion on supervised learning algorithms only.
In the next chapter, we will look at one of the common ways to handle the data imbalance problem in datasets by applying oversampling techniques.
Load an imbalanced dataset using imbalanced-learn’s fetch_datasets API and then compute the values of MCC, accuracy, precision, recall, and F1 score. See if the MCC value can be a useful metric for this dataset.