Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Machine Learning for Imbalanced Data

You're reading from  Machine Learning for Imbalanced Data

Product type Book
Published in Nov 2023
Publisher Packt
ISBN-13 9781801070836
Pages 344 pages
Edition 1st Edition
Languages
Authors (2):
Kumar Abhishek Kumar Abhishek
Profile icon Kumar Abhishek
Dr. Mounir Abdelaziz Dr. Mounir Abdelaziz
Profile icon Dr. Mounir Abdelaziz
View More author details

Table of Contents (15) Chapters

Preface Chapter 1: Introduction to Data Imbalance in Machine Learning Chapter 2: Oversampling Methods Chapter 3: Undersampling Methods Chapter 4: Ensemble Methods Chapter 5: Cost-Sensitive Learning Chapter 6: Data Imbalance in Deep Learning Chapter 7: Data-Level Deep Learning Methods Chapter 8: Algorithm-Level Deep Learning Techniques Chapter 9: Hybrid Deep Learning Methods Chapter 10: Model Calibration Assessments Index Other Books You May Enjoy Appendix: Machine Learning Pipeline in Production

Introduction to the imbalanced-learn library

imbalanced-learn (imported as imblearn) is a Python package that offers several techniques to deal with data imbalance. In the first half of this book, we will rely heavily on this library. Let’s install the imbalanced-learn library:

pip3 install imbalanced-learn==0.11.0

We can use imbalanced-learn to create a synthetic dataset for our analysis:

from sklearn.datasets import make_classification
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def make_data(sep):
    X, y = make_classification(n_samples=50000,
        n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.995],
        class_sep=sep, random_state=1)
    X = pd.DataFrame(X, columns=['feature_1', 'feature_2'])
    y = pd.Series(y)
    return X, y

Let’s analyze the generated dataset:

from collections import Counter
X, y = make_data(sep=2)
print(y.value_counts())
sns.scatterplot(data=X, x="feature_1", y="feature_2", hue=y)
plt.title('Separation: {}'.format(separation))
plt.show()

Here’s the output:

0     49498
1       502

Figure 1.11 – 2 class dataset with two features

Let’s split this dataset into training and test sets:

From sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = \
    y, test_size=0.2, random_state=42)
print('train data: ', Counter(y_train))
print('test data: ', Counter(y_test))

Here’s the output:

train data:  Counter({0: 39598, 1: 402})
test data:  Counter({0: 9900, 1: 100})

Note the usage of stratify in the train_test_split API of sklearn. Specifying stratify=y ensures we maintain the same ratio of majority and minority classes in both the training set and the test set. Let’s understand stratification in more detail.

Stratified sampling is a way to split the dataset into various subgroups (called “strata”) based on certain characteristics they share. It can be highly valuable when dealing with imbalanced datasets because it ensures that the train and test datasets have the same proportions of class labels as the original dataset.

In an imbalanced dataset, the minority class constitutes a small fraction of the total data. If we perform a simple random split without any stratification, there’s a risk that the minority class may not be adequately represented in the training set or could be entirely left out from the test set, which may lead to poor performance and unreliable evaluation metrics.

With stratified sampling, the proportion of each class in the overall dataset is preserved in both training and test sets, ensuring representative sampling and a better chance for the model to learn from the minority class. This leads to a more robust model and a more reliable evaluation of the model’s performance.

The scikit-learn APIs for stratification

The scikit-learn APIs, such as RepeatedStratifiedKFold and StratifiedKFold, employ the concept of stratification to evaluate model performance through cross-validation, especially when working with imbalanced datasets.

Now, let’s train a logistic regression model on training data:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=0, max_iter=2000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

Let’s get the report metrics from the sklearn library:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

This outputs the following:

          precision     recall      f1-score    support
0         0.99          1.00        1.00        9900
1         0.94          0.17        0.29        100
accuracy                                0.99      10000
macro avg       0.97        0.58        0.64      10000
weighted avg    0.99        0.99        0.99      10000

Let’s get the report metrics from imblearn:

from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

This outputs a lot more columns:

Figure 1.12 – Output of the classification report from imbalanced-learn

Do you notice the extra metrics here compared to the API of sklearn? We got three additional metrics: spe for specificity, geo for geometric mean, and iba for index balanced accuracy.

The imblearn.metrics module has several such functions that can be helpful for imbalanced datasets. Apart from classification_report_imbalanced(), it offers APIs such as sensitivity_specificity_support(), geometric_mean_score(), sensitivity_score(), and specificity_score().

You have been reading a chapter from
Machine Learning for Imbalanced Data
Published in: Nov 2023 Publisher: Packt ISBN-13: 9781801070836
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}