Undersampling Methods

Sometimes, you have so much data that adding more data through oversampling only makes things worse. Don’t worry; we have a strategy for those situations as well: undersampling, also called downsampling. In this chapter, you will learn about the concept of undersampling, when to use it, and the various techniques to perform it. You will also see how to use these techniques via the imbalanced-learn library APIs and compare their performance using some classical machine learning models.

In this chapter, we will cover the following topics:

  • Introducing undersampling
  • When to avoid undersampling the majority class
  • Removing examples uniformly
  • Strategies for removing noisy observations
  • Strategies for removing easy observations

By the end of this chapter, you’ll have mastered various undersampling techniques for imbalanced datasets and will be able to confidently apply them with the imbalanced-learn library...

Technical requirements

This chapter will make use of common libraries such as matplotlib, seaborn, pandas, numpy, scikit-learn, and imbalanced-learn. The code and notebooks for this chapter can be found on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/master/chapter03. To run the notebook, there are two options: you can click the Open in Colab icon at the top of the chapter’s notebook, or you can launch it directly from https://colab.research.google.com using the GitHub URL of the notebook.

Introducing undersampling

Two households, both alike in dignity,

In fair Verona, where we lay our scene,

From ancient grudge break to new mutiny,

Where civil blood makes civil hands unclean.

– Opening lines of Romeo and Juliet, by Shakespeare

Let’s look at a scenario inspired by Shakespeare’s play Romeo and Juliet. Imagine a town with two warring communities, the Montagues and the Capulets, who have been enemies for generations. The Montagues are in the minority and the Capulets are in the majority. The Montagues are super rich and powerful; the Capulets are not that well off. This creates a complex situation, and the rivalry leads to regular riots in the town. One day, the Montagues win the king’s favor and conspire to eliminate some Capulets to bring their numbers down. The idea is that with fewer Capulets in the town, the Montagues will no longer be in the minority. The king agrees to the plan as he...

When to avoid undersampling the majority class

Undersampling is not a panacea and may not always work. It depends on the dataset and model under consideration:

  • Too little training data for all the classes: If the dataset is already small, undersampling the majority class can lead to a significant loss of information. In such cases, it is advisable to try gathering more data or exploring other techniques, such as oversampling the minority class to balance the class distribution.
  • Majority class equally important or more important than minority class: In specific scenarios, such as the spam filtering example mentioned in Chapter 1, Introduction to Data Imbalance in Machine Learning, it is crucial to maintain high accuracy in identifying the majority class instances. In such situations, undersampling the majority class might reduce the model’s ability to accurately classify majority class instances, leading to a higher false positive rate. Instead, alternative methods...

Removing examples uniformly

There are two major ways of removing majority class examples uniformly from the data: the first is to remove examples at random, and the second involves using clustering techniques. Let’s discuss both of these methods in detail.

Random UnderSampling

The first technique the king might think of is to pick Capulets at random and remove them from the town. This is a naïve approach. It might work, and the king might be able to bring peace to the town, but he might also cause unforeseen damage by removing some influential Capulets. Still, it is an excellent place to start our discussion. This technique can be considered a close cousin of random oversampling. In Random UnderSampling (RUS), as the name suggests, we randomly remove observations from the majority class until the classes are balanced. This technique inevitably leads to data loss, can harm the underlying structure of the data, and thus sometimes performs poorly...
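To make this concrete, here is a minimal sketch of random undersampling using the RandomUnderSampler class from imbalanced-learn. The synthetic dataset generated with scikit-learn’s make_classification (with a roughly 9:1 class ratio) is purely illustrative and is not the dataset used elsewhere in this chapter:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset with roughly a 9:1 class ratio.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Before undersampling:", Counter(y))

# Randomly drop majority class samples until both classes have equal counts.
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("After undersampling:", Counter(y_res))

Note that sampling_strategy does not have to be "auto" (fully balanced); it also accepts a float (the desired minority-to-majority ratio for binary problems) or a dictionary of per-class target counts if you only want to undersample partially.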

Strategies for removing noisy observations

The king might decide to look at the friendships and locations of the citizens before removing anyone. For example, he might remove the Capulets who are rich and live near the Montagues. This could bring peace to the city by separating the feuding clans. Let’s look at some strategies to do that with our data.

ENN, RENN, and AllKNN

The king can remove the Capulets based on their neighbors. For example, if one or more of the three closest neighbors of a Capulet are Montagues, the king can remove that Capulet. This technique is called Edited Nearest Neighbors (ENN) [5]. ENN removes the examples near the decision boundary to increase the separation between classes. We fit a KNN model to the whole dataset and remove the examples whose neighbors don’t belong to the same class. The imbalanced-learn library gives us options to decide which classes we would like to resample and what kind of class arrangement the neighbors of the sample...
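As a rough sketch of how these cleaning methods look in code, the snippet below runs ENN together with its two variants, Repeated ENN (RENN) and AllKNN, from imbalanced-learn on an illustrative synthetic dataset; the parameter values shown are merely reasonable starting points, not recommendations from this chapter:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    AllKNN,
    EditedNearestNeighbours,
    RepeatedEditedNearestNeighbours,
)

# Illustrative imbalanced dataset with roughly a 9:1 class ratio.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# ENN: remove a majority sample if its neighbors disagree with its label.
# kind_sel="all" keeps a sample only when all n_neighbors agree with it.
enn = EditedNearestNeighbours(n_neighbors=3, kind_sel="all")

# RENN: repeat ENN until no further samples are removed (or max_iter is hit).
renn = RepeatedEditedNearestNeighbours(n_neighbors=3, max_iter=100)

# AllKNN: run ENN with a growing neighborhood size, from 1 up to n_neighbors.
allknn = AllKNN(n_neighbors=3)

for name, sampler in [("ENN", enn), ("RENN", renn), ("AllKNN", allknn)]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))

Because these are cleaning methods, they do not balance the classes to a fixed size; they only remove the majority samples they consider noisy, so the resulting class counts differ between the three samplers.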

Strategies for removing easy observations

The reverse of the strategy of removing the rich and famous Capulets is to remove the poor and weak ones. This section discusses techniques for removing the majority samples that lie far away from the minority samples. Instead of discarding the samples near the boundary between the two classes, we keep them for training the model, which helps it discriminate between the classes better. However, one downside is that these algorithms risk retaining noisy data points, which could then be used to train the model, potentially introducing noise into the predictive system.

Condensed Nearest Neighbors

Condensed Nearest Neighbors (CNNeighbors) [11] is an algorithm that works as follows:

  1. We add all the minority samples, along with one randomly selected majority sample, to a set. Let’s call this set C.
  2. We train a KNN model with k = 1 on set C.
  3. Now, we repeat the following four steps for each of the remaining majority samples...
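Without reproducing the remaining steps here, the snippet below is a minimal usage sketch of the imbalanced-learn implementation (which the library spells CondensedNearestNeighbour); the dataset and parameter values are illustrative. Because the procedure repeatedly refits a 1-NN classifier, it can be slow on large datasets, so the example keeps the data small:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Small illustrative dataset; CNN can be slow on larger ones.
X, y = make_classification(
    n_samples=2_000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Before:", Counter(y))

# n_neighbors=1 mirrors the 1-NN rule described in the steps above.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print("After:", Counter(y_res))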

Summary

In this chapter, we discussed undersampling, an approach to address the class imbalance in datasets by reducing the number of samples in the majority class. We reviewed the advantages of undersampling, such as keeping the data size in check and reducing the chances of overfitting. Undersampling methods can be categorized into fixed methods, which reduce the number of majority class samples to a fixed size, and cleaning methods, which reduce majority class samples based on predetermined criteria.

We went over various undersampling techniques, including random undersampling, instance hardness-based undersampling, ClusterCentroids, ENN, Tomek links, NCR, CNNeighbors, one-sided selection, and combinations of undersampling and oversampling techniques, such as SMOTEENN and SMOTETomek.

We concluded with a performance comparison of various undersampling techniques from the imbalanced-learn library on logistic regression and random forest models, using a few...

Exercises

  1. Explore the various undersampling APIs available from the imbalanced-learn library at https://imbalanced-learn.org/stable/references/under_sampling.html.
  2. Explore the NearMiss undersampling technique, available through the imblearn.under_sampling.NearMiss API. Which class of methods does it belong to? Apply the NearMiss method to the dataset that we used in the chapter.
  3. Try all the undersampling methods discussed in this chapter on the us_crime dataset from UCI. You can find this dataset in the fetch_datasets API of the imbalanced-learn library. Find the undersampling method with the highest f1-score metric for LogisticRegression and XGBoost models.
  4. Can you identify an undersampling method of your own? (Hint: think about combining the various approaches to undersampling in new ways.)

References

  1. X. He et al., “Practical Lessons from Predicting Clicks on Ads at Facebook,” in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, New York NY USA: ACM, Aug. 2014, pp. 1–9. doi: 10.1145/2648584.2648589.
  2. X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, and F. Sun, “Model Ensemble for Click Prediction in Bing Search Ads,” in Proceedings of the 26th International Conference on World Wide Web Companion - WWW ’17 Companion, Perth, Australia: ACM Press, 2017, pp. 689–698. doi: 10.1145/3041021.3054192.
  3. How Uber Optimizes the Timing of Push Notifications using ML and Linear Programming: https://www.uber.com/blog/how-uber-optimizes-push-notifications-using-ml/.
  4. A. D. Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, “Calibrating Probability with Undersampling for Unbalanced Classification,” in 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa...