Machine Learning for Imbalanced Data

Product type Book
Published in Nov 2023
Publisher Packt
ISBN-13 9781801070836
Pages 344 pages
Edition 1st Edition
Authors (2): Kumar Abhishek, Dr. Mounir Abdelaziz

Table of Contents (15 chapters)

  • Preface
  • Chapter 1: Introduction to Data Imbalance in Machine Learning
  • Chapter 2: Oversampling Methods
  • Chapter 3: Undersampling Methods
  • Chapter 4: Ensemble Methods
  • Chapter 5: Cost-Sensitive Learning
  • Chapter 6: Data Imbalance in Deep Learning
  • Chapter 7: Data-Level Deep Learning Methods
  • Chapter 8: Algorithm-Level Deep Learning Techniques
  • Chapter 9: Hybrid Deep Learning Methods
  • Chapter 10: Model Calibration Assessments
  • Index
  • Other Books You May Enjoy
  • Appendix: Machine Learning Pipeline in Production

Introduction to imbalanced datasets

Machine learning algorithms learn from collections of examples called datasets. Each dataset contains multiple data points, which we refer to interchangeably as examples, samples, or instances throughout this book.

A dataset can be said to have a balanced distribution when all the target classes have a similar number of examples, as shown in Figure 1.1:

Figure 1.1 – Balanced distribution with an almost equal number of examples for each class

Imbalanced datasets, also called skewed datasets, are those in which some target classes (also called labels) have far more examples than the rest (Figure 1.2). Although imbalance is most often discussed for classification problems (for example, fraud detection), it also occurs in regression problems (for example, house price prediction):

Figure 1.2 – An imbalanced dataset with five classes and a varying number of samples
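To check whether a dataset is balanced or skewed in the sense described above, it can help to count the examples per class. The following is a minimal sketch in plain Python. The helper names `class_distribution` and `imbalance_ratio` and the toy labels are hypothetical and not taken from this book, and the largest-to-smallest count ratio is only one informal way to summarize skew:

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, fraction of all examples)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count.

    A value of 1.0 means every class has the same number of examples.
    Raises ValueError if `labels` is empty.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy label lists, for illustration only.
balanced = ["a"] * 50 + ["b"] * 50
skewed = ["legit"] * 990 + ["fraud"] * 10

print(class_distribution(balanced))
print(class_distribution(skewed))
print(imbalance_ratio(balanced), imbalance_ratio(skewed))
```

A ratio close to 1 corresponds to the balanced case in Figure 1.1, while a large ratio indicates that at least one class is heavily outnumbered.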

We call the class with more instances the “majority” class and the one with fewer instances the “minority” class. Because our main interest usually lies in the minority class, we often refer to it as the “positive” class and to the majority class as the “negative” class:

Figure 1.3 – A visual guide to common terminology used in imbalanced classification
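The majority/negative and minority/positive convention can be applied mechanically when preparing binary labels. The sketch below is one possible way to do it in plain Python; the function name `encode_minority_as_positive` and the sample labels are hypothetical, and the encoding of positive as 1 and negative as 0 is an assumption made for illustration:

```python
from collections import Counter


def encode_minority_as_positive(labels):
    """Map the less frequent of two classes to 1 (positive) and the other to 0.

    Returns (encoded_labels, minority_class, majority_class).
    If both classes are equally frequent, the class seen first is
    treated as the majority.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common() sorts by count, largest first.
    (majority, _), (minority, _) = counts.most_common()
    encoded = [1 if y == minority else 0 for y in labels]
    return encoded, minority, majority


# Hypothetical labels: 8 legitimate transactions and 2 fraudulent ones.
labels = ["legit"] * 8 + ["fraud"] * 2
encoded, minority, majority = encode_minority_as_positive(labels)
print(minority, majority, encoded)
```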

This terminology extends to more than two classes, in which case the problem is called multi-class classification. In the first half of this book, we focus only on binary classification to keep the material easier to grasp; the concepts extend to multi-class classification relatively easily.

Let’s look at a few examples of imbalanced datasets:

  • Fraud detection, where fraudulent transactions must be detected among a much larger number of legitimate ones. This problem is common in the finance, healthcare, and e-commerce industries.
  • Network intrusion detection, where machine learning is used to analyze large volumes of network traffic data to detect and prevent unauthorized access to and misuse of computer systems.
  • Cancer detection, where machine learning is used to analyze medical data to identify potential cases of cancer earlier and improve treatment outcomes. Cancer is not rare, but early identification still matters.

In this book, we focus on the class imbalance problem in general and look at various solutions for cases where class imbalance hurts the performance of our model. A typical symptom is that a model performs poorly on the minority classes, for which it has seen very few examples during training.
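One way to see why poor minority-class performance can go unnoticed is to score a trivial model that always predicts the majority class. The sketch below is a hypothetical illustration in plain Python, not an experiment from this book; the `accuracy` and `recall` helpers and the 990-to-10 class split are assumptions chosen for the example:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("no positive examples in y_true")
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Hypothetical labels: 990 negative (majority) and 10 positive (minority).
y_true = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("minority recall:", recall(y_true, y_pred))
```

Here the overall accuracy is high even though no minority example is detected, which is why performance on the minority class is worth examining separately from aggregate accuracy.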
