Reader small image

You're reading from  Machine Learning for Imbalanced Data

Product typeBook
Published inNov 2023
Reading LevelBeginner
PublisherPackt
ISBN-139781801070836
Edition1st Edition
Languages
Concepts
Right arrow
Authors (2):
Kumar Abhishek
Kumar Abhishek
author image
Kumar Abhishek

Kumar Abhishek is a seasoned Senior Machine Learning Engineer at Expedia Group, US, specializing in risk analysis and fraud detection for Expedia brands. With over a decade of experience at companies such as Microsoft, Amazon, and a Bay Area startup, Kumar holds an MS in Computer Science from the University of Florida.
Read more about Kumar Abhishek

Dr. Mounir Abdelaziz
Dr. Mounir Abdelaziz
author image
Dr. Mounir Abdelaziz

Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.
Read more about Dr. Mounir Abdelaziz

View More author details
Right arrow

Why can imbalanced data be a challenge?

Let’s delve into the difficulties posed by imbalanced data on model predictions and their impact on model performance:

  • Failure of metrics such as accuracy: As we discussed previously, conventional metrics such as accuracy can be misleading in the context of imbalanced data (a 99% imbalanced dataset would still achieve 99% accuracy). Threshold-invariant metrics such as the PR curve or ROC curve attempt to expose the performance of the model over a wide range of thresholds. The real challenge lies in the disproportionate influence of the “true negative” cell in the confusion matrix. Metrics that focus less on “true negatives,” such as precision, recall, or F1 score, are more appropriate for evaluating model performance. It’s important to note that these metrics have a hidden hyperparameter – the classification threshold – that should not be ignored but optimized for real-world applications (refer to Chapter 5, Cost-Sensitive Learning, to learn more about threshold tuning).
  • Imbalanced data can be a challenge for a model’s loss function: This may happen because the loss function is typically designed to minimize the errors between the predicted outputs and the true labels of the training data. When the data is imbalanced, there are more instances of one class than another, and the model may become biased toward the majority class. We will discuss solutions to this issue in more detail in Chapter 5, Cost-Sensitive Learning, and Chapter 8, Algorithm-Level Deep Learning Techniques.
  • Different misclassification costs for different classes: Often, it may be more expensive to misclassify positive examples than to misclassify negative examples. We may have false positives that are more expensive than false negatives. For example, usually, the cost of misclassifying a patient with cancer as healthy (false negative) will be much higher than misclassifying a healthy patient as having cancer (false positive). Why? Because it’s much cheaper to go through some extra tests to revalidate the test results in the second case instead of detecting it much later in the first case. This is called the cost of misclassification, which could be different for the majority and minority classes, making things complicated for imbalanced datasets. We will discuss more about this in Chapter 5, Cost-Sensitive Learning.
  • Constraints on computational resources: In sectors such as finance, healthcare, and retail, handling big data is a common challenge. Training on these large datasets is not only time-consuming but also costly due to the computational power needed. In such scenarios, downsampling or undersampling the majority class becomes essential, as will be discussed in Chapter 3, Undersampling Methods. Additionally, acquiring more samples for the minority class can further increase dataset size and computational costs. Memory limitations may also restrict the amount of data that can be processed.
  • Not enough variation in the minority class examples to sufficiently represent its distribution: Often, an absolute number of samples of the minority class is not as big of a problem as the variation in the samples of the minority class. The dataset might look large, but there might not be many variations or varieties in the samples that adequately represent the distribution of minority classes. This can lead to the model not being able to learn the classification boundary properly, which would lead to poor performance of the model (Figure 1.10). This can often happen in computer vision problems, such as object detection, where we may have very few samples of certain classes. In such cases, data augmentation techniques (discussed in Chapter 7, Data-Level Deep Learning Methods) can help significantly:

Figure 1.10 – Change in decision boundary with a different distribution of minority class examples – the crosses denote the majority class, and the circles denote the minority class

  • Poor performance of uncalibrated models: Imbalanced data can be a challenge for uncalibrated models. Uncalibrated models are models that do not output well-calibrated probabilities, which means that the predicted probabilities may not reflect the true likelihood of the predicted classes:
    • When dealing with imbalanced data, uncalibrated models can be particularly susceptible to producing biased predictions toward the majority class as they may not be able to effectively differentiate between the minority and majority classes. This can lead to poor performance in the minority class, where the model may produce overly confident predictions or predictions that are too conservative.
    • For example, an uncalibrated model that is trained on imbalanced data may incorrectly classify instances that belong to the minority class as majority class examples, often with high confidence. This is because the model may not have learned to adjust its predictions based on the imbalance in the data and may not have a good understanding of the minority class examples.
    • To address this challenge, it is important to use well-calibrated models [4] that can output probabilities that reflect the true likelihood of the predicted classes. This can be achieved through techniques such as Platt scaling or isotonic regression, which can calibrate the predicted probabilities of an uncalibrated model to produce more accurate and reliable probabilities. Model calibration will be discussed in detail in Chapter 10, Model Calibration.
  • Poor performance of models because of non-adjusted thresholds: It’s important to use intelligent thresholding when making predictions using models trained on imbalanced datasets. Simply predicting 1 when the model probability is over 0.5 may not always be the best approach. Instead, we should consider other thresholds that may be more effective. This can be achieved by examining the PR curve of the model rather than relying solely on its success rate with a default probability threshold of 0.5. Threshold adjustment can be quite important, even for models trained on naturally or artificially balanced datasets. We will discuss threshold adjustment in detail in Chapter 5, Cost-Sensitive Learning.

Next, let’s try to see when we shouldn’t do anything about data imbalance.

Previous PageNext Page
You have been reading a chapter from
Machine Learning for Imbalanced Data
Published in: Nov 2023Publisher: PacktISBN-13: 9781801070836
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Kumar Abhishek

Kumar Abhishek is a seasoned Senior Machine Learning Engineer at Expedia Group, US, specializing in risk analysis and fraud detection for Expedia brands. With over a decade of experience at companies such as Microsoft, Amazon, and a Bay Area startup, Kumar holds an MS in Computer Science from the University of Florida.
Read more about Kumar Abhishek

author image
Dr. Mounir Abdelaziz

Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.
Read more about Dr. Mounir Abdelaziz