You're reading from Machine Learning for Imbalanced Data

Product typeBook

Published inNov 2023

Reading LevelBeginner

PublisherPackt

ISBN-139781801070836

Edition1st Edition

Languages

Rust

Tools

TensorFlow Lite

Concepts

Data Science

Authors (2):

Kumar Abhishek

Dr. Mounir Abdelaziz

View More author details

Why can imbalanced data be a challenge?

Let’s delve into the difficulties posed by imbalanced data on model predictions and their impact on model performance:

Failure of metrics such as accuracy: As we discussed previously, conventional metrics such as accuracy can be misleading in the context of imbalanced data (a 99% imbalanced dataset would still achieve 99% accuracy). Threshold-invariant metrics such as the PR curve or ROC curve attempt to expose the performance of the model over a wide range of thresholds. The real challenge lies in the disproportionate influence of the “true negative” cell in the confusion matrix. Metrics that focus less on “true negatives,” such as precision, recall, or F1 score, are more appropriate for evaluating model performance. It’s important to note that these metrics have a hidden hyperparameter – the classification threshold – that should not be ignored but optimized for real-world applications (refer to Chapter 5, Cost-Sensitive Learning, to learn more about threshold tuning).
Imbalanced data can be a challenge for a model’s loss function: This may happen because the loss function is typically designed to minimize the errors between the predicted outputs and the true labels of the training data. When the data is imbalanced, there are more instances of one class than another, and the model may become biased toward the majority class. We will discuss solutions to this issue in more detail in Chapter 5, Cost-Sensitive Learning, and Chapter 8, Algorithm-Level Deep Learning Techniques.
Different misclassification costs for different classes: Often, it may be more expensive to misclassify positive examples than to misclassify negative examples. We may have false positives that are more expensive than false negatives. For example, usually, the cost of misclassifying a patient with cancer as healthy (false negative) will be much higher than misclassifying a healthy patient as having cancer (false positive). Why? Because it’s much cheaper to go through some extra tests to revalidate the test results in the second case instead of detecting it much later in the first case. This is called the cost of misclassification, which could be different for the majority and minority classes, making things complicated for imbalanced datasets. We will discuss more about this in Chapter 5, Cost-Sensitive Learning.
Constraints on computational resources: In sectors such as finance, healthcare, and retail, handling big data is a common challenge. Training on these large datasets is not only time-consuming but also costly due to the computational power needed. In such scenarios, downsampling or undersampling the majority class becomes essential, as will be discussed in Chapter 3, Undersampling Methods. Additionally, acquiring more samples for the minority class can further increase dataset size and computational costs. Memory limitations may also restrict the amount of data that can be processed.
Not enough variation in the minority class examples to sufficiently represent its distribution: Often, an absolute number of samples of the minority class is not as big of a problem as the variation in the samples of the minority class. The dataset might look large, but there might not be many variations or varieties in the samples that adequately represent the distribution of minority classes. This can lead to the model not being able to learn the classification boundary properly, which would lead to poor performance of the model (Figure 1.10). This can often happen in computer vision problems, such as object detection, where we may have very few samples of certain classes. In such cases, data augmentation techniques (discussed in Chapter 7, Data-Level Deep Learning Methods) can help significantly:

Figure 1.10 – Change in decision boundary with a different distribution of minority class examples – the crosses denote the majority class, and the circles denote the minority class

Poor performance of uncalibrated models: Imbalanced data can be a challenge for uncalibrated models. Uncalibrated models are models that do not output well-calibrated probabilities, which means that the predicted probabilities may not reflect the true likelihood of the predicted classes:
- When dealing with imbalanced data, uncalibrated models can be particularly susceptible to producing biased predictions toward the majority class as they may not be able to effectively differentiate between the minority and majority classes. This can lead to poor performance in the minority class, where the model may produce overly confident predictions or predictions that are too conservative.
- For example, an uncalibrated model that is trained on imbalanced data may incorrectly classify instances that belong to the minority class as majority class examples, often with high confidence. This is because the model may not have learned to adjust its predictions based on the imbalance in the data and may not have a good understanding of the minority class examples.
- To address this challenge, it is important to use well-calibrated models [4] that can output probabilities that reflect the true likelihood of the predicted classes. This can be achieved through techniques such as Platt scaling or isotonic regression, which can calibrate the predicted probabilities of an uncalibrated model to produce more accurate and reliable probabilities. Model calibration will be discussed in detail in Chapter 10, Model Calibration.
Poor performance of models because of non-adjusted thresholds: It’s important to use intelligent thresholding when making predictions using models trained on imbalanced datasets. Simply predicting 1 when the model probability is over 0.5 may not always be the best approach. Instead, we should consider other thresholds that may be more effective. This can be achieved by examining the PR curve of the model rather than relying solely on its success rate with a default probability threshold of 0.5. Threshold adjustment can be quite important, even for models trained on naturally or artificially balanced datasets. We will discuss threshold adjustment in detail in Chapter 5, Cost-Sensitive Learning.

Next, let’s try to see when we shouldn’t do anything about data imbalance.

You have been reading a chapter from

Machine Learning for Imbalanced Data

Published in: Nov 2023Publisher: PacktISBN-13: 9781801070836

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Kumar Abhishek

Kumar Abhishek is a seasoned Senior Machine Learning Engineer at Expedia Group, US, specializing in risk analysis and fraud detection for Expedia brands. With over a decade of experience at companies such as Microsoft, Amazon, and a Bay Area startup, Kumar holds an MS in Computer Science from the University of Florida.
Read more about Kumar Abhishek

Dr. Mounir Abdelaziz

Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.
Read more about Dr. Mounir Abdelaziz

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages