Data-Level Deep Learning Methods

You learned about various sampling methods in the previous chapters. Collectively, we call these methods data-level methods in this book. These methods include random undersampling, random oversampling, NearMiss, and SMOTE. We also explored how these methods work with classical machine learning algorithms.

In this chapter, we’ll explore how to apply familiar sampling methods to deep learning models. Deep learning offers unique opportunities to enhance these methods further. We’ll delve into elegant techniques to combine deep learning with oversampling and undersampling. Additionally, we’ll learn how to implement various sampling methods with a basic neural network. We’ll also cover dynamic sampling, which involves adjusting the data sample across multiple training iterations, using varying balancing ratios for each iteration. Then, we will learn to use some data augmentation techniques for both images and text. We’...

Technical requirements

As in prior chapters, we will continue to use common libraries such as torch, torchvision, numpy, and scikit-learn. We will also use nlpaug for NLP-related functionality. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/main/chapter07. You can open the GitHub notebooks in Google Colab by clicking the Open in Colab icon at the top of the chapter’s notebook, or by launching Colab from https://colab.research.google.com and supplying the notebook’s GitHub URL.

Preparing the data

In this chapter, we are going to use the classic MNIST dataset, which contains 28 x 28-pixel images of handwritten digits. The task for the model is to take an image as input and identify the digit in it. We will use PyTorch, a popular deep learning library, to demonstrate the algorithms. Let’s prepare the data now.

The first step is to import the libraries. We will need NumPy (since we deal with numpy arrays), torchvision (to load the MNIST data), and the torch, random, and copy libraries.
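
A minimal sketch of that import block, using conventional aliases (the aliases themselves are our assumption, not something the chapter mandates):

import copy
import random

import numpy as np
import torch
import torchvision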

Next, we can download the MNIST data from torchvision.datasets. The torchvision library is a part of the PyTorch framework, which contains datasets, models, and common image transformers for computer vision tasks. The following code will download the MNIST dataset from this library:

img_transform = torchvision.transforms.ToTensor()
trainset = torchvision.datasets.MNIST(
    root='/tmp/mnist', train=True,
    download=True, transform=img_transform)
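
MNIST is roughly balanced across the ten digits, so the chapter works with an imbalanced variant of it. The following is a minimal sketch of one way such a subset could be built; the per-class counts are illustrative assumptions, not the chapter’s exact ratios:

# Keep progressively fewer examples of the higher digits to create imbalance.
# These per-class counts are illustrative, not the chapter's exact ratios.
targets = trainset.targets.numpy()
keep_per_class = {digit: 6000 // (digit + 1) for digit in range(10)}

keep_indices = []
for digit, n_keep in keep_per_class.items():
    digit_indices = np.where(targets == digit)[0]
    keep_indices.extend(digit_indices[:n_keep].tolist())

imbalanced_trainset = torch.utils.data.Subset(trainset, keep_indices)

With this scheme, digit 0 keeps nearly all of its roughly 6,000 training examples while digit 9 keeps only 600, giving an imbalance ratio of about 10:1.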

Sampling techniques for deep learning models

In this section, we’ll explore some sampling methods, such as random oversampling and weighted sampling, for deep learning models. We’ll then transition into data augmentation techniques, which bolster model robustness and mitigate dataset limitations. While large datasets are ideal for deep learning, real-world constraints often make them hard to obtain. We will also look at some advanced augmentations, such as CutMix and MixUp. We’ll start with standard methods before discussing these advanced techniques.
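
As a quick preview of the mixup idea discussed later in the chapter, here is a minimal sketch of how a batch of images and one-hot labels can be mixed; the function name and the Beta parameter alpha=0.2 are illustrative choices, not values prescribed by the chapter:

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix each image (and its label) with a randomly chosen partner from the batch."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient drawn from Beta(alpha, alpha)
    perm = torch.randperm(images.size(0))    # random pairing within the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels

The model is then trained on the mixed images with a loss that accepts soft labels, such as cross-entropy computed against the mixed label vectors.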

Random oversampling

Here, we will apply the plain old random oversampling we learned in Chapter 2, Oversampling Methods, but using image data as input to a neural network. The basic idea is to duplicate samples from the minority classes randomly until we end up with an equal number of samples from each class. This technique often performs better than no sampling.
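
A minimal sketch of this idea, reusing the imbalanced MNIST subset prepared earlier (the variable names follow those snippets and are otherwise assumptions): replicate each class’s indices, sampling with replacement, until every class matches the size of the largest class, and then build a DataLoader from the expanded index list.

subset_targets = targets[keep_indices]        # labels of the imbalanced subset
class_counts = np.bincount(subset_targets)
max_count = class_counts.max()

oversampled_indices = []
for digit in range(10):
    digit_indices = np.array(keep_indices)[subset_targets == digit]
    # Sample with replacement so every class ends up with max_count examples.
    oversampled_indices.extend(
        np.random.choice(digit_indices, size=max_count, replace=True).tolist())

oversampled_trainset = torch.utils.data.Subset(trainset, oversampled_indices)
train_loader = torch.utils.data.DataLoader(
    oversampled_trainset, batch_size=128, shuffle=True)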

Tip

Make sure to train the model for enough...

Data-level techniques for text classification

Data imbalance, wherein certain classes in a dataset are underrepresented, is not just an issue confined to image or structured data domains. In NLP, imbalanced datasets can lead to biased models that might perform well on the majority class but are likely to misclassify underrepresented ones. To address this challenge, numerous strategies have been devised.

In NLP, data augmentation can boost model performance, especially with limited training data. Table 7.3 categorizes the various data augmentation techniques for text data.

Discussion of other data-level deep learning methods and their key ideas

In addition to the methods previously discussed, there is a rich array of other techniques specifically designed to address imbalanced data challenges. This section provides a high-level overview of these alternative approaches, each offering unique insights and potential advantages. While we will only touch upon their key ideas, we encourage you to delve deeper into the literature and explore them further if you find these techniques intriguing.

Two-phase learning

Two-phase learning [16][17] is a technique designed to enhance the performance of minority classes in multi-class classification problems, without compromising the performance of majority classes. The process involves two training phases:

  1. In the first phase, a deep learning model is trained on a version of the dataset that has been balanced with respect to each class (see the sketch after this list). Balancing can be done using sampling techniques such as random oversampling or...
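
A minimal sketch of the two-phase flow is shown below. Here, model is whatever network is being trained, balanced_loader is a DataLoader over the balanced (for example, oversampled) data, and imbalanced_loader is a DataLoader over the original imbalanced data; treating the second phase as fine-tuning only the classifier head on the original data is one common variant of two-phase learning and is an assumption of this sketch.

import torch

def train(model, loader, epochs, lr):
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Phase 1: train on the class-balanced (for example, oversampled) data.
train(model, balanced_loader, epochs=5, lr=1e-3)

# Phase 2: freeze the feature extractor and fine-tune only the classifier head
# on the original, imbalanced data ("fc" as the head's name is illustrative).
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
train(model, imbalanced_loader, epochs=2, lr=1e-4)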

Summary

The transition of methods to handle data imbalance from classical machine learning models to deep learning models can pose unique challenges, primarily due to the distinct types of data that these models work with. Classical machine learning models typically deal with structured, tabular data, whereas deep learning models often grapple with unstructured data, such as images, text, audio, and video. This chapter explored how to adapt sampling techniques to work with deep learning models. To facilitate this, we used an imbalanced version of the MNIST dataset to train a model, which we then employed in conjunction with various oversampling methods.

Incorporating random oversampling with deep learning models involves duplicating samples from minority classes randomly, until each class has an equal number of samples. This is usually performed using APIs from libraries such as imbalanced-learn, Keras, TensorFlow, or PyTorch, which work together seamlessly for this purpose...

Questions

  1. Apply Mixup interpolation to the Kaggle spam detection NLP dataset used in the chapter. See if Mixup helps to improve the model performance. You can refer to the paper Augmenting Data with Mixup for Sentence Classification: An Empirical Study by Guo et al. (https://arxiv.org/pdf/1905.08941.pdf) for further reading.
  2. Refer to the FMix paper [21] and implement the FMix augmentation technique. Apply it to the Caltech101 dataset. See whether model performance improves by using FMix over the baseline model performance.
  3. Apply the EOS technique described in the chapter to the CIFAR-10-LT (the long-tailed version of CIFAR-10) dataset, and see whether the model performance improves for the most imbalanced classes.
  4. Apply the MDSA techniques we studied in this chapter to the CIFAR-10-LT dataset, and see whether the model performance improves for the most imbalanced classes.

References

  1. Samira Pouyanfar, Yudong Tao, Anup Mohan, Haiman Tian, Ahmed S. Kaseb, Kent Gauen, Ryan Dailey, Sarah Aghajanzadeh, Yung-Hsiang Lu, Shu-Ching Chen, and Mei-Ling Shyu. 2018. Dynamic Sampling in Convolutional Neural Networks for Imbalanced Data Classification. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 112–117, Miami, FL, April. IEEE.
  2. LeNet-5 paper, Gradient-Based Learning Applied to Document Recognition: http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf.
  3. AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html.
  4. Leveraging Real-Time User Actions to Personalize Etsy Ads (2023): https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads.
  5. Automated image tagging at Booking.com (2017): https://booking.ai/automated-image-tagging-at-booking-com-7704f27dcc8b...
Table 7.3: Data augmentation techniques for text data

Level | Method | Description | Example techniques
Character level | Noise | Introducing randomness at the character level | Jumbling characters
Character level | Rule-based | ... | ...
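
To make the character-level noise row concrete, here is a minimal sketch using the nlpaug library listed in the technical requirements; the choice of RandomCharAug with the swap action, and the sample sentence, are illustrative assumptions rather than the chapter’s exact configuration:

import nlpaug.augmenter.char as nac

# Character-level noise: randomly swap characters within a few words.
char_aug = nac.RandomCharAug(action="swap")

text = "Congratulations! You have won a free ticket."
augmented = char_aug.augment(text)
print(augmented)

Depending on the nlpaug version, augment() returns either a single string or a list of augmented strings.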