Data-Level Deep Learning Methods

You learned about various sampling methods in the previous chapters. Collectively, we call these methods data-level methods in this book. These methods include random undersampling, random oversampling, NearMiss, and SMOTE. We also explored how these methods work with classical machine learning algorithms.

In this chapter, we’ll explore how to apply familiar sampling methods to deep learning models. Deep learning offers unique opportunities to enhance these methods further. We’ll delve into elegant techniques to combine deep learning with oversampling and undersampling. Additionally, we’ll learn how to implement various sampling methods with a basic neural network. We’ll also cover dynamic sampling, which involves adjusting the data sample across multiple training iterations, using varying balancing ratios for each iteration. Then, we will learn to use some data augmentation techniques for both images and text. We’...

Technical requirements

As in prior chapters, we will continue to use common libraries such as torch, torchvision, numpy, and scikit-learn. We will also use nlpaug for NLP-related functionality. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/main/chapter07. You can open the GitHub notebooks in Google Colab by clicking the Open in Colab icon at the top of the chapter’s notebook, or by launching Colab from https://colab.research.google.com and supplying the notebook’s GitHub URL.

Preparing the data

In this chapter, we are going to use the classic MNIST dataset, which contains 28 x 28-pixel images of handwritten digits. The task for the model is to take an image as input and identify the digit in it. We will use PyTorch, a popular deep learning library, to demonstrate the algorithms. Let’s prepare the data now.

The first step is to import the libraries. We will need NumPy (since we deal with numpy arrays), torchvision (to load the MNIST data), and the torch, random, and copy libraries.
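
A minimal sketch of that import block, using conventional aliases (the aliases themselves are our assumption, not something the chapter mandates):

import copy
import random

import numpy as np
import torch
import torchvision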

Next, we can download the MNIST data from torchvision.datasets. The torchvision library is a part of the PyTorch framework, which contains datasets, models, and common image transformers for computer vision tasks. The following code will download the MNIST dataset from this library:

img_transform = torchvision.transforms.ToTensor()
trainset = torchvision.datasets.MNIST(
    root='/tmp/mnist', train=True,
    download=True, transform=img_transform)
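
MNIST is roughly balanced across the ten digits, so the chapter works with an imbalanced variant of it. The following is a minimal sketch of one way such a subset could be built; the per-class counts are illustrative assumptions, not the chapter’s exact ratios:

# Keep progressively fewer examples of the higher digits to create imbalance.
# These per-class counts are illustrative, not the chapter's exact ratios.
targets = trainset.targets.numpy()
keep_per_class = {digit: 6000 // (digit + 1) for digit in range(10)}

keep_indices = []
for digit, n_keep in keep_per_class.items():
    digit_indices = np.where(targets == digit)[0]
    keep_indices.extend(digit_indices[:n_keep].tolist())

imbalanced_trainset = torch.utils.data.Subset(trainset, keep_indices)

With this scheme, digit 0 keeps nearly all of its roughly 6,000 training examples while digit 9 keeps only 600, giving an imbalance ratio of about 10:1.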

Sampling techniques for deep learning models

In this section, we’ll explore some sampling methods, such as random oversampling and weighted sampling, for deep learning models. We’ll then transition into data augmentation techniques, which bolster model robustness and mitigate dataset limitations. While large datasets are ideal for deep learning, real-world constraints often make them hard to obtain. We will also look at some advanced augmentations, such as CutMix and MixUp. We’ll start with standard methods before discussing these advanced techniques.
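
As a quick preview of the mixup idea discussed later in the chapter, here is a minimal sketch of how a batch of images and one-hot labels can be mixed; the function name and the Beta parameter alpha=0.2 are illustrative choices, not values prescribed by the chapter:

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix each image (and its label) with a randomly chosen partner from the batch."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient drawn from Beta(alpha, alpha)
    perm = torch.randperm(images.size(0))    # random pairing within the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels

The model is then trained on the mixed images with a loss that accepts soft labels, such as cross-entropy computed against the mixed label vectors.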

Random oversampling

Here, we will apply the plain old random oversampling we learned in Chapter 2, Oversampling Methods, but using image data as input to a neural network. The basic idea is to duplicate samples from the minority classes randomly until we end up with an equal number of samples from each class. This technique often performs better than no sampling.
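
A minimal sketch of this idea, reusing the imbalanced MNIST subset prepared earlier (the variable names follow those snippets and are otherwise assumptions): replicate each class’s indices, sampling with replacement, until every class matches the size of the largest class, and then build a DataLoader from the expanded index list.

subset_targets = targets[keep_indices]        # labels of the imbalanced subset
class_counts = np.bincount(subset_targets)
max_count = class_counts.max()

oversampled_indices = []
for digit in range(10):
    digit_indices = np.array(keep_indices)[subset_targets == digit]
    # Sample with replacement so every class ends up with max_count examples.
    oversampled_indices.extend(
        np.random.choice(digit_indices, size=max_count, replace=True).tolist())

oversampled_trainset = torch.utils.data.Subset(trainset, oversampled_indices)
train_loader = torch.utils.data.DataLoader(
    oversampled_trainset, batch_size=128, shuffle=True)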

Tip

Make sure to train the model for enough...

Data-level techniques for text classification

Data imbalance, wherein certain classes in a dataset are underrepresented, is not just an issue confined to image or structured data domains. In NLP, imbalanced datasets can lead to biased models that might perform well on the majority class but are likely to misclassify underrepresented ones. To address this challenge, numerous strategies have been devised.

In NLP, data augmentation can boost model performance, especially with limited training data. Table 7.3 categorizes the various data augmentation techniques for text data.

Discussion of other data-level deep learning methods and their key ideas

In addition to the methods previously discussed, there is a rich array of other techniques specifically designed to address imbalanced data challenges. This section provides a high-level overview of these alternative approaches, each offering unique insights and potential advantages. While we will only touch upon their key ideas, we encourage you to delve deeper into the literature and explore them further if you find these techniques intriguing.

Two-phase learning

Two-phase learning [16][17] is a technique designed to enhance the performance of minority classes in multi-class classification problems, without compromising the performance of majority classes. The process involves two training phases:

  1. In the first phase, a deep learning model is trained on a version of the dataset that has been balanced with respect to each class (see the sketch after this list). Balancing can be done using sampling techniques such as random oversampling or...
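
A minimal sketch of the two-phase flow is shown below. Here, model is whatever network is being trained, balanced_loader is a DataLoader over the balanced (for example, oversampled) data, and imbalanced_loader is a DataLoader over the original imbalanced data; treating the second phase as fine-tuning only the classifier head on the original data is one common variant of two-phase learning and is an assumption of this sketch.

import torch

def train(model, loader, epochs, lr):
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Phase 1: train on the class-balanced (for example, oversampled) data.
train(model, balanced_loader, epochs=5, lr=1e-3)

# Phase 2: freeze the feature extractor and fine-tune only the classifier head
# on the original, imbalanced data ("fc" as the head's name is illustrative).
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
train(model, imbalanced_loader, epochs=2, lr=1e-4)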

Summary

The transition of methods to handle data imbalance from classical machine learning models to deep learning models can pose unique challenges, primarily due to the distinct types of data that these models work with. Classical machine learning models typically deal with structured, tabular data, whereas deep learning models often grapple with unstructured data, such as images, text, audio, and video. This chapter explored how to adapt sampling techniques to work with deep learning models. To facilitate this, we used an imbalanced version of the MNIST dataset to train a model, which we then employed in conjunction with various oversampling methods.

Incorporating random oversampling with deep learning models involves duplicating samples from minority classes randomly, until each class has an equal number of samples. This is usually performed using APIs from libraries such as imbalanced-learn, Keras, TensorFlow, or PyTorch, which work together seamlessly for this purpose...

Questions

  1. Apply Mixup interpolation to the Kaggle spam detection NLP dataset used in the chapter. See if Mixup helps to improve the model performance. You can refer to the paper Augmenting Data with Mixup for Sentence Classification: An Empirical Study by Guo et al. (https://arxiv.org/pdf/1905.08941.pdf) for further reading.
  2. Refer to the FMix paper [21] and implement the FMix augmentation technique. Apply it to the Caltech101 dataset. See whether model performance improves by using FMix over the baseline model performance.
  3. Apply the EOS technique described in the chapter to the CIFAR-10-LT (the long-tailed version of CIFAR-10) dataset, and see whether the model performance improves for the most imbalanced classes.
  4. Apply the MDSA techniques we studied in this chapter to the CIFAR-10-LT dataset, and see whether the model performance improves for the most imbalanced classes.

References

  1. Samira Pouyanfar, Yudong Tao, Anup Mohan, Haiman Tian, Ahmed S. Kaseb, Kent Gauen, Ryan Dailey, Sarah Aghajanzadeh, Yung-Hsiang Lu, Shu-Ching Chen, and Mei-Ling Shyu. 2018. Dynamic Sampling in Convolutional Neural Networks for Imbalanced Data Classification. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 112–117, Miami, FL, April. IEEE.
  2. LeNet-5 paper, Gradient-Based Learning Applied to Document Recognition: http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf.
  3. AlexNet paper, ImageNet Classification with Deep Convolutional Neural Networks: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html.
  4. Leveraging Real-Time User Actions to Personalize Etsy Ads (2023): https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads.
  5. Automated image tagging at Booking.com (2017): https://booking.ai/automated-image-tagging-at-booking-com-7704f27dcc8b...
Table 7.3: Data augmentation techniques for text data

Level | Method | Description | Example techniques
Character level | Noise | Introducing randomness at the character level | Jumbling characters
Character level | Rule-based | ... | ...
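
To make the character-level noise row concrete, here is a minimal sketch using the nlpaug library listed in the technical requirements; the choice of RandomCharAug with the swap action, and the sample sentence, are illustrative assumptions rather than the chapter’s exact configuration:

import nlpaug.augmenter.char as nac

# Character-level noise: randomly swap characters within a few words.
char_aug = nac.RandomCharAug(action="swap")

text = "Congratulations! You have won a free ticket."
augmented = char_aug.augment(text)
print(augmented)

Depending on the nlpaug version, augment() returns either a single string or a list of augmented strings.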