Types of datasets and splits
Typically, we train our model on a training set and evaluate it on an independent, unseen dataset called the test set. This gives a fair assessment of the model: if we instead trained on the full dataset and evaluated on that same data, we would not know how well the model performs on unseen data, and we would likely overestimate its quality because of overfitting.
We may encounter three kinds of datasets in machine learning:
- Training set: A dataset on which the model is trained.
- Validation set: A dataset used for tuning the hyperparameters of the model. A validation set is often referred to as a development set.
- Evaluation set or test set: A dataset used for evaluating the performance of the model.
When working with small example datasets, it’s common to allocate 80% of the data for the training set, 10% for the validation set, and 10% for the test set. However, the specific ratio between training and test sets is not as important as ensuring that the test set is large enough to provide statistically meaningful evaluation results. In the context of big data, a split of 98%, 1%, and 1% for training, validation, and test sets, respectively, could be appropriate.
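The 80/10/10 split described above can be sketched with scikit-learn's `train_test_split` (a common but not mandatory choice of library). The toy arrays here are purely illustrative; the trick is to carve off 20% first and then split that holdout in half:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 4 features each (illustrative only)
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First set aside 20% of the data, then split that 20% in half,
# yielding an 80/10/10 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```

For a big-data 98/1/1 split, the same two-step approach works with `test_size=0.02` in the first call.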
Often, people don’t keep a dedicated validation set for hyperparameter tuning and simply refer to the test set as the evaluation set. This can happen when hyperparameter tuning is a one-off activity rather than part of the regular training cycle.
Cross-validation
Cross-validation is a term whose meaning is hard to guess from the name alone. Breaking it down, cross + validation suggests validation performed across something – in our case, across different subsets of the data that take turns serving as the test set.
Let’s see what cross-validation is:
- Cross-validation is a technique that’s used to estimate how accurately a model will perform in practice
- It is typically used to detect overfitting – that is, failing to generalize patterns in data, particularly when the amount of data may be limited
Let’s look at the different types of cross-validation:
- Holdout: In the holdout method, we randomly assign data points to two sets, usually called the training set and the test set, respectively. We then train (build a model) on the training set and test (evaluate its performance) on the test set.
- k-fold: This works as follows:
  - We randomly shuffle the data.
  - We divide all the data into k parts, also known as folds. We train the model on k-1 folds and evaluate it on the remaining fold. We record the performance of this model using our chosen model evaluation metric, then discard the model.
  - We repeat this process k times, each time holding out a different fold for testing. We then take the average of the evaluation metric values (for example, accuracy) across all k models; this average represents the overall performance of the model.
k-fold cross-validation is mainly used when you have limited data points – say, 100. Using 5 or 10 folds is most common.
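The k-fold procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score` (one reasonable implementation, not the only one). The synthetic dataset and the logistic regression model are stand-ins for whatever data and estimator you are working with:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (~100 points), the regime where k-fold CV helps most
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# shuffle=True performs the random-shuffling step; with 5 folds, each of the
# 5 models trains on 80 points and is evaluated on the remaining 20.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

# The mean of the per-fold accuracies is the overall performance estimate
print(scores)
print(scores.mean())
```

Each entry of `scores` is the accuracy recorded on one held-out fold; their average is the figure you would report.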
Let’s look at the common evaluation metrics in machine learning, with a special focus on the ones relevant to problems with imbalanced data.