Types of dataset and splits

Typically, we train our model on the training set and evaluate it on an independent, unseen dataset called the test set. We do this to get a fair evaluation of the model. If we instead trained the model on the full dataset and evaluated it on that same data, we would have no idea how well it performs on unseen data, and the model would likely be overfitted.

We may encounter three kinds of datasets in machine learning:

  • Training set: A dataset on which the model is trained.
  • Validation set: A dataset used for tuning the hyperparameters of the model. A validation set is often referred to as a development set.
  • Evaluation set or test set: A dataset used for evaluating the performance of the model.

When working with small example datasets, it’s common to allocate 80% of the data for the training set, 10% for the validation set, and 10% for the test set. However, the specific ratio between training and test sets is not as important as ensuring that the test set is large enough to provide statistically meaningful evaluation results. In the context of big data, a split of 98%, 1%, and 1% for training, validation, and test sets, respectively, could be appropriate.
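As a minimal sketch of such a split using scikit-learn (the synthetic dataset, the 80/10/10 ratios, and the random_state values are illustrative assumptions), we can call train_test_split twice:

```python
# A minimal sketch of an 80/10/10 train/validation/test split.
# The synthetic dataset and random_state values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First hold out 20% of the data for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split that 20% evenly into a validation set (10%) and a test set (10%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```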

Often, people don’t keep a dedicated validation set for hyperparameter tuning and simply refer to the test set as the evaluation set. This can happen when hyperparameter tuning is not part of the regular training cycle and is a one-off activity.

Cross-validation

The term cross-validation can be confusing at first glance. Breaking it down into cross + validation, it is validation performed across (cross) something; that something, in our case, is the held-out test portion of the data, which rotates across different subsets.

Let’s see what cross-validation is:

  • Cross-validation is a technique that’s used to estimate how accurately a model will perform in practice
  • It is typically used to detect overfitting – that is, a model that fails to generalize beyond the data it was trained on – particularly when the amount of data is limited

Let’s look at the different types of cross-validation:

  • Holdout: In the holdout method, we randomly assign data points to two sets, usually called the training set and the test set, respectively. We then train (build a model) on the training set and test (evaluate its performance) on the test set.
  • k-fold: This works as follows:
    • We randomly shuffle the data.
    • We divide all the data into k parts, also known as folds. We train the model on k-1 folds and evaluate it on the remaining fold. We record the performance of this model using our chosen model evaluation metric, then discard this model.
    • We repeat this process k times, each time holding out a different subset for testing. We take an average of the evaluation metric values (for example, accuracy) from all the previous models. This average represents the overall performance measure of the model.

k-fold cross-validation is mainly used when you have limited data points, say 100 points. Using 5 or 10 folds is most common in practice.
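The following is a minimal sketch of 5-fold cross-validation with scikit-learn; the synthetic 100-point dataset and the logistic regression classifier are illustrative assumptions, not the book's own example:

```python
# A minimal sketch of 5-fold cross-validation on a small (100-point) dataset.
# The dataset and classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=42)

# Shuffle the data and split it into k=5 folds; each fold is held out once
# for evaluation while the model is trained on the remaining 4 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(scores)         # accuracy on each held-out fold
print(scores.mean())  # average accuracy, the overall performance estimate
```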

Let’s look at the common evaluation metrics in machine learning, with a special focus on the ones relevant to problems with imbalanced data.
