Machine Learning for Imbalanced Data

Product type Book
Published in Nov 2023
Publisher Packt
ISBN-13 9781801070836
Pages 344 pages
Edition 1st Edition
Authors (2): Kumar Abhishek, Dr. Mounir Abdelaziz

Table of Contents (15 chapters)

  • Preface
  • Chapter 1: Introduction to Data Imbalance in Machine Learning
  • Chapter 2: Oversampling Methods
  • Chapter 3: Undersampling Methods
  • Chapter 4: Ensemble Methods
  • Chapter 5: Cost-Sensitive Learning
  • Chapter 6: Data Imbalance in Deep Learning
  • Chapter 7: Data-Level Deep Learning Methods
  • Chapter 8: Algorithm-Level Deep Learning Techniques
  • Chapter 9: Hybrid Deep Learning Methods
  • Chapter 10: Model Calibration
  • Assessments
  • Index
  • Other Books You May Enjoy
  • Appendix: Machine Learning Pipeline in Production

Ensemble Methods

Think of a top executive at a major company. Throughout the day, they must make numerous critical decisions. How do they make those choices? Not alone, but by consulting their advisors.

Let’s say that an executive consults five different advisors from different departments, each proposing a slightly different solution based on their expertise, skills, and domain knowledge. To make the most effective decision, the executive combines the insights and opinions of all five advisors to create a hybrid solution that incorporates the best parts of each proposal. This scenario illustrates the concept of ensemble methods, where multiple weak classifiers are combined to create a stronger and more accurate classifier. By combining different approaches, ensemble methods can often achieve better performance than relying on a single classifier.

We can create a strong model through ensemble methods by combining the results...

Technical requirements

The Python notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/master/chapter04. As usual, you can open the GitHub notebook using Google Colab by clicking on the Open in Colab icon at the top of this chapter’s notebook or by launching it from https://colab.research.google.com using the GitHub URL of the notebook.

In this chapter, we will continue to use a synthetic dataset generated using the make_classification API, just as we did in the previous chapters. Toward the end of the chapter, we will test these methods on some real datasets. Our full dataset contains 90,000 examples with a 1:99 imbalance ratio. Here is what the training dataset looks like:

Figure 4.2 – Plot of a dataset with a 1:99 imbalance ratio
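A dataset like the one plotted above can be generated along these lines. The exact parameter values here are assumptions for illustration; see the chapter notebook for the actual settings:

```python
from sklearn.datasets import make_classification

# ~1:99 imbalance: roughly 1% minority class out of 90,000 examples.
# flip_y=0 disables label noise so the class ratio stays exact.
X, y = make_classification(
    n_samples=90_000,
    n_features=2, n_informative=2, n_redundant=0,  # 2D, so it can be plotted
    weights=[0.99],       # 99% majority class, 1% minority class
    class_sep=0.5,        # moderate separation between the classes
    flip_y=0,
    random_state=42,
)
print(X.shape)            # (90000, 2)
print(int(y.sum()))       # minority-class count, roughly 900
```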

With our imbalanced dataset ready to use, let’s look at the first ensembling method, called bagging...

Bagging techniques for imbalanced data

Imagine a business executive with thousands of confidential files regarding an important merger or acquisition. The analysts assigned to the case don't have enough time to review all the files individually. Instead, each analyst randomly selects some files from the set and reviews them. Later, they combine their insights in a meeting to draw conclusions.

This scenario is a metaphor for a process in machine learning called bagging [1], which is short for bootstrap aggregating. In bagging, much like the analysts in the previous scenario, we create several subsets of the original dataset, train a weak learner on each subset, and then aggregate their predictions.

Why use weak learners instead of strong learners? The rationale applies to both bagging and boosting methods (discussed later in this chapter). There are several reasons:

  • Speed: Weak learners are computationally efficient and inexpensive to execute.
  • Diversity: Weak learners are...
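As a minimal sketch of plain bagging with scikit-learn (decision trees are the default weak learner here; the dataset and parameter values are illustrative assumptions, and note that imbalanced-learn's BalancedBaggingClassifier additionally resamples each bootstrap sample):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Small imbalanced toy dataset (illustrative parameters only)
X, y = make_classification(n_samples=5_000, weights=[0.95],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 50 weak learners, each fit on a bootstrap sample of the training set;
# the final prediction aggregates their individual votes
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_tr, y_tr)
print("F1 on the minority class:", f1_score(y_te, bag.predict(X_te)))
```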

Boosting techniques for imbalanced data

Imagine two friends studying together to solve a mathematics assignment. The first student is strong in most topics but weak in two: complex numbers and triangles. So, the first student asks the second student to spend more time on those two topics. Then, while solving the assignment, they combine their answers. Since the first student knows most of the topics well, they decide to give more weight to that student's answers to the assignment questions. What these two students are doing is the key idea behind boosting.

In bagging, we noticed that we could train all the classifiers in parallel. These classifiers are trained on a subset of the data, and all of them have an equal say at the time of prediction.

In boosting, the classifiers are trained one after the other. While every classifier learns from the whole dataset, the points in it are assigned different weights based on how difficult they are to classify. Classifiers are also assigned...
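This sequential reweighting is what AdaBoost implements. A minimal scikit-learn sketch, with illustrative dataset and parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each stage upweights the examples that earlier stages misclassified,
# and each stage's vote is weighted by how accurate that stage is.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```

The per-stage vote weights are exposed as `ada.estimator_weights_`, which makes the "classifiers are also assigned weights" idea directly inspectable.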

Ensemble of ensembles

Can we combine boosting and bagging? As we saw earlier, in bagging, we create multiple subsets of data and then train classifiers on those datasets. We can treat AdaBoost as a classifier while doing bagging. The process is simple: first, we create the bags and then train different AdaBoost classifiers on each bag. Here, AdaBoost is an ensemble in itself. Thus, these models are called an ensemble of ensembles.

On top of having an ensemble of ensembles, we can also do undersampling (or oversampling) at the time of bagging. This gives us the benefits of bagging, boosting, and random undersampling (or oversampling) in a single model. We will discuss one such algorithm in this section, called EasyEnsemble. Since random undersampling adds little overhead, its training time is similar to that of any other algorithm with the same number of weak classifiers.

EasyEnsemble

The EasyEnsemble algorithm [8] generates balanced datasets from...
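imbalanced-learn packages this algorithm as EasyEnsembleClassifier; the following is a simplified, scikit-learn-only illustration of the idea, with all parameter values chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=4_000, weights=[0.95],
                           flip_y=0, random_state=0)
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Each "bag" pairs the full minority class with a random, equally sized
# undersample of the majority class, then fits AdaBoost on the balanced bag.
models = []
for _ in range(10):
    sampled = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, sampled])
    models.append(AdaBoostClassifier(n_estimators=20, random_state=0)
                  .fit(X[idx], y[idx]))

# Aggregate by averaging predicted probabilities across the boosted models
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
preds = (proba >= 0.5).astype(int)
```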

Model performance comparison

The effectiveness of the techniques we've discussed so far can depend heavily on the dataset they are applied to. In this section, we will conduct a comparative analysis of these techniques, using a logistic regression model as the baseline. For the complete implementation, please consult the accompanying notebook on GitHub.

The analysis spans four distinct datasets, each with its own characteristics and challenges:

  • Synthetic data with Sep: 0.5: A simulated dataset with moderate separation between classes, serving as a baseline to understand algorithm performance in simplified conditions.
  • Synthetic data with Sep: 0.9: Another synthetic dataset, but with a higher degree of separation, allowing us to examine how algorithms perform as class distinguishability improves.
  • Thyroid sick dataset: A real-world dataset (available to import...
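A comparison loop along these lines reproduces the structure of the analysis. The models, metric, and parameter values here are illustrative stand-ins, not the book's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with moderate class separation (Sep: 0.5)
X, y = make_classification(n_samples=10_000, weights=[0.99], class_sep=0.5,
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Average precision degrades gracefully under heavy class imbalance,
    # unlike plain accuracy
    scores[name] = average_precision_score(
        y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AP = {scores[name]:.3f}")
```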

Summary

Ensemble methods in machine learning create strong classifiers by combining results from multiple weak classifiers using approaches such as bagging and boosting. However, these methods assume balanced data and may struggle with imbalanced datasets. Combining ensemble methods with sampling methods such as oversampling and undersampling leads to techniques such as UnderBagging, OverBagging, and SMOTEBagging, all of which can help address imbalanced data issues.

Ensembles of ensembles, such as EasyEnsemble, combine boosting and bagging techniques to create powerful classifiers for imbalanced datasets.

Ensemble-based imbalance learning techniques can be an excellent addition to your toolkit. The ones based on KNN, viz. SMOTEBoost and RAMOBoost, can be slow. However, ensembles based on random undersampling and random oversampling are less costly. Also, boosting methods are sometimes found to work better than bagging methods on imbalanced data. We can combine...

Questions

  1. Try using RUSBoostClassifier on the abalone_19 dataset and compare the performance with other techniques from the previous chapters.
  2. What is the difference between the BalancedRandomForestClassifier and BalancedBaggingClassifier classes in the imbalanced-learn library?

References

  1. L. Breiman, Bagging predictors, Mach Learn, vol. 24, no. 2, pp. 123–140, Aug. 1996, doi: 10.1007/BF00058655, https://link.springer.com/content/pdf/10.1007/BF00058655.pdf.
  2. (The paper that introduced OverBagging, UnderBagging, and SMOTEBagging) S. Wang and X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in 2009 IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN, USA: IEEE, Mar. 2009, pp. 324–331. doi: 10.1109/CIDM.2009.4938667, https://www.cs.bham.ac.uk/~wangsu/documents/papers/CIDMShuo.pdf.
  3. Live Site Incident escalation forecast (2023), https://medium.com/data-science-at-microsoft/live-site-incident-escalation-forecast-566763a2178
  4. L. Breiman, Random Forests, Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324, https://link.springer.com/content/pdf/10.1023/A:1010933404324.pdf.
  5. (The paper that introduced the RUSBoost algorithm) C. Seiffert, T. M. Khoshgoftaar...