WORKING WITH IMBALANCED DATASETS
Imbalanced datasets contain at least one class that has significantly more samples than another class in the dataset. For example, if class A accounts for 99% of the data and class B for 1%, which classification algorithm would you use?
Unfortunately, many classification algorithms perform poorly on highly imbalanced datasets because they tend to favor the majority class. However, there are various techniques that you can use to reduce the imbalance in a dataset. Regardless of the technique that you decide to use, keep in mind the following detail: resampling techniques are applied only to the training data (not the validation data or the test data).
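The following sketch illustrates the rule that resampling touches only the training data. It uses only the Python standard library; the helper names stratified_split and random_oversample, the toy data, and the 80/20 split are assumptions made for this illustration, not part of any particular library.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (hypothetical): 95 rows of class "A" and 5 rows of class "B".
rows = [[i] for i in range(100)]
labels = ["A"] * 95 + ["B"] * 5

# Step 1: split first, so the test data keeps its original distribution.
train_idx, test_idx = stratified_split(labels)
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_rows = [rows[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]

# Step 2: resample the training data only.
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)

print("train before:", Counter(train_labels))
print("train after: ", Counter(balanced_labels))
print("test (untouched):", Counter(test_labels))
```

With this toy data, the training set goes from 76 A rows and 4 B rows to 76 of each, while the test set keeps its original 19 A rows and 1 B row.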
In addition, if you perform k-fold cross-validation on a training set, perform oversampling inside each fold, on that fold's training portion only. Oversampling before k-fold cross-validation causes data leakage, because copies of the same row can end up in both the training portion and the validation portion of a fold.
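To see why the order matters, the sketch below compares the two orderings on toy data and counts how many validation rows also appear in the training portion of the same fold. It uses only the Python standard library. The helpers oversample, k_folds, and leaked are hypothetical names for this illustration, and the folds are plain shuffled folds rather than stratified ones, which is a simplification.

```python
import random

def oversample(rows, seed=0):
    """Randomly duplicate rows of smaller classes until every class is the same size.
    Each row is an (id, label) pair; the id lets us trace duplicates."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = list(rows)
    for group in by_class.values():
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def k_folds(rows, k=5, seed=0):
    """Shuffle the rows and yield (train, validation) pairs for k folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    for f in range(k):
        validation = shuffled[f::k]
        train = [row for i, row in enumerate(shuffled) if i % k != f]
        yield train, validation

def leaked(train, validation):
    """Count validation rows whose id also appears among the training rows."""
    train_ids = {row[0] for row in train}
    return sum(1 for row in validation if row[0] in train_ids)

# Toy data (hypothetical): ids 0-94 are class "A", ids 95-99 are class "B".
data = [(i, "A") for i in range(95)] + [(i, "B") for i in range(95, 100)]

# Recommended order: split first, then oversample the training part of each fold.
leaks_inside = sum(leaked(oversample(train), val) for train, val in k_folds(data))

# Order to avoid: oversample first, then split. Copies of one row land on both sides.
leaks_before = sum(leaked(train, val) for train, val in k_folds(oversample(data)))

print("leaked rows, oversampling inside each fold:", leaks_inside)
print("leaked rows, oversampling before the split:", leaks_before)
```

Oversampling inside each fold leaks no rows, because the validation rows are set aside before any duplicates are created. Oversampling before the split lets duplicates of the same minority row fall on both sides of a fold, which tends to inflate validation scores.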
Data Sampling Techniques
Data sampling techniques reduce the imbalance in an imbalanced dataset...