Machine Learning (ML) models need to be thoroughly evaluated to ensure they will work in production. We have to ensure the model is not memorizing the training data and also ensure it learns enough from the given training data. Choosing the appropriate evaluation method is also critical when we want to perform hyperparameter tuning at a later stage.
In this chapter, we'll learn about all the important things we need to know when it comes to evaluating ML models. First, we need to understand the concept of overfitting. Then, we will look at the idea of splitting data into train, validation, and test sets. Additionally, we'll learn about the difference between random and stratified splits and when to use each of them.
We'll discuss the concept of cross-validation and its numerous variations: k-fold, repeated k-fold, Leave One Out (LOO), Leave P Out (LPO), and a specific strategy for dealing with time-series data, called time-series cross-validation. We'll also learn how to implement each of the evaluation strategies using the Scikit-Learn package.
By the end of this chapter, you will have a good understanding of why choosing a proper evaluation strategy is critical in the ML model development life cycle. Also, you will be aware of numerous evaluation strategies and will be able to choose the most appropriate one for your situation. Furthermore, you will also be able to implement each of the evaluation strategies using the Scikit-Learn package.
In this chapter, we're going to cover the following main topics:
- Understanding the concept of overfitting
- Creating training, validation, and test sets
- Exploring random and stratified splits
- Discovering k-fold cross-validation
- Discovering repeated k-fold cross-validation
- Discovering LOO cross-validation
- Discovering LPO cross-validation
- Discovering time-series cross-validation
We will learn how to implement each of the evaluation strategies using the Scikit-Learn package. To ensure that you can reproduce the code examples in this chapter, you will need the following:
- Python 3 (version 3.7 or above)
- The pandas package installed (version 1.3.4 or above)
- The Scikit-Learn package installed (version 1.0.1 or above)
All of the code examples for this chapter can be found on GitHub at https://github.com/PacktPublishing/Hyperparameter-Tuning-with-Python/blob/main/01_Evaluating-Machine-Learning-Models.ipynb.
Understanding the concept of overfitting
Overfitting occurs when the trained ML model learns too much from the given training data. In this situation, the trained model successfully gets a high evaluation score on the training data but a far lower score on new, unseen data. In other words, the trained ML model fails to generalize the knowledge learned from the training data to the unseen data.
So, how exactly does the trained ML model get decent performance on the training data but fail to give a reasonable performance on unseen data? Well, that happens when the model tries too hard to achieve high performance on the training data and has picked up knowledge that is only applicable to that specific training data. Of course, this will negatively impact the model's ability to generalize, which results in bad performance when the model is evaluated on unseen data.
To detect whether our trained ML model faces an overfitting issue, we can monitor the performance of our model on the training data versus unseen data. Performance can be defined as the loss value of our model or as a metric that we care about, for example, accuracy, precision, or the mean absolute error. If the performance on the training data keeps getting better, while the performance on the unseen data starts to stagnate or even gets worse, then this is a sign of an overfitting issue (see Figure 1.1):
Figure 1.1 – The model's performance on training data versus unseen data (overfitting)
The preceding diagram image has been reproduced according to the license specified: https://commons.wikimedia.org/wiki/File:Overfitting_svg.svg.
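To make this monitoring idea concrete, here is a minimal sketch that compares the accuracy on the training data with the accuracy on held-out data as a model is allowed to become more complex. The synthetic dataset, the choice of a decision tree, and the list of depths are all assumptions made for illustration only; they are not taken from the chapter's notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y), used only for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for max_depth in [1, 3, 5, 10, None]:  # None lets the tree grow until it fits the train set
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (accuracy on training data, accuracy on unseen data)
    scores[max_depth] = (model.score(X_train, y_train),
                         model.score(X_unseen, y_unseen))
    print(max_depth, scores[max_depth])
```

A growing gap between the two numbers as the depth increases is the pattern we should watch for.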
Now that you are aware of the overfitting problem, we need to learn how to prevent this from happening in our ML development life cycle. We will discuss this in the following sections.
Creating training, validation, and test sets
We understand that overfitting can be detected by monitoring the model's performance on the training data versus the unseen data, but what exactly is unseen data? Is it just random data that has not yet been seen by the model during the training phase?
Unseen data is a portion of our original complete data that was not seen by the model during the training phase. We usually refer to this unseen data as the test set. Let's imagine you have 100,000 samples of data, to begin with; you can take out a portion of the data, let's say 10% of it, to become the test set. So, now we have 90,000 samples as the training set and 10,000 samples as the testing set.
However, it is better to not just split our original data into train and test sets but also into a validation set, especially when we want to perform hyperparameter tuning on our model. Let's say that out of 100,000 original samples, we held out 10% of it to become the validation set and another 10% to become the test set. Therefore, we will have 80,000 samples as the train set, 10,000 samples as the validation set, and 10,000 samples as the test set.
You might be wondering why we need a validation set apart from the test set. Actually, we do not need it if we do not want to perform hyperparameter tuning or any other model-centric approaches. This is because the purpose of having a validation set is to keep the evaluation on the test set unbiased when we assess the final version of the trained model.
A validation set can help us to get an unbiased evaluation on the test set because we only use the validation set during the hyperparameter tuning phase. Once we finish the hyperparameter tuning phase and get the final model configuration, we can then evaluate our model on the purely unseen data, which is called the test set.
If you are going to perform any data preprocessing steps (for example, missing value imputation, feature engineering, standardization, label encoding, and more), you have to build the function based on the train set and then apply it to the validation and test set. Do not perform those data preprocessing steps on the full original data (before data splitting). That's because it might lead to a data leakage problem.
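As a minimal sketch of this rule, the following example fits a standardization step on the train set only and then reuses the learned statistics for the validation and test sets. The random feature matrix is a hypothetical stand-in for real data, and StandardScaler is just one example of a preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))  # hypothetical numeric features

# Split first, preprocess afterward
X_train, X_unseen = train_test_split(X, test_size=0.2, random_state=0)
X_val, X_test = train_test_split(X_unseen, test_size=0.5, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics are learned from the train set only
X_val_scaled = scaler.transform(X_val)          # reuse the train statistics
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Calling fit_transform on the full data before splitting would let information from the validation and test sets influence the scaling, which is the leakage we want to avoid.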
There is no specific rule when it comes to choosing the proportions for each of the train, validation, and test sets. You have to choose the split proportion by yourself based on the condition you are faced with. However, the common splitting proportion used by the data science community is 8:2 or 9:1 for the train set and the validation and test set, respectively. Usually, the validation and test set will have a proportion of 1:1. Therefore, the common splitting proportion is 8:1:1 or 9:0.5:0.5 for the train, validation, and test sets, respectively.
Now that we are aware of the train, validation, and test set concept, we need to learn how to build those sets. Do we just randomly split our original data into three sets? Or can we also apply some predefined rules? In the next section, we will explore this topic in more detail.
Exploring random and stratified splits
The most straightforward way (but not entirely a correct way) to split our original full data into train, validation, and test sets is by choosing the proportions for each set and then directly splitting them into three sets based on the order of the index.
For instance, the original full data has 100,000 samples, and we want to split this into train, validation, and test sets with a proportion of 8:1:1. Then, the training set will be the samples from index 1 until 80,000. The validation and test sets will be the samples from index 80,001 until 90,000 and from 90,001 until 100,000, respectively.
So, what's wrong with that approach? There is nothing wrong with that approach as long as the original full data is shuffled. It might cause a problem when there is some kind of pattern between the indices of the samples.
For instance, we have data consisting of 100,000 samples and 3 columns. The first and second columns contain weight and height information, respectively. The third column contains the "weight status" class (for example, underweight, normal weight, overweight, and obesity). Our task is to build an ML classifier model to predict the "weight status" class of a person, given their weight and height. It is not impossible for the data to be given to us ordered by the third column, so that the first 80,000 rows only consist of the underweight and normal weight classes, while the overweight and obesity classes are only located in the last 20,000 rows. If this is the case, and we apply the data splitting logic from earlier, then there is no way our classifier can predict that a new person belongs to the overweight or obesity "weight status" classes. Why? Because our classifier has never seen those classes before during the training phase!
Therefore, it is very important to ensure the original full data is shuffled in the first place, and essentially, this is what we mean by the random split. Random split works by first shuffling the original full data and then splitting it into the train, validation, and test sets based on the order of the index.
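The shuffle-then-slice logic can be sketched from scratch as follows. The small DataFrame, which arrives sorted by class, and its column names are hypothetical; the point is only to show that shuffling the row positions once is enough to break the ordering before we slice by index.

```python
import numpy as np
import pandas as pd

# Hypothetical data that arrives sorted by class
df = pd.DataFrame({"feature": np.arange(1000),
                   "class": np.repeat(["a", "b"], 500)})

rng = np.random.default_rng(0)
shuffled_positions = rng.permutation(len(df))  # shuffle the row positions once

# 8:1:1 proportion, computed with integer arithmetic
n_train = len(df) * 8 // 10
n_val = len(df) // 10

df_train = df.iloc[shuffled_positions[:n_train]]
df_val = df.iloc[shuffled_positions[n_train:n_train + n_val]]
df_test = df.iloc[shuffled_positions[n_train + n_val:]]

print(len(df_train), len(df_val), len(df_test))
print(df_train["class"].value_counts().to_dict())
```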
There is also another splitting logic called the stratified split. This logic ensures that the train, validation, and test sets each get a similar proportion of samples for each target class found in the original full data.
Using the same "weight status" class prediction case example, let's say that we found that the proportion of each class in the full original data is 3:5:1.5:0.5 for underweight, normal weight, overweight, and obese, respectively. The stratified split logic will ensure that we can find a similar proportion of those classes in the train, validation, and test sets. So, out of 80,000 samples of the train set, around 24,000 samples are in the underweight class, around 40,000 samples are in the normal weight class, around 12,000 samples are overweight, and around 4,000 samples are in the obesity class. This will also be applied to the validation and test set.
The remaining question is understanding when it is the right time to use the random split/stratified split logic. Often, the stratified split logic is used when we are faced with an imbalanced class problem. However, it is also often used when we want to make sure that we have a similar proportion of samples based on a specific variable (not necessarily the target class). If you are not faced with this kind of situation, then the random split is the go-to logic that you can always choose.
To implement both of the data splitting logics, you can write the code by yourself from scratch or utilize the well-known package called Scikit-Learn. The following is an example to perform a random split with a proportion of 8:1:1:
    from sklearn.model_selection import train_test_split
    df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0)
    df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0)
The df variable is our complete original data, stored in a Pandas DataFrame object. The train_test_split function splits a Pandas DataFrame, array, or matrix into shuffled train and test sets. In lines 2–3, first, we split the original full data into df_train and df_unseen with a proportion of 8:2, as specified by the test_size argument. Then, we split df_unseen into df_val and df_test with a proportion of 1:1.
To perform the stratified split logic, you can just add the stratify argument to the train_test_split function and fill it with the target array:
    df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])
    df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0, stratify=df_unseen['class'])
The stratify argument will ensure the data is split in a stratified fashion based on the given target array.
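To check that a stratified split really preserves the class proportions, we can compare the proportions in each set against the full data. The following is a small sketch on a hypothetical DataFrame built to follow the 3:5:1.5:0.5 "weight status" example; the feature column and the dataset size of 10,000 are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data following the 3:5:1.5:0.5 example
labels = (["underweight"] * 3000 + ["normal"] * 5000
          + ["overweight"] * 1500 + ["obesity"] * 500)
df = pd.DataFrame({"feature": np.arange(len(labels)), "class": labels})

df_train, df_unseen = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["class"])
df_val, df_test = train_test_split(
    df_unseen, test_size=0.5, random_state=0, stratify=df_unseen["class"])

# Compare the class proportions of every set against the full data
for name, part in [("full", df), ("train", df_train), ("val", df_val), ("test", df_test)]:
    print(name, part["class"].value_counts(normalize=True).round(3).to_dict())
```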
In this section, we have learned the importance of shuffling the original full data before performing data splitting, the difference between the random and stratified splits, and when to use each of them. In the next section, we will start learning about variations of the data splitting strategies and how to implement each of them using the Scikit-Learn package.
Discovering k-fold cross-validation
Cross-validation is a way to evaluate our ML model by performing multiple evaluations on our original full data via a resampling procedure. This is a variation from the vanilla train-validation-test split that we learned about in previous sections. Additionally, the concept of random and stratified splits can be applied in cross-validation.
In cross-validation, we perform multiple splits for the train and validation sets, where each split is usually referred to as a fold. What about the test set? Well, it still acts as the purely unseen data on which we can test the final model configuration. Therefore, it is separated only once from the train and validation sets, at the beginning.
There are several variations of the cross-validation strategy. The first one is called k-fold cross-validation. It works by performing k times of training and evaluation with a proportion of (k-1):1 for the train and validation set, respectively, in each fold. To have a clearer understanding of k-fold cross-validation, please refer to Figure 1.2:
Figure 1.2 – K-fold cross-validation
The preceding diagram has been reproduced according to the license specified: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation.jpg.
For instance, let's choose k = 4 to match the illustration in Figure 1.2. The green and red balls correspond to the target class, where, in this case, we only have two target classes. The data is shuffled beforehand, which can be seen from the absence of a pattern of green and red balls. It is also worth mentioning that the shuffling was only done once. That's why the order of green and red balls is always the same for each iteration (fold). The black box in each fold corresponds to the validation set (labeled as test data in the illustration).
As you can see in Figure 1.2, the proportion of the training set versus the validation set is (k-1):1, or in this case, 3:1. During each fold, the model will be trained on the train set and evaluated on the validation set. Notice that the training and validation sets are different across each fold. The final evaluation score can be calculated by taking the average score of all of the folds.
In summary, k-fold cross-validation works as follows:
- Shuffling the original full data
- Holding out the test data
- Performing the k-fold multiple evaluation strategy on the rest of the original full data
- Calculating the final evaluation score by taking the average score of all of the folds
- Evaluating the test data using the final model configuration
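The steps above can be sketched end to end as follows. The synthetic dataset and the logistic regression model are assumptions chosen only to keep the example small, and refitting the final configuration on all of the non-test data before the last step is one common choice rather than the only option.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Steps 1-2: shuffle the data and hold out the test set
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: k-fold evaluation on the rest of the data
fold_scores = []
for train_index, val_index in KFold(n_splits=4).split(X_cv):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_index], y_cv[train_index])
    fold_scores.append(model.score(X_cv[val_index], y_cv[val_index]))

# Step 4: the final evaluation score is the average over all folds
cv_score = np.mean(fold_scores)

# Step 5 (assumed variant): refit on all non-test data, then evaluate once on the test set
final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)
test_score = final_model.score(X_test, y_test)
print(fold_scores, cv_score, test_score)
```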
You might ask why do we need to perform cross-validation in the first place? Why is the vanilla train-validation-test splitting strategy not enough? There are several reasons why we need to apply the cross-validation strategy:
- Having only a small amount of training data.
- To get a more confident conclusion from the evaluation performance.
- To get a clearer picture of our model's learning ability and/or the complexity of the given data.
The first and second reasons are quite straightforward. The third reason is more interesting and should be discussed. How can cross-validation help us to get a better idea about our model's learning ability and/or the data complexity? Well, this happens when the variation of evaluation scores from each fold is quite big. For instance, out of 4 folds, we get accuracy scores of 45%, 82%, 64%, and 98%. This scenario should trigger our curiosity: what is wrong with our model and/or data? It could be that the data is too hard to learn and/or our model can't learn properly.
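To quantify how big the variation between folds is, we can look at the spread of the scores next to their average. Here is a short sketch using the four accuracy scores from the example above; which spread measure to report (standard deviation, range, or something else) is a judgment call.

```python
import numpy as np

# Accuracy scores from the 4-fold example above
fold_scores = np.array([0.45, 0.82, 0.64, 0.98])

mean_score = fold_scores.mean()                        # the final evaluation score
std_score = fold_scores.std()                          # spread around the mean
score_range = fold_scores.max() - fold_scores.min()    # best fold minus worst fold

print(f"mean={mean_score:.4f}, std={std_score:.4f}, range={score_range:.2f}")
```

An average of roughly 72% hides the fact that the best and worst folds are 53 percentage points apart, which is exactly the kind of signal that should make us inspect the model and the data.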
The following is the syntax to perform k-fold cross-validation via the Scikit-Learn package:
    from sklearn.model_selection import train_test_split, KFold

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)
    kf = KFold(n_splits=4)
    for train_index, val_index in kf.split(df_cv):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
Notice that, first, we hold out the test set and only work with df_cv when performing the k-fold cross-validation. By default, the KFold function will disable the shuffling procedure. However, this is not a problem for us since the data has already been shuffled beforehand when we called the train_test_split function. If you want to run the shuffling procedure again, you can pass shuffle=True in the KFold function.
The stratified version of k-fold cross-validation can be implemented as follows:
    from sklearn.model_selection import train_test_split, StratifiedKFold

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])
    skf = StratifiedKFold(n_splits=4)
    for train_index, val_index in skf.split(df_cv, df_cv['class']):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
The only difference is to import StratifiedKFold instead of the KFold function and to pass the array of target variables to the split method, which will be used to split the data in a stratified fashion.
In this section, you have learned what cross-validation is, when the right time is to perform cross-validation, and the first (and the most widely used) cross-validation strategy variation, which is called k-fold cross-validation. In the subsequent sections, we will also learn other variations of cross-validation and how to implement them using the Scikit-Learn package.
Discovering repeated k-fold cross-validation
Repeated k-fold cross-validation involves simply performing the k-fold cross-validation repeatedly, N times, with different randomizations in each repetition. The final evaluation score is the average of all scores from all folds of each repetition. This strategy will increase our confidence in our model.
So, why repeat the k-fold cross-validation? Why don't we just increase the value of k in k-fold? Surely, increasing the value of k will reduce the bias of our model's estimated performance. However, increasing the value of k will also increase the variance of that estimate, especially when we have a small number of samples. Therefore, repeating the k-fold procedure is usually a better way to gain higher confidence in our model's estimated performance. Of course, this comes with a drawback, which is the increase in computation time.
To implement this strategy, we can simply write a manual for-loop, where we apply the k-fold cross-validation strategy in each iteration. Fortunately, the Scikit-Learn package provides us with a specific function to implement this strategy:
    from sklearn.model_selection import train_test_split, RepeatedKFold

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)
    rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)
    for train_index, val_index in rkf.split(df_cv):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
Choosing n_splits=4 and n_repeats=3 means that we will have 12 different train and validation sets. The final evaluation score is then just the average of all 12 scores. As you might expect, there is also a dedicated function to implement the repeated k-fold in a stratified fashion:
    from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])
    rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)
    for train_index, val_index in rskf.split(df_cv, df_cv['class']):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
The RepeatedStratifiedKFold function will perform stratified k-fold cross-validation repeatedly, n_repeats times.
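To see what the repetition produces, we can count the generated splits and inspect the validation indices of each repetition. The following sketch uses a hypothetical array of 20 samples in place of real data.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(-1, 1)  # 20 hypothetical samples
rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

splits = list(rkf.split(X))
print(rkf.get_n_splits(X))  # total number of train/validation pairs

# Within each repetition, the 4 validation sets together cover every sample exactly once
covered = []
for repeat in range(3):
    val_indices = np.concatenate(
        [val_index for _, val_index in splits[repeat * 4:(repeat + 1) * 4]])
    covered.append(sorted(val_indices.tolist()))
    print(repeat, covered[-1])
```

Each repetition is a complete k-fold pass over the data; only the shuffling differs between repetitions.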
Discovering Leave-One-Out cross-validation
Essentially, Leave One Out (LOO) cross-validation is just k-fold cross-validation where k = n, where n is the number of samples. This means there are n-1 samples for the training set and 1 sample for the validation set in each fold (see Figure 1.3). Undoubtedly, this is a very computationally expensive strategy and will result in a very high variance evaluation score estimator:
Figure 1.3 – LOO cross-validation
So, when is LOO preferred over k-fold cross-validation? Well, LOO works best when you have a very small dataset. It is also good to choose LOO over k-fold if you prefer the high confidence of the model's performance estimation over the computational cost limitation.
Implementing this strategy from scratch is actually very simple. We just need to loop through each of the indexes of data and do some data manipulation. However, the Scikit-Learn package also provides the implementation for LOO, which we can use:
    from sklearn.model_selection import train_test_split, LeaveOneOut

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)
    loo = LeaveOneOut()
    for train_index, val_index in loo.split(df_cv):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
Notice that there is no argument provided in the LeaveOneOut function since this strategy is very straightforward and involves no stochastic procedure. There is also no stratified version of the LOO since the validation set will always contain one sample.
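The claim that LOO is just k-fold with k = n can be checked directly on a tiny example. The six-sample array below is a hypothetical stand-in for real data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # 6 hypothetical samples

loo_splits = [(train_index.tolist(), val_index.tolist())
              for train_index, val_index in LeaveOneOut().split(X)]
# k-fold without shuffling, with as many folds as there are samples
kfold_splits = [(train_index.tolist(), val_index.tolist())
                for train_index, val_index in KFold(n_splits=len(X)).split(X)]

for train_index, val_index in loo_splits:
    print("train:", train_index, "validation:", val_index)
```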
Now that you are aware of the concept of LOO, in the next section, we will learn about a slight variation of LOO.
Discovering LPO cross-validation
LPO cross-validation is a variation of the LOO cross-validation strategy, where the validation set in each fold contains p samples instead of only 1 sample. Similar to LOO, this strategy will ensure that we get all possible combinations of train-validation pairs. To be more precise, there will be $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$ folds, assuming there are n samples in our data. For example, there will be $\binom{50}{5}$, or 2,118,760, folds if we want to perform Leave-5-Out cross-validation on data that has 50 samples.
LPO is suitable when you have a small number of samples and want to get even higher confidence in the model's estimated performance compared to the LOO method. LPO will result in an exploding number of folds when you have a large number of samples.
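To get a feel for how quickly the number of folds grows, we can count the folds without generating them. In the sketch below, n_lpo_folds is a hypothetical helper written for this illustration, while get_n_splits is the counting method provided by Scikit-Learn's LeavePOut.

```python
from math import factorial

import numpy as np
from sklearn.model_selection import LeavePOut


def n_lpo_folds(n, p):
    """Number of ways to choose p validation samples out of n (hypothetical helper)."""
    return factorial(n) // (factorial(p) * factorial(n - p))


X = np.arange(50).reshape(-1, 1)  # 50 hypothetical samples

for p in [1, 2, 5]:
    # get_n_splits counts the folds without generating them
    print(p, n_lpo_folds(50, p), LeavePOut(p=p).get_n_splits(X))
```

Since every fold means one training run, even a modest p on a small dataset can already be impractical.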
This strategy is a bit different from k-fold or LOO in terms of the overlap between the validation sets. For p > 1, LPO will result in overlapping validation sets, while k-fold and LOO will always result in non-overlapping validation sets. Also, note that LPO is different from k-fold with k = n // p, since k-fold will always create non-overlapping validation sets, whereas LPO will not. The Scikit-Learn package provides an implementation of LPO:
    from sklearn.model_selection import train_test_split, LeavePOut

    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)
    lpo = LeavePOut(p=2)
    for train_index, val_index in lpo.split(df_cv):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
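The difference in overlap between LPO and k-fold is easiest to see on a tiny example. Here is a sketch with four hypothetical samples, comparing Leave-2-Out against k-fold with k = n // p = 2.

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut

X = np.arange(4).reshape(-1, 1)  # 4 hypothetical samples

# Every pair of samples becomes a validation set once
lpo_val_sets = [val_index.tolist() for _, val_index in LeavePOut(p=2).split(X)]
# k-fold with k = n // p only produces disjoint validation sets
kfold_val_sets = [val_index.tolist() for _, val_index in KFold(n_splits=len(X) // 2).split(X)]

print("LPO validation sets:   ", lpo_val_sets)
print("k-fold validation sets:", kfold_val_sets)
```

In the LPO output, each sample appears in several validation sets, whereas in the k-fold output it appears in exactly one.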
In this section, we have learned about the variations of the LOO cross-validation strategy. In the next section, we will learn how to perform cross-validation on time-series data.
Discovering time-series cross-validation
Time-series data has a unique characteristic. Unlike "normal" data, which is assumed to be independent and identically distributed (IID), time-series data does not follow that assumption. In fact, each sample is dependent on previous samples, meaning changing the order of the samples will result in different data interpretations. The following are some examples of time-series data:
- Daily stock market price
- Hourly temperature data
- Minute-by-minute web page clicks count
There will be a look-ahead bias if we apply previous cross-validation strategies (for example, k-fold or random or stratified splits) to time-series data. Look-ahead bias happens when we use the future value of the data that is supposedly not available for the current time of the simulation.
For instance, let's say we are working with hourly temperature data. We want to predict what the temperature will be in 2 hours, but we use the temperature value of the next hour or the next 3 hours, which is supposedly not available yet. This kind of bias will happen easily if we apply the previous cross-validation strategies since those strategies are designed to work well only on IID data.
Time-series cross-validation is the cross-validation strategy that is specifically designed to handle time-series data. It works similarly to k-fold in terms of accepting the predefined values of folds, which then generates k test sets. The difference is that the data is not shuffled in the first place, and the training set in the next iteration is the superset of the one in the previous iteration, meaning the training set keeps getting bigger over the number of iterations. Once we finish with the cross-validation and get the final model configuration, we can then test our final model on the test data (see Figure 1.4):
Figure 1.4 – Time-series cross-validation
Also, the Scikit-Learn package provides us with a nice implementation of this strategy:
    from sklearn.model_selection import train_test_split, TimeSeriesSplit

    # shuffle=False keeps the chronological order when holding out the test set
    df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)
    tscv = TimeSeriesSplit(n_splits=5)
    for train_index, val_index in tscv.split(df_cv):
        df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
        # perform training or hyperparameter tuning here
Providing n_splits=5 will ensure that there are five test sets generated. It is worth noting that, by default, the train set will have the size of i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1) for the ith fold, while the test set will have the size of n_samples // (n_splits + 1).
However, you can change the train and test set size via the max_train_size and test_size arguments of the TimeSeriesSplit function. Additionally, there is also a gap argument that can be utilized to exclude G samples from the end of each train set, where G is a value that needs to be specified by the developer.
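To see the default sizes and the effect of the gap argument, we can print the size of each train and validation set. The 23-sample array below is a hypothetical stand-in for time-ordered data, chosen so that the number of samples is not a multiple of n_splits + 1.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(23).reshape(-1, 1)  # 23 hypothetical samples, already in time order

# Default behavior: expanding train set, fixed-size validation set
default_sizes = [(len(train_index), len(val_index))
                 for train_index, val_index in TimeSeriesSplit(n_splits=5).split(X)]
print(default_sizes)

# gap=2 excludes the last 2 samples before each validation set from training
gap_sizes = []
for train_index, val_index in TimeSeriesSplit(n_splits=5, gap=2).split(X):
    assert train_index.max() < val_index.min()  # training data always precedes validation data
    gap_sizes.append((len(train_index), len(val_index)))
print(gap_sizes)
```

A gap can be useful when the samples right before the validation window would otherwise leak information about it, for example, when the features are built from rolling windows.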
You need to be aware that the Scikit-Learn implementation will always make sure that there is no overlap between test sets, which is actually not necessary. Currently, there is no way to enable the overlap between the test sets using the Scikit-Learn implementation. You need to write the code from scratch to perform that kind of strategy.
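As a rough idea of what such from-scratch code could look like, here is a minimal sketch of an expanding-window splitter whose validation sets overlap whenever the step is smaller than the validation size. The expanding_window_split function and all of its parameters are hypothetical and are not part of Scikit-Learn.

```python
import numpy as np


def expanding_window_split(n_samples, initial_train_size, val_size, step):
    """Yield (train_index, val_index) pairs in time order (hypothetical helper).

    The validation sets overlap whenever step < val_size.
    """
    train_end = initial_train_size
    while train_end + val_size <= n_samples:
        # Train on everything before train_end, validate on the next val_size samples
        yield np.arange(train_end), np.arange(train_end, train_end + val_size)
        train_end += step


splits = list(expanding_window_split(10, initial_train_size=4, val_size=3, step=1))
for train_index, val_index in splits:
    print(len(train_index), val_index.tolist())
```

The generator returns index arrays, so it can be used with df_cv.iloc in the same way as the Scikit-Learn splitters shown earlier.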
In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.
In this chapter, we learned a lot of important things that we need to know regarding how to evaluate ML models properly. We started from the concept of overfitting, and then covered numerous data splitting strategies, how to choose the best data splitting strategy based on the given situation, and how to implement each of them using the Scikit-Learn package. Understanding these concepts is important since you can't perform a good hyperparameter tuning process without applying the appropriate data splitting strategy.
In the next chapter, we will discuss hyperparameter tuning. We will not only discuss the definition but also several misconceptions and types of hyperparameter distributions.
In this chapter, we have covered a lot of topics. However, due to the scope of this book, there are still many interesting algorithms related to cross-validation that we have not covered. If you want to learn more about those algorithms and the implementation details of each of them, you can refer to this awesome page created by the Scikit-Learn authors at https://scikit-learn.org/stable/modules/cross_validation.html.