Hyperparameter Tuning with Python

Chapter 1: Evaluating Machine Learning Models

Machine Learning (ML) models need to be thoroughly evaluated to ensure they will work in production. We have to ensure the model is not memorizing the training data and also ensure it learns enough from the given training data. Choosing the appropriate evaluation method is also critical when we want to perform hyperparameter tuning at a later stage.

In this chapter, we'll learn about all the important things we need to know when it comes to evaluating ML models. First, we need to understand the concept of overfitting. Then, we will look at the idea of splitting data into train, validation, and test sets. Additionally, we'll learn about the difference between random and stratified splits and when to use each of them.

We'll discuss the concept of cross-validation and its numerous variations: k-fold, repeated k-fold, Leave One Out (LOO), Leave P Out (LPO), and a strategy specific to time-series data, called time-series cross-validation. We'll also learn how to implement each of the evaluation strategies using the Scikit-Learn package.

By the end of this chapter, you will have a good understanding of why choosing a proper evaluation strategy is critical in the ML model development life cycle. Also, you will be aware of numerous evaluation strategies and will be able to choose the most appropriate one for your situation. Furthermore, you will also be able to implement each of the evaluation strategies using the Scikit-Learn package.

In this chapter, we're going to cover the following main topics:

Understanding the concept of overfitting
Creating training, validation, and test sets
Exploring random and stratified splits
Discovering k-fold cross-validation
Discovering repeated k-fold cross-validation
Discovering LOO cross-validation
Discovering LPO cross-validation
Discovering time-series cross-validation

Creating training, validation, and test sets

We understand that overfitting can be detected by monitoring the model's performance on the training data versus the unseen data, but what exactly is unseen data? Is it just random data that has not yet been seen by the model during the training phase?

Unseen data is a portion of our original complete data that was not seen by the model during the training phase. We usually refer to this unseen data as the test set. Let's imagine you have 100,000 samples of data to begin with; you can take out a portion of the data, let's say 10% of it, to become the test set. So, now we have 90,000 samples as the training set and 10,000 samples as the test set.

However, it is better to not just split our original data into train and test sets but also into a validation set, especially when we want to perform hyperparameter tuning on our model. Let's say that out of 100,000 original samples, we held out 10% of it to become the validation set and another 10% to become the test set. Therefore, we will have 80,000 samples as the train set, 10,000 samples as the validation set, and 10,000 samples as the test set.

You might be wondering why we need a validation set apart from the test set. Actually, we do not need it if we do not want to perform hyperparameter tuning or any other model-centric approaches. This is because the purpose of having a validation set is to keep the evaluation on the test set unbiased when we assess the final version of the trained model.

A validation set helps us to get an unbiased evaluation on the test set because all tuning decisions are made using the validation set only, so the test set never influences the hyperparameter tuning phase. Once we finish the hyperparameter tuning phase and get the final model configuration, we can then evaluate our model on the purely unseen data, which is called the test set.
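The workflow above can be sketched as follows. This is only a minimal illustration, assuming synthetic data from make_classification, a decision tree as the model, and an arbitrary list of candidate max_depth values; none of these choices come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, an assumption made purely for illustration.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 8:1:1 split into train, validation, and test sets.
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_unseen, y_unseen, test_size=0.5, random_state=0)

# Candidate hyperparameter values (arbitrary choices).
candidate_depths = [1, 3, 5, 10]

# Tuning phase: only the train and validation sets are used.
val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# Final phase: the test set is touched exactly once.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_depth, val_scores, test_score)
```

Because the test set plays no part in choosing best_depth, test_score is not inflated by the tuning process.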

Important Note

If you are going to perform any data preprocessing steps (for example, missing value imputation, feature engineering, standardization, label encoding, and more), you have to fit the preprocessing step on the train set only and then apply the fitted step to the validation and test sets. Do not perform those data preprocessing steps on the full original data (before data splitting), because statistics computed from the validation and test samples would then leak into training. This is known as a data leakage problem.
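As a minimal sketch of this rule, the following example standardizes features with Scikit-Learn's StandardScaler. The feature matrix is synthetic and only serves as a placeholder; the point is that fit is called on the train set alone, while transform is applied to all three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, purely for illustration (not from the text).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))

# Split first: 8:1:1 for train, validation, and test.
X_train, X_unseen = train_test_split(X, test_size=0.2, random_state=0)
X_val, X_test = train_test_split(X_unseen, test_size=0.5, random_state=0)

# Fit the scaler on the train set only...
scaler = StandardScaler().fit(X_train)

# ...then apply the already-fitted scaler to every set.
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```

The scaled validation and test sets will usually not have a mean of exactly zero, which is expected: they are scaled with the train set's statistics, just as new data would be in production.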

There is no specific rule when it comes to choosing the proportions for each of the train, validation, and test sets. You have to choose the split proportion yourself based on the situation you are faced with. However, a common choice in the data science community is 8:2 or 9:1 for the train set versus the validation and test sets combined. Usually, the validation and test sets are given a proportion of 1:1. Therefore, the common splitting proportion is 8:1:1 or 9:0.5:0.5 for the train, validation, and test sets, respectively.

Now that we are aware of the train, validation, and test set concept, we need to learn how to build those sets. Do we just randomly split our original data into three sets? Or can we also apply some predefined rules? In the next section, we will explore this topic in more detail.

Exploring random and stratified splits

The most straightforward way (but not entirely a correct way) to split our original full data into train, validation, and test sets is by choosing the proportions for each set and then directly splitting them into three sets based on the order of the index.

For instance, the original full data has 100,000 samples, and we want to split this into train, validation, and test sets with a proportion of 8:1:1. Then, the training set will be the samples from index 1 until 80,000. The validation and test sets will be the samples from index 80,001 until 90,000 and 90,001 until 100,000, respectively.

So, what's wrong with that approach? There is nothing wrong with that approach as long as the original full data is shuffled. It might cause a problem when there is some kind of pattern between the indices of the samples.

For instance, we have data consisting of 100,000 samples and 3 columns. The first and second columns contain weight and height information, respectively. The third column contains the "weight status" class (for example, underweight, normal weight, overweight, and obesity). Our task is to build an ML classifier model to predict the "weight status" class of a person, given their weight and height. It is not impossible for the data to be given to us ordered by the third column. So, the first 80,000 rows only consist of the underweight and normal weight classes. In comparison, the overweight and obesity classes are only located in the last 20,000 rows. If this is the case, and we apply the data splitting logic from earlier, then there is no way our classifier can predict that a new person belongs to the overweight or obesity "weight status" classes. Why? Because our classifier has never seen those classes before during the training phase!

Therefore, it is very important to ensure the original full data is shuffled in the first place, and essentially, this is what we mean by the random split. Random split works by first shuffling the original full data and then splitting it into the train, validation, and test sets based on the order of the index.
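The problem and its fix can be reproduced on a small scale. The sketch below assumes a hypothetical DataFrame with only two classes, sorted by class in the same 80/20 layout as the example above, and compares an index-order split with a random split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data sorted by class: 80% "normal", then 20% "overweight".
df = pd.DataFrame({
    "feature": np.arange(1000),
    "class": ["normal"] * 800 + ["overweight"] * 200,
})

# Index-order split on the unshuffled data: the train set is the first 80%.
ordered_train = df.iloc[:800]
ordered_classes = set(ordered_train["class"])

# Random split: train_test_split shuffles by default before splitting.
random_train, random_unseen = train_test_split(
    df, test_size=0.2, random_state=0)
random_classes = set(random_train["class"])

print(ordered_classes)  # classes visible to the model without shuffling
print(random_classes)   # classes visible to the model after shuffling
```

With the index-order split, the "overweight" class never reaches the train set, whereas the random split exposes the model to both classes.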

There is also another splitting logic called the stratified split. This logic ensures that the train, validation, and test sets each get a similar proportion of samples for each target class found in the original full data.

Using the same "weight status" class prediction example, let's say that we found that the proportion of each class in the full original data is 3:5:1.5:0.5 for underweight, normal weight, overweight, and obesity, respectively. The stratified split logic will ensure that we can find a similar proportion of those classes in the train, validation, and test sets. So, out of 80,000 samples in the train set, around 24,000 samples are in the underweight class, around 40,000 samples are in the normal weight class, around 12,000 samples are in the overweight class, and around 4,000 samples are in the obesity class. The same proportions will also hold in the validation and test sets.

The remaining question is understanding when it is the right time to use the random split/stratified split logic. Often, the stratified split logic is used when we are faced with an imbalanced class problem. However, it is also often used when we want to make sure that we have a similar proportion of samples based on a specific variable (not necessarily the target class). If you are not faced with this kind of situation, then the random split is the go-to logic that you can always choose.

To implement both of the data splitting logics, you can write the code by yourself from scratch or utilize the well-known package called Scikit-Learn. The following is an example to perform a random split with a proportion of 8:1:1:

from sklearn.model_selection import train_test_split

df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0)

df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0)

The df variable is our complete original data that was stored in the Pandas DataFrame object. The train_test_split function splits the Pandas DataFrame, array, or matrix into shuffled train and test sets. In lines 2–3, first, we split the original full data into df_train and df_unseen with a proportion of 8:2, as specified by the test_size argument. Then, we split df_unseen into df_val and df_test with a proportion of 1:1.
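To check that the two calls really produce an 8:1:1 split, we can count the rows of each set. The sketch below assumes a placeholder DataFrame with 100,000 rows standing in for df.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder DataFrame standing in for the df used in the text.
df = pd.DataFrame({"x": np.arange(100_000)})

df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0)
df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0)

sizes = (len(df_train), len(df_val), len(df_test))
print(sizes)

# Every original row lands in exactly one of the three sets.
all_indices = df_train.index.union(df_val.index).union(df_test.index)
print(len(all_indices))
```

Fixing random_state makes the shuffle reproducible, so rerunning the code yields the same three sets.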

To perform the stratified split logic, you can just add the stratify argument to the train_test_split function and fill it with the target array:

df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])

df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0, stratify=df_unseen['class'])

The stratify argument will ensure the data is split in the stratified fashion based on the given target array.
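We can verify this behavior by inspecting the class shares of each set. The following sketch assumes a hypothetical DataFrame whose class column follows the 3:5:1.5:0.5 ratio from the earlier example, scaled down to 10,000 rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the 3:5:1.5:0.5 class ratio used in the example.
labels = (["underweight"] * 3000 + ["normal"] * 5000
          + ["overweight"] * 1500 + ["obesity"] * 500)
df = pd.DataFrame({"x": np.arange(len(labels)), "class": labels})

df_train, df_unseen = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["class"])
df_val, df_test = train_test_split(
    df_unseen, test_size=0.5, random_state=0, stratify=df_unseen["class"])

# Class shares in each split; all should be close to 0.30/0.50/0.15/0.05.
for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
    shares = part["class"].value_counts(normalize=True).round(3).to_dict()
    print(name, shares)

train_share = df_train["class"].value_counts(normalize=True)
test_share = df_test["class"].value_counts(normalize=True)
```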

In this section, we have learned the importance of shuffling the original full data before performing data splitting and also understood the difference between the random and stratified splits, as well as when to use each of them. In the next section, we will start learning variations of the data splitting strategies and how to implement each of them using the Scikit-Learn package.

Discovering k-fold cross-validation

Cross-validation is a way to evaluate our ML model by performing multiple evaluations on our original full data via a resampling procedure. This is a variation from the vanilla train-validation-test split that we learned about in previous sections. Additionally, the concept of random and stratified splits can be applied in cross-validation.

In cross-validation, we perform multiple splits for the train and validation sets, where each split is usually referred to as a fold. What about the test set? Well, it still acts as the purely unseen data on which we can test the final model configuration. Therefore, it is separated from the train and validation sets only once, at the beginning.

There are several variations of the cross-validation strategy. The first one is called k-fold cross-validation. It works by performing k rounds of training and evaluation, with a proportion of (k-1):1 for the train and validation sets, respectively, in each fold. To have a clearer understanding of k-fold cross-validation, please refer to Figure 1.2:

Figure 1.2 – K-fold cross-validation

Note

The preceding diagram has been reproduced according to the license specified: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation.jpg.

For instance, let's choose k = 4 to match the illustration in Figure 1.2. The green and red balls correspond to the target class, where, in this case, we only have two target classes. The data is shuffled beforehand, which can be seen from the absence of a pattern of green and red balls. It is also worth mentioning that the shuffling was only done once, at the start. That's why the order of green and red balls is always the same for each iteration (fold). The black box in each fold corresponds to the validation set (labeled as test data in the illustration).

As you can see in Figure 1.2, the proportion of the training set versus the validation set is (k-1):1, or in this case, 3:1. During each fold, the model will be trained on the train set and evaluated on the validation set. Notice that the training and validation sets are different across each fold. The final evaluation score can be calculated by taking the average score of all of the folds.

In summary, k-fold cross-validation works as follows:

Shuffling the original full data
Holding out the test data
Performing the k-fold multiple evaluation strategy on the rest of the original full data
Calculating the final evaluation score by taking the average score of all of the folds
Evaluating the test data using the final model configuration
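The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions: the data is synthetic, logistic regression is an arbitrary model choice, and refitting the final configuration on all non-test data before the last step is one common option rather than something the text prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic binary classification data, an assumption for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Steps 1-2: shuffle and hold out the test data.
X_cv, X_test, y_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 3: k-fold evaluation on the remaining data.
kf = KFold(n_splits=4)
fold_scores = []
for train_index, val_index in kf.split(X_cv):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_index], y_cv[train_index])
    fold_scores.append(model.score(X_cv[val_index], y_cv[val_index]))

# Step 4: the cross-validation score is the mean over the folds.
cv_score = float(np.mean(fold_scores))

# Step 5: refit on all non-test data, then evaluate once on the test set.
final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)
test_score = final_model.score(X_test, y_test)
print(fold_scores, cv_score, test_score)
```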

You might ask why we need to perform cross-validation in the first place. Why is the vanilla train-validation-test splitting strategy not enough? There are several reasons why we might apply the cross-validation strategy:

Having only a small amount of training data.
To get a more confident conclusion from the evaluation performance.
To get a clearer picture of our model's learning ability and/or the complexity of the given data.

The first and second reasons are quite straightforward. The third reason is more interesting and should be discussed. How can cross-validation help us to get a better idea about our model's learning ability and/or the data complexity? Well, this happens when the variation of evaluation scores from each fold is quite big. For instance, out of 4 folds, we get accuracy scores of 45%, 82%, 64%, and 98%. This scenario should trigger our curiosity: what is wrong with our model and/or data? It could be that the data is too hard to learn and/or our model can't learn properly.
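One simple way to quantify "quite big" is to look at the spread of the fold scores next to their average. The sketch below uses the four accuracy scores from the example above; reporting the sample standard deviation and the range is our choice of summary, not a rule from the text.

```python
from statistics import mean, stdev

# Fold accuracies from the example in the text.
fold_scores = [0.45, 0.82, 0.64, 0.98]

avg = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds
score_range = max(fold_scores) - min(fold_scores)

print(f"mean={avg:.4f} std={spread:.4f} range={score_range:.2f}")
```

An average of roughly 72% hides the fact that individual folds differ by more than 50 percentage points, which is exactly the kind of signal that a single train-validation split cannot reveal.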

The following is the syntax to perform k-fold cross-validation via the Scikit-Learn package:

from sklearn.model_selection import train_test_split, KFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)

kf = KFold(n_splits=4)

for train_index, val_index in kf.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

Notice that, first, we hold out the test set and only work with df_cv when performing the k-fold cross-validation. By default, the KFold class disables the shuffling procedure. However, this is not a problem for us since the data has already been shuffled beforehand when we called the train_test_split function. If you want to run the shuffling procedure again, you can pass shuffle=True to KFold.
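The effect of the shuffle argument is easy to see on a tiny example. The sketch below assumes eight placeholder samples and prints the validation indices of each fold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(8)  # eight placeholder samples

# Default: no shuffling, so each validation fold is a contiguous block.
plain_folds = [val.tolist() for _, val in KFold(n_splits=4).split(data)]

# shuffle=True: samples are permuted before being assigned to folds.
shuffled_folds = [
    val.tolist()
    for _, val in KFold(n_splits=4, shuffle=True, random_state=0).split(data)
]

print(plain_folds)
print(shuffled_folds)
```

In both cases, every sample appears in exactly one validation fold; only the assignment of samples to folds changes.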

Here is another example if you are interested in learning how to apply the concept of stratifying splits in k-fold cross-validation:

from sklearn.model_selection import train_test_split, StratifiedKFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])

skf = StratifiedKFold(n_splits=4)

for train_index, val_index in skf.split(df_cv, df_cv['class']):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

The only difference is to import StratifiedKFold instead of KFold and to pass the array of target variables to the split method, which will be used to split the data in a stratified fashion.
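To see what stratification does to each fold, we can measure the share of the minority class in every validation set. The sketch below assumes imbalanced placeholder labels with a 9:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced placeholder labels: 90% class 0, 10% class 1.
y = np.array([0] * 360 + [1] * 40)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4)
minority_share = [
    float(y[val_index].mean()) for _, val_index in skf.split(X, y)
]
print(minority_share)
```

Each validation fold keeps the same 10% minority share as the full data, which a plain k-fold split does not guarantee.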

In this section, you have learned what cross-validation is, when the right time is to perform cross-validation, and the first (and the most widely used) cross-validation strategy variation, which is called k-fold cross-validation. In the subsequent sections, we will also learn other variations of cross-validation and how to implement them using the Scikit-Learn package.

Discovering repeated k-fold cross-validation

Repeated k-fold cross-validation involves simply performing the k-fold cross-validation repeatedly, N times, with different randomizations in each repetition. The final evaluation score is the average of all scores from all folds of each repetition. This strategy will increase our confidence in our model's estimated performance.

So, why repeat the k-fold cross-validation? Why don't we just increase the value of k in k-fold? Surely, increasing the value of k will reduce the bias of our model's estimated performance. However, increasing the value of k will also increase the variance of that estimate, especially when we have a small number of samples. Therefore, usually, repeating the k-folds is a better way to gain higher confidence in our model's estimated performance. Of course, this comes with a drawback, which is the increase in computation time.

To implement this strategy, we can simply write a manual for-loop, where we apply the k-fold cross-validation strategy in each iteration. Fortunately, the Scikit-Learn package provides us with a dedicated class to implement this strategy:

from sklearn.model_selection import train_test_split, RepeatedKFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)

rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

for train_index, val_index in rkf.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

Choosing n_splits=4 and n_repeats=3 means that we will have 12 different train and validation sets. The final evaluation score is then just the average of all 12 scores. As you might expect, there is also a dedicated function to implement the repeated k-fold in a stratified fashion:

from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)

for train_index, val_index in rskf.split(df_cv, df_cv['class']):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

The RepeatedStratifiedKFold function will perform stratified k-fold cross-validation repeatedly, n_repeats times.
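To confirm the number of splits and to see that each repetition uses a different randomization, we can list the validation folds per repetition. The sketch below assumes twelve placeholder samples and uses the non-stratified RepeatedKFold for brevity.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

data = np.arange(12)  # placeholder samples
rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

splits = list(rkf.split(data))
n_total = rkf.get_n_splits()

# Group validation folds by repetition: 4 folds per repetition.
repetitions = [
    [val.tolist() for _, val in splits[r * 4:(r + 1) * 4]] for r in range(3)
]
for r, folds in enumerate(repetitions):
    print(r, folds)
```

Within each repetition, the four validation folds together cover every sample exactly once.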

Now that you have learned another variation of the cross-validation strategy, called repeated k-fold cross-validation, let's learn about the other variations next.

Discovering time-series cross-validation

Time-series data has a unique characteristic in nature. Unlike "normal" data, which is assumed to be independent and identically distributed (IID), time-series data does not follow that assumption. In fact, each sample is dependent on previous samples, meaning changing the order of the samples will result in different data interpretations.

Several examples of time-series data are listed as follows:

Daily stock market price
Hourly temperature data
Minute-by-minute web page clicks count

There will be a look-ahead bias if we apply the previous cross-validation strategies (for example, k-fold or random or stratified splits) to time-series data. Look-ahead bias happens when we use a future value of the data that would not yet be available at the current time of the simulation.

For instance, we are working with hourly temperature data. We want to predict what the temperature will be in 2 hours, but we use the temperature value of the next hour or the next 3 hours, which would not be available yet. This kind of bias will happen easily if we apply the previous cross-validation strategies since those strategies are designed to work well only on IID data.
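The following sketch makes the problem concrete. It assumes a placeholder series of 24 hourly samples whose index is the timestamp, applies a shuffled k-fold split, and counts the folds in which at least one training sample comes after the earliest validation sample.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder hourly series: the index itself is the timestamp (hour 0..23).
hours = np.arange(24)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
leaky_folds = 0
for train_index, val_index in kf.split(hours):
    # Training samples that come after the earliest validation timestamp
    # are "future" data from the point of view of that validation sample.
    if train_index.max() > val_index.min():
        leaky_folds += 1

print(leaky_folds)
```

Any fold counted here trains the model on information from the future relative to part of its validation set, which is the look-ahead bias described above.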

Time-series cross-validation is the cross-validation strategy that is specifically designed to handle time-series data. It works similarly to k-fold in that it accepts a predefined number of folds, k, and generates k test sets (which play the role of validation sets here). The difference is that the data is not shuffled in the first place, and the training set in the next iteration is a superset of the one in the previous iteration, meaning the training set keeps getting bigger as the iterations progress. Once we finish with the cross-validation and get the final model configuration, we can then test our final model on the test data (see Figure 1.4):

Figure 1.4 – Time-series cross-validation

Also, the Scikit-Learn package provides us with a nice implementation of this strategy:

from sklearn.model_selection import train_test_split, TimeSeriesSplit

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)

tscv = TimeSeriesSplit(n_splits=5)

for train_index, val_index in tscv.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

Providing n_splits=5 will ensure that five test sets are generated. It is worth noting that, by default, the train set will have a size of i × (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1) for the ith fold, while the test set will have a size of n_samples // (n_splits + 1).

However, you can change the train and test set sizes via the max_train_size and test_size arguments of the TimeSeriesSplit class. Additionally, there is also a gap argument that can be used to exclude G samples from the end of each train set, where G is a value you specify.
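The sketch below illustrates both behaviors on 23 placeholder samples. The first part compares the default split sizes with Scikit-Learn's documented default rule, under which the test size is n_samples // (n_splits + 1) and the train set of the ith fold (counting from 1) is i times that size plus the remainder n_samples % (n_splits + 1). The second part uses arbitrary values for max_train_size, test_size, and gap.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(23)  # placeholder time-ordered samples
n_samples, n_splits = len(data), 5

# Default behaviour: expanding train set, fixed-size test set.
default_sizes = [
    (len(train), len(val))
    for train, val in TimeSeriesSplit(n_splits=n_splits).split(data)
]
print(default_sizes)

# Sizes predicted by the default rule (i is the 1-based fold number).
test_size = n_samples // (n_splits + 1)
expected = [
    (i * test_size + n_samples % (n_splits + 1), test_size)
    for i in range(1, n_splits + 1)
]

# Custom behaviour: cap the train window, fix the test size, leave a gap.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=6, test_size=2, gap=1)
custom_splits = [(tr.tolist(), va.tolist()) for tr, va in tscv.split(data)]
for tr, va in custom_splits:
    print(tr, va)
```

With gap=1, one sample is skipped between the end of each train set and the start of the corresponding test set, which can help when neighboring samples are strongly correlated.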

You need to be aware that the Scikit-Learn implementation will always make sure that there is no overlap between test sets, which is actually not necessary. Currently, there is no way to enable the overlap between the test sets using the Scikit-Learn implementation. You need to write the code from scratch to perform that kind of strategy.
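As a starting point for such from-scratch code, here is one possible sketch. The helper overlapping_time_series_split and its parameters are hypothetical and not part of Scikit-Learn; it keeps the expanding train set but moves the test window forward by step samples, so consecutive test sets overlap whenever step is smaller than test_size.

```python
def overlapping_time_series_split(n_samples, min_train_size, test_size, step):
    """Yield (train_indices, test_indices) with an expanding train set.

    Hypothetical helper, not part of Scikit-Learn. When step < test_size,
    consecutive test windows overlap.
    """
    train_end = min_train_size
    while train_end + test_size <= n_samples:
        train_indices = list(range(0, train_end))
        test_indices = list(range(train_end, train_end + test_size))
        yield train_indices, test_indices
        train_end += step


splits = list(overlapping_time_series_split(
    n_samples=12, min_train_size=6, test_size=3, step=1))
for train_indices, test_indices in splits:
    print(train_indices, test_indices)
```

Every train set still ends before its test set begins, so no future samples leak into training even though the test sets share samples with each other.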

In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.

