Creating training, validation, and test sets
We understand that overfitting can be detected by monitoring the model's performance on the training data versus unseen data, but what exactly is unseen data? Is it just any random data that the model has not encountered during the training phase?
Unseen data is a portion of our original complete data that the model did not see during the training phase. We usually refer to this unseen data as the test set. Let's imagine you have 100,000 samples of data to begin with; you can hold out a portion of it, say 10%, to become the test set. This leaves us with 90,000 samples as the training set and 10,000 samples as the test set.
However, it is better not to split our original data into just train and test sets, but to also carve out a validation set, especially when we want to perform hyperparameter tuning on our model. Let's say that out of the 100,000 original samples, we hold out 10% to become the validation set and another 10% to become the test set. This leaves us with 80,000 samples as the train set, 10,000 samples as the validation set, and 10,000 samples as the test set.
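As a minimal sketch, this 80/10/10 split could be implemented by calling scikit-learn's train_test_split twice; the synthetic data and variable names here are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100,000 samples with 10 features each
X = np.random.rand(100_000, 10)
y = np.random.randint(0, 2, size=100_000)

# First split: hold out 10% of the original data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Second split: hold out another 10,000 samples (1/9 of the
# remaining 90,000) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=1/9, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 80000 10000 10000
```

Note that the second split uses test_size=1/9 rather than 0.1, because it operates on the remaining 90,000 samples and we still want 10,000 of the original samples in the validation set.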
You might be wondering why we need a validation set apart from the test set. Actually, we do not need one if we do not want to perform hyperparameter tuning or any other model-centric approaches. The purpose of having a validation set is to keep the final evaluation on the test set unbiased: all the decisions made during tuning are based on the validation set, not the test set.
A validation set can give us this unbiased evaluation because we incorporate only the validation set during the hyperparameter tuning phase. Once we finish hyperparameter tuning and arrive at the final model configuration, we can then evaluate our model on the purely unseen data, which is the test set.
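Continuing from the split above, here is a hedged sketch of this workflow; the random forest model and the list of candidate hyperparameter values are placeholders, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Try a few candidate hyperparameter values; only the validation set
# guides the choice, so the test set stays untouched during tuning
best_score, best_model = -1.0, None
for n_estimators in [50, 100, 200]:  # placeholder search space
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_model = score, model

# Only now do we touch the test set: a single, unbiased evaluation
# of the final model configuration
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(f"Validation accuracy: {best_score:.3f}, test accuracy: {test_score:.3f}")
```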
Important Note
If you are going to perform any data preprocessing steps (for example, missing value imputation, feature engineering, standardization, label encoding, and more), you have to fit the preprocessing function on the train set only and then apply it to the validation and test sets. Do not perform those data preprocessing steps on the full original data (before data splitting), because statistics computed on the full data would carry information from the validation and test sets into training, which leads to a data leakage problem.
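As an illustration of this pattern, standardization with scikit-learn's StandardScaler might look like the following sketch, reusing the splits from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the train set only, so the scaling statistics
# (mean and standard deviation) come from the training data alone
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the already-fitted scaler to the validation and test sets;
# no statistics from those sets leak into the preprocessing step
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```

The same fit-on-train, transform-everywhere-else discipline applies to imputers, encoders, and any other fitted preprocessing step.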
There is no specific rule when it comes to choosing the proportions of the train, validation, and test sets; you have to choose the split proportion yourself based on the situation you are faced with. However, a common split used by the data science community is 8:2 or 9:1 between the train set and the combined validation and test sets, with the validation and test sets usually split 1:1 between themselves. Therefore, the common splitting proportions are 8:1:1 or 9:0.5:0.5 for the train, validation, and test sets, respectively.
Now that we are familiar with the concept of train, validation, and test sets, we need to learn how to build those sets. Do we just randomly split our original data into three sets, or can we also apply some predefined rules? In the next section, we will explore this topic in more detail.