Hyperparameter Tuning with Python

Creating training, validation, and test sets

We understand that overfitting can be detected by monitoring the model's performance on the training data versus unseen data, but what exactly is unseen data? Is it just any random data that the model has not seen during the training phase?

Unseen data is a portion of our original, complete data that was not seen by the model during the training phase. We usually refer to this unseen data as the test set. Let's imagine you have 100,000 samples of data to begin with; you can take out a portion of the data, say 10% of it, to become the test set. So now we have 90,000 samples in the training set and 10,000 samples in the test set.

However, it is better not to split our original data only into train and test sets, but also to set aside a validation set, especially when we want to perform hyperparameter tuning on our model. Let's say that, out of the 100,000 original samples, we hold out 10% to become the validation set and another 10% to become the test set. Therefore, we will have 80,000 samples in the train set, 10,000 samples in the validation set, and 10,000 samples in the test set.
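
As a concrete illustration, such an 80/10/10 split can be produced with two chained calls to scikit-learn's train_test_split. This is only a minimal sketch; the X and y arrays and the random_state value are assumptions for illustration, not part of the original text:

from sklearn.model_selection import train_test_split

# Assume X holds the features and y the labels for all 100,000 samples.
# First, hold out 10% of the original data as the test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Then hold out another 10,000 samples (about 11.1% of the remaining
# 90,000) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10_000 / 90_000, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 80000 10000 10000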

You might be wondering why we need a validation set in addition to the test set. Actually, we do not need one if we are not going to perform hyperparameter tuning or any other model-centric approach. The purpose of having a validation set is to preserve an unbiased evaluation on the test set, which is reserved for the final version of the trained model.

A validation set helps us get an unbiased evaluation on the test set because only the validation set is used during the hyperparameter tuning phase. Once we finish the hyperparameter tuning phase and arrive at the final model configuration, we can then evaluate our model on the purely unseen data, which is called the test set.
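
To make the role of each set explicit, here is a rough sketch of that workflow: candidate configurations are compared on the validation set only, and the test set is touched exactly once at the very end. The model class and the candidate hyperparameter values here are illustrative assumptions, not taken from the book:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative candidate configurations; any search strategy could produce them.
candidates = [
    {"n_estimators": 100, "max_depth": 5},
    {"n_estimators": 300, "max_depth": 10},
    {"n_estimators": 500, "max_depth": None},
]

best_score, best_params = -1.0, None
for params in candidates:
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)                          # train set: fitting only
    score = accuracy_score(y_val, model.predict(X_val))  # validation set: model selection
    if score > best_score:
        best_score, best_params = score, params

# Evaluate the chosen configuration exactly once on the purely unseen test set.
final_model = RandomForestClassifier(random_state=42, **best_params)
final_model.fit(X_train, y_train)
print(accuracy_score(y_test, final_model.predict(X_test)))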

Important Note

If you are going to perform any data preprocessing steps (for example, missing-value imputation, feature engineering, standardization, or label encoding), you have to fit them based on the train set only and then apply the fitted transformations to the validation and test sets. Do not perform those preprocessing steps on the full original data (before splitting), because doing so can lead to a data leakage problem.
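
For instance, an imputer or a scaler should be fit on the training set only and then merely applied to the other two sets. The following is a sketch assuming numeric features and scikit-learn transformers; the specific transformers chosen are an assumption for illustration:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit the preprocessing steps on the training data only...
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_prep = preprocess.fit_transform(X_train)

# ...and only apply (transform) them to the validation and test sets, so that
# no statistics computed from those sets leak into the fitted preprocessing.
X_val_prep = preprocess.transform(X_val)
X_test_prep = preprocess.transform(X_test)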

There is no specific rule when it comes to choosing the proportions of the train, validation, and test sets; you have to choose the split proportions yourself based on the situation you are facing. However, a common split used by the data science community is 8:2 or 9:1 between the train set and the combined validation and test sets, with the validation and test sets usually split 1:1 between themselves. Therefore, the common splitting proportions are 8:1:1 or 9:0.5:0.5 for the train, validation, and test sets, respectively.

Now that we are aware of the train, validation, and test set concepts, we need to learn how to build those sets. Do we just randomly split our original data into three sets, or can we also apply some predefined rules? We will explore this topic in more detail in the next section.
