5. Model Validation and Optimization

Overview

In this chapter, you will learn how to use k-fold cross-validation to test model performance, as well as how to use validation curves to optimize model parameters. You will also learn how to implement dimensionality reduction techniques such as Principal Component Analysis (PCA). By the end of this chapter, you will have completed an end-to-end machine learning project and produced a final model that can be used to make business decisions.

Introduction

As we've seen in the previous chapters, it's easy to train models with scikit-learn using just a few lines of Python code. This is made possible by abstracting away the computational complexity of the algorithm, including details such as constructing cost functions and optimizing model parameters. In other words, we deal with a black box whose internal operations are hidden from us.

While the simplicity offered by this approach is quite nice on the surface, it does nothing to prevent the misuse of algorithms—for example, by selecting the wrong model for a dataset, overfitting on the training set, or failing to test properly on unseen data.

In this chapter, we'll show you how to avoid some of these pitfalls while training classification models and equip you with the tools to produce trustworthy results. We'll introduce k-fold cross-validation and validation curves, and then look at ways to use them in Jupyter.
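
As a quick preview of the second of these tools, here is a minimal sketch of a validation curve built with scikit-learn's validation_curve helper. The KNN model, the n_neighbors parameter range, and the example dataset are illustrative assumptions on our part, not the chapter's own choices:

    # Hypothetical example: sweep a KNN hyperparameter and score each
    # value with cross-validation (the basis of a validation curve)
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import validation_curve
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    param_range = np.arange(1, 21)

    train_scores, val_scores = validation_curve(
        KNeighborsClassifier(), X, y,
        param_name="n_neighbors", param_range=param_range, cv=5)

    # Plotting the mean train/validation scores against param_range
    # gives the validation curve; the peak of the validation score
    # suggests a good parameter value
    best = param_range[val_scores.mean(axis=1).argmax()]
    print("best n_neighbors:", best)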

We'll also introduce...

Assessing Models with k-Fold Cross-Validation

Thus far, we have trained models on a subset of the data and then assessed performance on the unseen portion, called the test set. This is good practice because the model's performance on the data that's used for training is not a good indicator of its effectiveness as a predictor. It's very easy to increase accuracy on a training dataset by overfitting a model, which results in poorer performance on unseen data.
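
To make this concrete, here is a small sketch showing how a model can score perfectly on its training data yet noticeably worse on held-out data. The dataset and the unconstrained decision tree are our own illustrative choices, not the chapter's:

    # Hypothetical example: an unconstrained decision tree overfits,
    # so its training accuracy overstates its real performance
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))  # typically 1.0
    print("test accuracy:", model.score(X_test, y_test))     # noticeably lower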

That being said, simply training models on data that's been split in this way is not good enough. There is natural variance in the data that causes accuracies to differ (even if only slightly) depending on the training and test splits. Furthermore, using only one training/test split to compare models can introduce bias toward certain models and lead to overfitting.

k-Fold cross-validation offers a solution to this problem and allows the variance to be accounted for by way of an error estimate on each accuracy...
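
In scikit-learn, this takes only a few lines via cross_val_score; the model and dataset below are illustrative assumptions rather than the chapter's own:

    # Hypothetical example: 10-fold cross-validation with scikit-learn.
    # Each fold serves once as the validation set while the model is
    # trained on the remaining nine folds
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)

    # The spread of the fold scores provides the error estimate on
    # the accuracy
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")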

Dimensionality Reduction with PCA

Dimensionality reduction can be as simple as removing unimportant features from the training data. However, it's usually not obvious that removing a set of features will boost model performance. Even features that are highly noisy may offer some valuable information that models can learn from. For these reasons, we should know about better methods for reducing data dimensionality, such as the following:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)

These techniques allow for data compression, where the most important information from a large group of features can be encoded in just a few features.

In this section, we'll focus on PCA. This technique transforms the data by projecting it onto a new subspace of orthogonal principal components, where the components with the largest eigenvalues encode the most information for training the model. Then, we can simply select a set of...
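
As a minimal sketch of what this looks like in scikit-learn (the example dataset and the choice of two components are illustrative assumptions):

    # Hypothetical example: compress a many-featured dataset down to
    # its top two principal components
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)  # 30 features

    # PCA is sensitive to feature scale, so standardize first
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)     # shape (n_samples, 2)

    # Fraction of the total variance captured by each retained component
    print(pca.explained_variance_ratio_)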

Summary

In this chapter, we have seen how to use Jupyter Notebooks to perform parameter optimization and model selection.

We built upon the work we did in the previous chapter, where we trained predictive classification models for our binary problem and saw how decision boundaries are drawn for SVM, KNN, and Random Forest models. We improved on these simple models by using validation curves to optimize parameters and explored how dimensionality reduction can improve model performance as well.

Finally, at the end of the last exercise, we explored how the final model can be used in practice to make data-driven decisions. This demonstration connected our results back to the original business problem that motivated our modeling work.

In the next chapter, we will depart from machine learning and focus on data acquisition instead. Specifically, we will discuss methods for extracting web data and learn about HTTP requests, web scraping with Python, and more data processing...
