Mastering Predictive Analytics with scikit-learn and TensorFlow

Product type: Book
Published: Sep 2018
Publisher: Packt
ISBN-13: 9781789617740
Pages: 154
Edition: 1st
Author: Alvaro Fuentes

Cross-validation and Parameter Tuning

Predictive analytics is about making predictions for unknown events. We use it to produce models that generalize to unseen data. For this, we use a technique called cross-validation.

Cross-validation is a validation technique for assessing how the results of a statistical analysis generalize to an independent dataset; it gives a measure of out-of-sample accuracy. It achieves this by averaging over several random partitions of the data into training and test samples. It is often used for hyperparameter tuning: we run cross-validation for several candidate values of a parameter and choose the value that gives the lowest average cross-validation error.
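The parameter-selection idea above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a synthetic dataset; the candidate `max_depth` values are arbitrary:

```python
# Picking a hyperparameter (max_depth) by lowest average
# cross-validation error, on synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

best_depth, best_mse = None, np.inf
for depth in [2, 4, 6, 8]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # cross_val_score averages the metric over several train/test partitions;
    # neg_mean_squared_error is negated so that higher is always better.
    mse = -cross_val_score(model, X, y,
                           scoring='neg_mean_squared_error', cv=5).mean()
    if mse < best_mse:
        best_depth, best_mse = depth, mse

# best_depth now holds the value with the lowest cross-validated error
```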

There are two kinds of cross-validation: exhaustive and non-exhaustive. K-fold is an example of non-exhaustive cross-validation. It is a technique for getting a more accurate assessment...

Holdout cross-validation

In holdout cross-validation, we hold out a percentage of the observations, giving us two datasets: the training dataset and the testing dataset. We train the model on the training dataset and calculate our evaluation metrics on the testing dataset. This is the process of holdout cross-validation.

The main advantage of holdout cross-validation is that it is very easy to implement and it is a very intuitive method of cross-validation.
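The holdout process can be sketched in a few lines, assuming scikit-learn and a synthetic dataset in place of the book's data:

```python
# Holdout cross-validation: hold out 20% of the observations as a
# test set, train on the rest, and evaluate on the held-out part.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train, 20% held out

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))  # a single estimate
```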

The problem with this kind of cross-validation is that it provides only a single estimate of the model's evaluation metric. This is problematic because some models rely on randomness, so in principle the evaluation metrics calculated on the test set can vary considerably by random chance. So the main problem with holdout cross-validation...

K-fold cross-validation

In k-fold cross-validation, we essentially perform holdout cross-validation many times: we partition the dataset into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k−1 subsamples are used as training data. This process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
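The loop described above can be written out explicitly with scikit-learn's `KFold` splitter; this is a minimal sketch on synthetic data:

```python
# k-fold cross-validation with k=5: each fold serves as the
# validation set exactly once, and the k results are averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mses = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

avg_mse = np.mean(mses)  # the k results averaged into a single estimate
```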

The following figure shows a visual example of 5-fold cross-validation (k=5):

Here, we see that our dataset gets divided into five parts. We use the first part for testing and the rest for training.

The following are the steps we follow in the 5-fold cross-validation method:

  1. We get the first estimation of our evaluation metrics...

Comparing models with k-fold cross-validation

Since k-fold cross-validation has proved to be the better method, it is more suitable for comparing models. The reason is that k-fold cross-validation gives multiple estimates of the evaluation metrics, and by averaging these estimates we get a better assessment of model performance.
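The comparison idea can be sketched before turning to the book's dataset. This example assumes scikit-learn and synthetic regression data (the book uses the diamonds dataset, which is not bundled here); the two candidate models are illustrative:

```python
# Comparing two models by their average k-fold score: the model
# with the higher mean cross-validation score is preferred.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

models = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}
# average the k estimates to get a single assessment per model
avg_scores = {name: cross_val_score(m, X, y, cv=5).mean()
              for name, m in models.items()}
best = max(avg_scores, key=avg_scores.get)
```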

The following shows the code used to import libraries for comparing models:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

After importing libraries, we'll import the diamond dataset. The following shows the code used to prepare this diamond dataset:

# importing data
data_path= '../data/diamonds.csv'
diamonds = pd.read_csv(data_path)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies...

Introduction to hyperparameter tuning

The process of choosing the best estimator for a particular dataset, or the best values for all hyperparameters, is called hyperparameter tuning. Hyperparameters are parameters that are not learned directly within estimators; their values are set by the modeler.

For example, in the RandomForestClassifier object, there are a lot of hyperparameters, such as n_estimators, max_depth, max_features, and min_samples_split. Modelers decide the values for these hyperparameters.
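Setting these by hand looks like the following; the particular values here are arbitrary choices for illustration, not recommendations:

```python
# Hyperparameters are passed to the estimator's constructor;
# none of these values are learned from the data.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # maximum depth of each tree
    max_features='sqrt',   # features considered at each split
    min_samples_split=4,   # minimum samples required to split a node
)
```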

Exhaustive grid search

One of the most important and generally-used methods for performing hyperparameter tuning is called the exhaustive grid search. This is a brute-force approach because it tries all of...
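In scikit-learn, this brute-force search is provided by `GridSearchCV`, which evaluates every combination in a parameter grid by cross-validation. A minimal sketch, assuming synthetic data and an arbitrary small grid:

```python
# Exhaustive grid search: every combination in param_grid
# (2 x 2 = 4 here) is cross-validated, and the best is kept.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [4, 8],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
# grid.best_params_ holds the combination with the best average score
```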

Summary

In this chapter, we learned about cross-validation and its different methods, including holdout cross-validation and k-fold cross-validation. We saw that k-fold cross-validation is essentially holdout cross-validation repeated many times. We implemented k-fold cross-validation using the diamond dataset. We also compared different models using k-fold cross-validation and found the best-performing model, which was the random forest model.

Then, we discussed hyperparameter tuning and the exhaustive grid-search method used to perform it. We implemented hyperparameter tuning, again using the diamond dataset, and compared tuned and untuned models, finding that a tuned model performs better than an untuned one.

In the next chapter, we will study feature selection methods, dimensionality reduction...
