Mastering Predictive Analytics with scikit-learn and TensorFlow

Product type: Book
Published: Sep 2018
Publisher: Packt
ISBN-13: 9781789617740
Pages: 154
Edition: 1st
Author: Alvaro Fuentes

Cross-validation and Parameter Tuning

Predictive analytics is about making predictions for unknown events. We use it to produce models that generalize to unseen data. For this, we use a technique called cross-validation.

Cross-validation is a validation technique for assessing how the results of a statistical analysis generalize to an independent dataset; it gives a measure of out-of-sample accuracy. It achieves this by averaging over several random partitions of the data into training and test samples. It is often used for hyperparameter tuning: we run cross-validation for several candidate values of a parameter and choose the value that gives the lowest average cross-validation error.
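The parameter-selection idea above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a synthetic dataset; the candidate `max_depth` values are arbitrary:

```python
# Picking a hyperparameter (max_depth) by lowest average
# cross-validation error, on synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

best_depth, best_mse = None, np.inf
for depth in [2, 4, 6, 8]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # cross_val_score averages the metric over several train/test partitions;
    # neg_mean_squared_error is negated so that higher is always better.
    mse = -cross_val_score(model, X, y,
                           scoring='neg_mean_squared_error', cv=5).mean()
    if mse < best_mse:
        best_depth, best_mse = depth, mse

# best_depth now holds the value with the lowest cross-validated error
```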

There are two kinds of cross-validation: exhaustive and non-exhaustive. K-fold is an example of non-exhaustive cross-validation. It is a technique for getting a more accurate assessment...

Holdout cross-validation

In holdout cross-validation, we hold out a percentage of the observations, giving us two datasets: the training dataset and the testing dataset. We train the model on the training dataset and calculate our evaluation metrics on the testing dataset. This is the process of holdout cross-validation.

The main advantage of holdout cross-validation is that it is very easy to implement and it is a very intuitive method of cross-validation.
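The holdout process can be sketched in a few lines, assuming scikit-learn and a synthetic dataset in place of the book's data:

```python
# Holdout cross-validation: hold out 20% of the observations as a
# test set, train on the rest, and evaluate on the held-out part.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train, 20% held out

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))  # a single estimate
```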

The problem with this kind of cross-validation is that it provides only a single estimate of the model's evaluation metric. This is problematic because some models rely on randomness, so in principle the evaluation metrics calculated on the test set can vary considerably by random chance. So the main problem with holdout cross-validation...

K-fold cross-validation

In k-fold cross-validation, we essentially perform holdout cross-validation many times: we partition the dataset into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k−1 subsamples are used as training data. This process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
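The loop described above can be written out explicitly with scikit-learn's `KFold` splitter; this is a minimal sketch on synthetic data:

```python
# k-fold cross-validation with k=5: each fold serves as the
# validation set exactly once, and the k results are averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mses = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

avg_mse = np.mean(mses)  # the k results averaged into a single estimate
```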

The following figure shows a visual example of 5-fold cross-validation (k=5):

Here, we see that our dataset gets divided into five parts. We use the first part for testing and the rest for training.

The following are the steps we follow in the 5-fold cross-validation method:

  1. We get the first estimation of our evaluation metrics...

Comparing models with k-fold cross-validation

Since k-fold cross-validation has proved to be the better method, it is more suitable for comparing models. The reason is that k-fold cross-validation gives multiple estimates of the evaluation metrics, and by averaging these estimates we get a better assessment of model performance.
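The comparison idea can be sketched before turning to the book's dataset. This example assumes scikit-learn and synthetic regression data (the book uses the diamonds dataset, which is not bundled here); the two candidate models are illustrative:

```python
# Comparing two models by their average k-fold score: the model
# with the higher mean cross-validation score is preferred.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

models = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}
# average the k estimates to get a single assessment per model
avg_scores = {name: cross_val_score(m, X, y, cv=5).mean()
              for name, m in models.items()}
best = max(avg_scores, key=avg_scores.get)
```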

The following shows the code used to import libraries for comparing models:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

After importing libraries, we'll import the diamond dataset. The following shows the code used to prepare this diamond dataset:

# importing data
data_path= '../data/diamonds.csv'
diamonds = pd.read_csv(data_path)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies...

Introduction to hyperparameter tuning

The process of choosing the best estimator for a particular dataset, or the best values for all hyperparameters, is called hyperparameter tuning. Hyperparameters are parameters that are not learned directly within estimators; their values are set by the modeler.

For example, in the RandomForestClassifier object, there are a lot of hyperparameters, such as n_estimators, max_depth, max_features, and min_samples_split. Modelers decide the values for these hyperparameters.
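Setting these by hand looks like the following; the particular values here are arbitrary choices for illustration, not recommendations:

```python
# Hyperparameters are passed to the estimator's constructor;
# none of these values are learned from the data.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # maximum depth of each tree
    max_features='sqrt',   # features considered at each split
    min_samples_split=4,   # minimum samples required to split a node
)
```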

Exhaustive grid search

One of the most important and generally-used methods for performing hyperparameter tuning is called the exhaustive grid search. This is a brute-force approach because it tries all of...
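In scikit-learn, this brute-force search is provided by `GridSearchCV`, which evaluates every combination in a parameter grid by cross-validation. A minimal sketch, assuming synthetic data and an arbitrary small grid:

```python
# Exhaustive grid search: every combination in param_grid
# (2 x 2 = 4 here) is cross-validated, and the best is kept.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [4, 8],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
# grid.best_params_ holds the combination with the best average score
```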

Summary

In this chapter, we learned about cross-validation and its different methods, including holdout cross-validation and k-fold cross-validation. We saw that k-fold cross-validation is essentially holdout cross-validation repeated many times. We implemented k-fold cross-validation using the diamond dataset. We also compared different models using k-fold cross-validation and found the best-performing model, which was the random forest model.

Then, we discussed hyperparameter tuning and the exhaustive grid-search method used to perform it. We implemented hyperparameter tuning, again using the diamond dataset, and compared tuned and untuned models, finding that a tuned model performs better than an untuned one.

In the next chapter, we will study feature selection methods, dimensionality reduction...
