Introduction to Cross-Validation
Cross-validation is a cornerstone technique for assessing how an ML model will perform on unseen data. Instead of relying on a single train-test split, we divide the dataset into multiple subsets, training and validating the model several times to get a better estimate of its generalization ability. In this recipe, we’ll explore different types of cross-validation, including k-fold and stratified k-fold cross-validation, and walk through how to implement them using scikit-learn.
k-fold cross-validation
The k in k-fold refers to the number of folds, or subsets, into which we split the dataset. The term k is used much as it is in k-means clustering, which we saw in the previous chapter.
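As a minimal sketch of the idea, the example below uses scikit-learn's KFold to split a tiny array into k = 5 folds; the toy data and the choice of k are illustrative, not part of the recipe itself:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=5)  # k = 5 folds

# Each iteration holds out a different fold (2 of the 10 samples)
# for validation and trains on the remaining 8.
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
```

Every sample appears in a validation fold exactly once, so with k = 5 the model is trained and evaluated five times.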
Getting ready
We begin by loading the libraries and dataset we'll use to demonstrate cross-validation strategies. We'll use a classification dataset generated by make_classification.
Load the libraries:
import numpy as np
from sklearn.datasets import make_classification
from...
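With the imports in place, a dataset can be generated along these lines; the specific arguments below (sample count, feature counts, random seed) are assumptions for illustration and may differ from the recipe's actual call:

```python
import numpy as np
from sklearn.datasets import make_classification

# Hypothetical parameters chosen for illustration only.
X, y = make_classification(
    n_samples=1000,     # number of samples
    n_features=20,      # total features
    n_informative=10,   # features that actually carry signal
    n_classes=2,        # binary classification target
    random_state=42,    # reproducible data
)

print(X.shape, y.shape)  # feature matrix and label vector
```

The resulting X is a (1000, 20) feature matrix and y a vector of 0/1 labels, which is what the cross-validation splitters below will partition.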