Types of datasets and splits
Typically, we train our model on a training set and evaluate it on an independent, unseen dataset called the test set. This gives a fair assessment of the model: if we instead trained on the full dataset and evaluated on that same data, we would not know how well the model performs on unseen data, and we would likely overestimate its quality because of overfitting.
We may encounter three kinds of datasets in machine learning:
- Training set: A dataset on which the model is trained.
- Validation set: A dataset used for tuning the hyperparameters of the model. A validation set is often referred to as a development set.
- Evaluation set or test set: A dataset used for evaluating the performance of the model.
When working with small example datasets, it’s common to allocate 80% of the data for the training set, 10% for the validation set, and 10% for the test set. However, the specific ratio between training and test sets is not as important as ensuring that the test set is large enough to provide statistically meaningful evaluation results. In the context of big data, a split of 98%, 1%, and 1% for training, validation, and test sets, respectively, could be appropriate.
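The 80/10/10 split described above can be sketched with scikit-learn's `train_test_split` (a common but not mandatory choice of library). The toy arrays here are purely illustrative; the trick is to carve off 20% first and then split that holdout in half:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 4 features each (illustrative only)
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First set aside 20% of the data, then split that 20% in half,
# yielding an 80/10/10 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```

For a big-data 98/1/1 split, the same two-step approach works with `test_size=0.02` in the first call.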
Often, people don’t keep a dedicated validation set for hyperparameter tuning and simply refer to the test set as the evaluation set. This can happen when hyperparameter tuning is a one-off activity rather than part of the regular training cycle.
Cross-validation
Cross-validation is a term whose meaning is hard to guess from the name alone. Breaking it down, cross + validation suggests validation performed across something – in our case, across different subsets of the data that take turns serving as the test set.
Let’s see what cross-validation is:
- Cross-validation is a technique that’s used to estimate how accurately a model will perform in practice
- It is typically used to detect overfitting – that is, failing to generalize patterns in data, particularly when the amount of data may be limited
Let’s look at the different types of cross-validation:
- Holdout: In the holdout method, we randomly assign data points to two sets, usually called the training set and the test set, respectively. We then train (build a model) on the training set and test (evaluate its performance) on the test set.
- k-fold: This works as follows:
  - We randomly shuffle the data.
  - We divide all the data into k parts, also known as folds. We train the model on k-1 folds and evaluate it on the remaining fold. We record the performance of this model using our chosen model evaluation metric, then discard the model.
  - We repeat this process k times, each time holding out a different fold for testing. We then take the average of the evaluation metric values (for example, accuracy) across all k models; this average represents the overall performance of the model.
k-fold cross-validation is mainly used when you have limited data points – say, 100. Using 5 or 10 folds is most common.
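The k-fold procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score` (one reasonable implementation, not the only one). The synthetic dataset and the logistic regression model are stand-ins for whatever data and estimator you are working with:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (~100 points), the regime where k-fold CV helps most
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# shuffle=True performs the random-shuffling step; with 5 folds, each of the
# 5 models trains on 80 points and is evaluated on the remaining 20.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

# The mean of the per-fold accuracies is the overall performance estimate
print(scores)
print(scores.mean())
```

Each entry of `scores` is the accuracy recorded on one held-out fold; their average is the figure you would report.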
Let’s look at the common evaluation metrics in machine learning, with a special focus on the ones relevant to problems with imbalanced data.