Learn Python by Building Data Science Applications

You're reading from Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Publisher: Packt
ISBN-13: 9781789535365
Pages: 482
Edition: 1st
Authors (2): Philipp Kats, David Katz

Table of Contents (26 chapters)

Preface

Section 1: Getting Started with Python
  Preparing the Workspace
  First Steps in Coding - Variables and Data Types
  Functions
  Data Structures
  Loops and Other Compound Statements
  First Script – Geocoding with Web APIs
  Scraping Data from the Web with Beautiful Soup 4
  Simulation with Classes and Inheritance
  Shell, Git, Conda, and More – at Your Command

Section 2: Hands-On with Data
  Python for Data Applications
  Data Cleaning and Manipulation
  Data Exploration and Visualization
  Training a Machine Learning Model
  Improving Your Model – Pipelines and Experiments

Section 3: Moving to Production
  Packaging and Testing with Poetry and PyTest
  Data Pipelines with Luigi
  Let's Build a Dashboard
  Serving Models with a RESTful API
  Serverless API Using Chalice
  Best Practices and Python Performance

Assessments
Other Books You May Enjoy

Improving Your Model – Pipelines and Experiments

In the previous chapter, we trained a basic machine learning (ML) model. However, most real-world scenarios require models to be accurate, and that means the model and features need to be improved and fine-tuned for a specific task. This process is usually long, iterative, and based on trial and error.

In this chapter, we will see how to improve and validate model quality and keep track of all of the experiments along the way. By the end, we will have a better-performing model and a workflow for tracking experiments and logging metrics and parameters. In particular, we'll learn the following:

  • Understanding cross-validation and overfitting
  • Adding features in order to improve models
  • Wrapping models and transformations into pipelines
  • Version control of our datasets and metrics using the dvc package
...

Technical requirements

Understanding cross-validation

In the previous chapter, we built a model with certain assumptions and settings, measuring its performance with an accuracy metric (the overall ratio of correctly classified labels). To do this, we split our data randomly into training and testing sets. While that approach is fundamental, it has its problems. Most importantly, we may fine-tune our model to gain better performance on that particular test set at the expense of performance on new, unseen data (in other words, the model gets worse overall while its metric on this specific dataset improves). This phenomenon is called overfitting.
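For reference, here is a minimal sketch of that single-split approach, assuming a pandas DataFrame named data, the feature list cols, and a hypothetical 'result' target column (the decision tree classifier and split settings are illustrative, not the book's exact code):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# one random split: convenient, but easy to over-tune against this particular test set
X_train, X_test, y_train, y_test = train_test_split(
    data[cols], data['result'], test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=10, random_state=42)
model.fit(X_train, y_train)

# accuracy on a single, fixed test set - exactly the number we risk overfitting to
print(accuracy_score(y_test, model.predict(X_test)))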

To combat this issue, we'll use a slightly more complex approach: cross-validation. In its basic form, cross-validation splits the data into multiple so-called folds, or data subsections. Usually, each fold has approximately the same size and can be further balanced by the target variable...
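The rest of this discussion is abridged above, but as a rough sketch of the idea, stratified k-fold cross-validation with scikit-learn could look like the following (same assumed data, cols, and 'result' column as before; the number of folds is arbitrary):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=10, random_state=42)

# four folds of roughly equal size, balanced by the target variable
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, data[cols], data['result'], cv=cv, scoring='accuracy')

# one accuracy value per fold; the spread hints at how stable the model really is
print(scores, scores.mean())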

Exploring feature engineering

Now that we have a system to fairly compare models without fear of overfitting, let's think about how we can improve the model itself. One way is to create new features of our own that add more context, for example, the ratio of armies on the two sides or the absolute difference in the number of soldiers; we can't say in advance which will work better. Let's try it out with the help of the following code:

  1. First, we'll create a ratio of soldiers on either side:

     data['infantry_ratio'] = data['allies_infantry'] / data['axis_infantry']
     cols.append('infantry_ratio')

  2. Now, we won't do that for tanks, planes, and so on, as the numbers here are very small and we'll have to deal with division by zero. Instead, we...
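(The remaining steps are abridged above.) Purely as an illustration of the alternative mentioned earlier, an absolute-difference feature could be added in the same way; the infantry_diff name here is hypothetical, not necessarily what the chapter uses:

# absolute difference in infantry numbers - an alternative to the ratio feature above
data['infantry_diff'] = (data['allies_infantry'] - data['axis_infantry']).abs()
cols.append('infantry_diff')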

Optimizing the hyperparameters

There are probably a lot of other features we could add, but let's now shift our attention to the model itself. So far, we have assumed the model's default, static parameters, restricting only its max_depth parameter to an arbitrary value. Now, let's try to fine-tune those parameters. If done properly, this process can add a few percentage points to the model's accuracy, and sometimes even a small gain in a performance metric can be a game-changer.

To do this, we'll use RandomizedSearchCV: another wrapper around the concept of cross-validation, but this time one that also iterates over parameters of the model, trying to find the optimal ones. A simpler approach, GridSearchCV, takes a finite set of values for each parameter, generates all of their combinations, and runs them all iteratively; essentially, a brute-force approach.

Randomized...
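The details of the randomized search are abridged above, but a minimal sketch of how such a search could be wired up looks like the following (the parameter distributions, number of iterations, and fold count are illustrative assumptions, not the book's exact settings):

from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# sample candidate values from distributions instead of enumerating a full grid
param_distributions = {
    'max_depth': stats.randint(2, 20),
    'min_samples_leaf': stats.randint(1, 10),
    'criterion': ['gini', 'entropy'],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,           # number of random parameter combinations to try
    cv=4,                # cross-validation folds for each combination
    scoring='accuracy',
    random_state=42,
)
search.fit(data[cols], data['result'])

# best combination found and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)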

Tracking your data and metrics with version control

As with all ML projects, there is always room for improvement, especially once we converge on the actual use case. But let's switch gears and talk about the technical side of things.

As you probably noticed, throughout this chapter we had to iterate constantly, adding and removing features from the data and changing the model's settings. And, as we mentioned, only about one-third of the initial experiments made it into this book. That is manageable for a toy dataset like ours, but eventually we could be swamped by different versions and iterations of the model.

In Chapter 9, Shell, Git, Conda, and More – at Your Command, we learned about Git, a system that stores versions of code so that you can safely switch back to a previous version or even keep working on different versions of...

Summary

Over the course of this chapter, we worked iteratively on improving the machine learning model we built in Chapter 13, Training a Machine Learning Model—adding features and tuning it to achieve maximum performance. As the code and iterations get more complex and multiple trial-and-error attempts are required, it is important to keep track of your research. Therefore, we further discussed how to keep track of not only the code but also data and metrics, making sure we can always switch back and reproduce any of the previous versions.

In the next chapter, we'll take another stab at our Wikipedia scraping code, building it into an independent Python library you could share with your friends and colleagues. Throughout the rest of this book, we will focus on different ways of delivering our code as a product to the client—as a standalone package, scheduled...

Questions

  1. What is overfitting?
  2. Why should we use cross-validation?
  3. Why can it be bad if our metrics are improving on the test set?
  4. Which features are useful for improving model performance on cross-validation?
  5. Why do some features decrease the performance of a decision tree on test data or in cross-validation?
  6. What is the difference between the random search and grid search algorithms for parameter optimization?
  7. Why is Git not sufficient for data version control?
  8. What are the alternatives to DVC for data version control and experimentation logging?

Further reading
