
You're reading from Machine Learning Engineering with Python - Second Edition

Product type: Book
Published in: Aug 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837631964
Edition: 2nd Edition
Author: Andrew P. McMahon

Andrew P. McMahon has spent years building high-impact ML products across a variety of industries. He is currently Head of MLOps for NatWest Group in the UK and has a PhD in theoretical condensed matter physics from Imperial College London. He is an active blogger, speaker, podcast guest, and leading voice in the MLOps community. He is co-host of the AI Right podcast and was named ‘Rising Star of the Year’ at the 2022 British Data Awards and ‘Data Scientist of the Year’ by the Data Science Foundation in 2019.

Train-persist

Option 2 is that training runs in batch, while prediction runs in whatever mode is deemed appropriate, with the prediction solution reading in the trained model from a store. We will call this design pattern train-persist. This is shown in the following diagram:

Figure 3.3 – The train-persist process

If we are going to train our model and then persist it so that it can be picked up later by a prediction process, then we need to answer a few questions:

  • What are our model storage options?
  • Is there a clear mechanism for accessing our model store (writing to and reading from)?
  • How often should we train versus how often will we predict?

In our case, we will solve the first two questions by using MLflow, which we introduced in Chapter 2, The Machine Learning Development Process, but will revisit in later sections. There are also lots of other solutions available. The key point is that no matter what you use as a model store and handover point between...
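MLflow will be our model store and handover point later in the chapter, but the essence of the train-persist pattern can be sketched with nothing more than the standard library. The `ThresholdModel` below is an invented stand-in for a real trained model; the point is that training writes the model to a store and a separate prediction process reads it back:

```python
import pickle
import tempfile
from pathlib import Path

# A stand-in "model": any object with a predict method would do.
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return int(x > self.threshold)

# Training process: fit (here, just configure) the model, then
# persist it to the agreed handover location.
store = Path(tempfile.mkdtemp())
model_path = store / "model.pkl"
trained = ThresholdModel(threshold=0.5)
with open(model_path, "wb") as f:
    pickle.dump(trained, f)

# Prediction process (potentially a different job, host, or schedule):
# read the model back from the store and serve predictions.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(0.9))  # -> 1
```

In a real system the store would be an artifact repository or model registry rather than a temporary directory, but the contract is the same: the trainer writes, the predictor reads.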

Retraining required

You wouldn't expect to finish your education and then never read a paper or book, or speak to anyone, ever again; you would quickly lose the ability to make informed decisions about what is happening in the world. In the same way, you shouldn't expect an ML model to be trained once and then remain performant forever afterward.

This idea is intuitive, but it represents a formal problem for ML models known as drift. Drift is a term that covers a variety of reasons for your model's performance dropping over time. It can be split into two main types:

  1. Concept drift: This happens when there is a change in the fundamental relationship between the features of your data and the outcome you are trying to predict. (By contrast, a change in the distribution of the input features themselves is usually called data drift or covariate shift.) An example could be that at the time of training, you only have a subsample of data that seems to show a linear relationship between the features and your outcome. If it turns out that, after gathering a lot more data...
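Later in the chapter we use alibi-detect and evidently for drift detection, but the basic shape of a drift check can be seen in plain Python. The `mean_shift_detected` helper below is an invented illustration, not a library function: it flags drift when the mean of a live window of a feature moves too many standard errors away from the reference (training-time) mean:

```python
import random
import statistics

def mean_shift_detected(reference, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    standard errors away from the reference mean."""
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference)
    std_err = ref_sd / (len(live) ** 0.5)
    return abs(statistics.mean(live) - ref_mean) > threshold * std_err

random.seed(42)
reference = [random.gauss(0.0, 1.0) for _ in range(1000)]  # training data
stable = [random.gauss(0.0, 1.0) for _ in range(200)]      # same distribution
shifted = [random.gauss(1.5, 1.0) for _ in range(200)]     # drifted feature

print(mean_shift_detected(reference, stable))
print(mean_shift_detected(reference, shifted))
```

Real drift detectors are considerably more sophisticated (multivariate tests, windowing strategies, corrections for multiple comparisons), but they answer the same question: has the data the model sees in production moved away from the data it was trained on?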

Learning about learning

At their heart, ML algorithms all contain one key feature: an optimization of some kind. The fact that these algorithms learn (meaning that they iteratively improve their performance with respect to an appropriate metric upon exposure to more observations) is what makes them so powerful and exciting. This process of learning is what we refer to when we say training.

In this section, we will cover the key concepts underpinning training, the options we can select in our code, and what these mean for the potential performance and capabilities of our training system.

Defining the target

We have just stated that training is an optimization, but what exactly are we optimizing? Let's consider supervised learning. In training, we provide the labels or values that we want to predict for the given features so that the algorithm can learn the relationship between the features and the target. To optimize the internal parameters of the algorithm during training, it needs...
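To make "training as optimization" concrete, here is a hand-rolled sketch (not any particular library's implementation): we fit a single weight w in the model y_hat = w * x by gradient descent on a mean squared error loss over a toy dataset:

```python
# Training as optimization: minimize the mean squared error loss
# L(w) = mean((w*x - y)^2) over a toy dataset by gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with the "true" weight w = 2

w = 0.0
learning_rate = 0.01
for _ in range(500):
    # dL/dw = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 3))  # -> 2.0
```

Every supervised algorithm repeats this basic loop in some form: a loss function measures how wrong the current parameters are, and an optimizer nudges them in the direction that reduces that loss.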

AutoML

The final level of our hierarchy is the one where we, as the engineer, have the least direct control over the training process, but where we also potentially get a good answer for very little effort!

The development time that's required to search through many hyperparameters and algorithms for your problem can be large, even when you code up reasonable-looking search parameters and loops.
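To see why this gets expensive, here is what such a search loop looks like in miniature. The `evaluate` function is a stand-in for training and scoring a real model with the given hyperparameters; with real models, every cell of the grid costs a full training run:

```python
import itertools

# Stand-in for "train a model with these hyperparameters and return
# its validation score"; real code would fit and evaluate a model here.
def evaluate(learning_rate, max_depth):
    return -abs(learning_rate - 0.1) - abs(max_depth - 4) * 0.01

param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [2, 4, 8],
}

best_score, best_params = float("-inf"), None
for lr, depth in itertools.product(*param_grid.values()):
    score = evaluate(lr, depth)
    if score > best_score:
        best_score = score
        best_params = {"learning_rate": lr, "max_depth": depth}

print(best_params)  # -> {'learning_rate': 0.1, 'max_depth': 4}
```

Even this tiny grid needs nine evaluations; add a few more hyperparameters, each with a few more candidate values, and the combinatorics quickly become the dominant cost of development.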

Given this, the past few years have seen the deployment of several AutoML libraries and tools in a variety of languages and software ecosystems. The hype surrounding these techniques has meant they have had a lot of airtime, which has led to several data scientists questioning when their jobs will be automated away. As we mentioned previously in this chapter, in my opinion, declaring the death of data science is extremely premature and also dangerous from an organizational and business performance standpoint. These tools have been given such a pseudo-mythical status that many companies could...

Auto-sklearn

One of our favorite libraries, good old scikit-learn, was always going to be one of the first targets for building a popular AutoML library. One of the very powerful features of auto-sklearn is that its API has been designed so that the main objects that optimize and select models and hyperparameters can be swapped seamlessly into your code.

As usual, an example will show this more clearly. In the following example, we will assume that the wine dataset (a favorite for this chapter) has already been retrieved and split into train and test samples in line with other examples, such as the one in the Detecting drift section:

  1. First, since this is a classification problem, the main thing we need to get from auto-sklearn is the autosklearn.classification object:
import numpy as np
import sklearn.datasets
import sklearn.metrics
import autosklearn.classification
  2. We must then define our auto-sklearn object. This provides several parameters that help us define how the model and...

Persisting your models

In the previous chapter, we introduced some of the basics of model version control using MLflow. In particular, we discussed how to log metrics for your ML experiments using the MLflow Tracking API. We are now going to build on this knowledge and consider the touchpoints our training systems should have with model control systems in general.

First, let's recap what we're trying to do with the training system. We want to automate (as far as possible) a lot of the work that was done by the data scientists in finding the first working model, so that we can continually update and create new model versions that still solve the problem in the future. We would also like to have a simple mechanism that allows the results of the training process to be shared with the part of the solution that will carry out the prediction when in production. We can think of our model version control system as a bridge between the different stages of the ML development process we...
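As a minimal illustration of this "bridge" idea (a toy sketch, not MLflow's actual API), a versioned model store can be reduced to two operations: training writes a new numbered version, and prediction loads whichever version is marked as latest. The `SimpleModelStore` class and its file layout are invented for this sketch:

```python
import json
import pickle
import tempfile
from pathlib import Path

class SimpleModelStore:
    """Toy stand-in for a model registry: each save creates a new
    numbered version, and consumers can always load the latest one."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, model):
        version = len(list(self.root.glob("v*.pkl"))) + 1
        with open(self.root / f"v{version}.pkl", "wb") as f:
            pickle.dump(model, f)
        # Record which version the prediction side should pick up.
        (self.root / "latest.json").write_text(json.dumps({"version": version}))
        return version

    def load_latest(self):
        version = json.loads((self.root / "latest.json").read_text())["version"]
        with open(self.root / f"v{version}.pkl", "rb") as f:
            return pickle.load(f)

store = SimpleModelStore(tempfile.mkdtemp())
store.save({"weights": [0.1, 0.2]})      # version 1
v = store.save({"weights": [0.3, 0.4]})  # version 2, now "latest"
print(v, store.load_latest())
```

Systems like MLflow add the pieces this sketch omits: metadata and metrics per version, stage transitions (staging, production, archived), and access control, but the training-to-serving handover they enable is the same.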

Building the model factory with pipelines

The concept of a software pipeline is intuitive enough. If you have a series of steps chained together in your code, so that the next step consumes or uses the output of the previous step or steps, then you have a pipeline.
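The idea can be shown in a few lines of plain Python; the two step functions below are invented stand-ins for real feature-engineering steps:

```python
# A pipeline in its simplest form: each step consumes the output
# of the previous step.
def impute_missing(rows):
    """Replace missing values with a default."""
    return [r if r is not None else 0.0 for r in rows]

def scale(rows):
    """Scale values into the [0, 1] range."""
    top = max(rows) or 1.0
    return [r / top for r in rows]

def run_pipeline(data, steps):
    for step in steps:
        data = step(data)
    return data

result = run_pipeline([4.0, None, 2.0], steps=[impute_missing, scale])
print(result)  # -> [1.0, 0.0, 0.5]
```

The ML pipeline tools we discuss next add exactly what this sketch lacks: fitted state per step, a uniform fit/transform interface, and the ability to persist and reuse the whole chain.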

In this section, when we refer to a pipeline, we will be specifically dealing with steps that contain processing or calculations that are appropriate to ML. For example, the following diagram shows how this concept may apply to some of the steps of the marketing classifier mentioned in Chapter 1, Introduction to ML Engineering:

Figure 3.11 – The main stages of any training pipeline and how this maps to a specific case from Chapter 1, Introduction to ML Engineering.

Let's discuss some of the standard tools for building up your ML pipelines in code.

Scikit-learn pipelines

Our old friend scikit-learn comes packaged with some nice pipelining functionality. At the time of writing, scikit-learn versions greater than...
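As a taste of what this looks like (assuming a reasonably recent scikit-learn; the choice of scaler and classifier here is illustrative), here is a small `Pipeline` on the wine dataset used elsewhere in this chapter:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each named step's output feeds the next; the final step is the estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```

Because the fitted scaler travels with the fitted classifier inside one object, persisting the pipeline persists the whole transformation chain, which is exactly what the model factory needs to hand over to the prediction side.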

Going Deep

In this chapter, we have so far been working with relatively ‘classical’ machine learning models, which rely on a variety of approaches to learning from data, often motivated by mathematical arguments from researchers. In general, these algorithms are not modelled on any biological theory of learning and are, at their heart, motivated by statistics and mathematics. A slightly different approach, which the reader will likely be aware of, and which we met briefly in the Learning about learning section, is that taken by Artificial Neural Networks (ANNs), which originated in the 1950s and were based on idealized models of neuronal activity in the brain. The core concept of an ANN is that by connecting relatively simple computational units called neurons (modelled on biological neurons), we can build systems that can effectively model any mathematical function. The neuron in this case is a small component of the system which will return a 0 or 1 result...
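That 0-or-1 neuron is essentially the classic perceptron. As an illustrative toy (not a modern ANN, which uses smooth activations and gradient-based training), here is a single perceptron learning the logical AND function with the perceptron update rule:

```python
# A single "neuron" that fires (returns 1) when the weighted sum of
# its inputs crosses a threshold, here expressed via a bias term.
def predict(weights, bias, inputs):
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

# Learn the logical AND function with the perceptron update rule.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # AND is linearly separable, so this converges
    for inputs, target in data:
        error = target - predict(weights, bias, inputs)
        weights = [w + lr * error * x for w, x in zip(weights, inputs)]
        bias += lr * error

print([predict(weights, bias, i) for i, _ in data])  # -> [0, 0, 0, 1]
```

A single neuron like this can only learn linearly separable functions; the power of ANNs comes from stacking many such units into layers, which is where the "deep" in deep learning comes from.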

Summary

In this chapter, we learned about the important topic of how to build up our solutions for training and surfacing the ML models that we want to run in production. We split the components of such a solution into pieces that tackled training the models, the persistence of the models, serving the models, and triggering retraining for the models.

We conducted a detailed investigation into the reasons why you may want to separate your training and running components for performance reasons. We then discussed how you can perform drift detection on your model performance and data statistics to understand whether retraining should be triggered. This included some examples of performing drift detection using the alibi-detect and evidently.ai packages. We then summarized some of the key concepts of feature engineering, or how you transform your data into something that an ML model can understand. We then went into a deep dive on how ML models learn and what you can control about that process...

