
You're reading from  Machine Learning Engineering with Python - Second Edition

Product type: Book
Published in: Aug 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837631964
Edition: 2nd Edition

Author: Andrew P. McMahon

Andrew P. McMahon has spent years building high-impact ML products across a variety of industries. He is currently Head of MLOps for NatWest Group in the UK and has a PhD in theoretical condensed matter physics from Imperial College London. He is an active blogger, speaker, podcast guest, and leading voice in the MLOps community. He is co-host of the AI Right podcast and was named ‘Rising Star of the Year' at the 2022 British Data Awards and ‘Data Scientist of the Year' by the Data Science Foundation in 2019.

Setting up our tools

To prepare for the work in the rest of this chapter, and indeed the rest of the book, it will be helpful to set up some tools. At a high level, we need the following:

  1. Somewhere to code
  2. Something to track our code changes
  3. Something to help manage our tasks
  4. Somewhere to provision infrastructure and deploy our solution

Let's look at how to approach each of these in turn:

  1. Somewhere to code: First, although the weapon of choice for coding among data scientists is of course the Jupyter notebook (other solutions are available), once you begin to make the move toward ML engineering, it will be important to have an Integrated Development Environment (IDE) to hand. An IDE is an application that comes with a series of built-in tools and capabilities to help you develop the best software that you can. PyCharm is an excellent example for Python developers and comes with a wide variety of plugins, add-ons, and integrations useful to the ML engineer. You can download...

Setting up an AWS account

As previously stated, you don't have to use AWS, but it is what we will use throughout this book. Once your account is set up, you can use it for everything we do:

  1. To set up an AWS account, navigate to aws.amazon.com and select Create Account. You will have to add some payment details, but everything we mention in this book can be explored through the AWS free tier, where you incur no cost as long as you stay below set consumption thresholds.
  2. Once you have created your account, you can navigate to the AWS Management Console, where you can see all of the services that are available to you (see Figure 2.5):

    Figure 2.5 – The AWS Management Console
  3. Finally, there would be no ML engineering without ML models. So, the final piece of software you should install is one that will help you track and serve your models in a consistent way. For this, we will use MLflow, an open source platform from Databricks and under the stewardship of...

Concept to solution in four steps

All ML projects are unique in some way: the organization, the data, the people, and the tools and techniques employed will never be exactly the same for any two projects. This is good, as it signifies progress as well as the natural variety that makes this such a fun space to work in.

That said, no matter the details, broadly speaking, all successful ML projects actually have a good deal in common. They require translation of a business problem into a technical problem, a lot of research and understanding, proofs of concept, analyses, iterations, consolidation of work, construction of the final product, and deployment to an appropriate environment. That is ML engineering in a nutshell!

Developing this a bit further, you can start to bucket these activities into rough categories or stages, the results of each being necessary inputs for later stages. This is shown in Figure 2.6:

Figure 2.6 – The stages that any ML project goes through as part of the ML development process

Summary

This chapter was all about building a solid foundation for future work. We discussed the development steps common to all ML engineering projects, which we called discover, play, develop, deploy. In particular, we outlined the aim of each of these steps and their desired outputs.

This was followed by a high-level discussion of tooling, and a walkthrough of the main setup steps. We set up the tools for developing our code, keeping track of the changes of that code, managing our ML engineering project, and finally, deploying our solutions.

In the rest of the chapter, we went through the details for each of the four stages we outlined previously, with a particular focus on the develop and deploy stages. Our discussion covered everything from the pros and cons of Waterfall and Agile development methodologies to environment management and then software development best practices. We also discussed how to apply testing to our ML code. We finished off with an exploration of how to package...

Engineering features for machine learning

Before we feed any data into an ML model, it has to be transformed into a state that can be understood by our models. We also need to make sure we only do this on the data we deem useful for improving the performance of the model, as it is far too easy to explode the number of features and fall victim to the curse of dimensionality. This refers to a series of related observations where, in high-dimensional problems, data becomes increasingly sparse in the feature space, so achieving statistical significance can require exponentially more data. In this section, we will not cover the theoretical basis of feature engineering. Instead, we will focus on how we, as ML engineers, can help automate some of the steps in production. To this end, we will quickly recap the main types of feature preparation and feature engineering steps so that we have the necessary pieces to add to our pipelines later in this chapter.
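As a quick refresher on what these steps look like in practice, here is a minimal sketch of two of the most common transformations, one-hot encoding a categorical feature and standardizing a numerical one, using scikit-learn (the data and column choices are made up for illustration):

```python
# A quick sketch of two common feature preparation steps: one-hot
# encoding a categorical column and standardizing a numerical one.
# The data here is invented purely for illustration.
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categories = [["red"], ["green"], ["blue"], ["green"]]
amounts = [[10.0], [20.0], [30.0], [40.0]]

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(categories).toarray()  # 4 rows x 3 columns

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)  # zero mean, unit variance

print(one_hot.shape, float(scaled.mean()))
```

Note that one-hot encoding creates one binary column per category, which is exactly how feature counts can explode: a column with thousands of distinct values becomes thousands of features.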

Engineering categorical features...

Designing your training system

Viewed at the highest level, ML models go through a life cycle with two stages: a training phase and a prediction phase. During the training phase, the model is fed data and learns from it. In the prediction phase, the model, complete with its optimized parameters, is fed new data and returns the desired output.

These two phases have very different computational and processing requirements. In the training phase, we have to expose the model to as much data as we can to gain the best performance, all while ensuring subsets of data are kept aside for testing and validation. Model training is fundamentally an optimization problem, which requires several incremental steps to get to a solution.
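To make those incremental steps concrete, here is a toy sketch (not the book's code) of fitting a single parameter by gradient descent on a mean squared error:

```python
# A toy illustration of training as optimization: fit y = w * x to data
# generated with w = 2 by taking incremental gradient descent steps on
# the mean squared error. Purely illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

w = 0.0             # initial guess for the parameter
learning_rate = 0.01

for step in range(500):
    # Gradient of MSE = (1/n) * sum((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # one incremental step toward the optimum

print(round(w, 3))  # converges close to 2.0
```

Each step moves the parameter a little further toward the optimum; real training runs do the same thing across millions of parameters and many passes over the data, which is where the computational demand comes from.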

Therefore, this is computationally demanding, and in cases where the data is relatively large (or compute resources are relatively low), it can take a long time. Even if you had a small dataset and a lot of computational resources, training is...

Retraining required

You wouldn’t expect to finish your education and then never again read a paper or book or speak to anyone; if you did, you would soon be unable to make informed decisions about what is happening in the world. In the same way, you shouldn’t expect an ML model to be trained once and then remain performant forever afterward.

This idea is intuitive, but it represents a formal problem for ML models known as drift. Drift is a term that covers a variety of reasons for your model’s performance dropping over time. It can be split into two main types:

  • Concept drift: This happens when there is a change in the fundamental relationship between the features of your data and the outcome you are trying to predict (as opposed to covariate drift, where it is the distribution of the input features themselves that changes). An example could be that at the time of training, you only have a subsample of data that seems to show a linear relationship between the features and your outcome. If it turns out that...
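Whichever form drift takes, the practical task is to compare newly arriving data or predictions against a training-time baseline. The following is a deliberately crude, stdlib-only sketch; production systems would use proper statistical tests, and the threshold here is illustrative:

```python
# A crude drift flag: compare the mean of a feature on new data with
# its mean on the training data, measured in units of the training
# standard deviation. Data and threshold are illustrative only.
import statistics

def drifted(train_values, new_values, threshold=3.0):
    """Flag drift if the new mean is more than `threshold` training
    standard deviations away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(new_values) - mu) / sigma
    return shift > threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8]
stable = [10.1, 9.9, 10.3]
shifted = [25.0, 26.0, 24.5]

print(drifted(train, stable), drifted(train, shifted))
```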

Persisting your models

In the previous chapter, we introduced some of the basics of model version control using MLflow. In particular, we discussed how to log metrics for your ML experiments using the MLflow Tracking API. We are now going to build on this knowledge and consider the touchpoints our training systems should have with model control systems in general.

First, let’s recap what we’re trying to do with the training system. We want to automate (as far as possible) a lot of the work that was done by the data scientists in finding the first working model, so that we can continually update and create new model versions that still solve the problem in the future. We would also like to have a simple mechanism that allows the results of the training process to be shared with the part of the solution that will carry out the prediction when in production. We can think of our model version control system as a bridge between the different stages of the ML development...
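MLflow handles all of this for us, but it is worth seeing the underlying idea in its simplest form: persist the trained artifact alongside some version metadata, so that the serving side can load exactly what training produced. A stdlib-only sketch, with illustrative paths and metadata fields:

```python
# The bare-bones idea behind model persistence: serialize the trained
# artifact together with version metadata so a separate serving process
# can load exactly the model that training produced. MLflow does this
# (and much more); this sketch only shows the underlying concept.
import json
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model: a dict of learned parameters
model = {"weights": [0.5, -1.2], "bias": 0.1}
metadata = {"version": "1.0.0", "metric_accuracy": 0.92}

artifact_dir = Path(tempfile.mkdtemp())
(artifact_dir / "model.pkl").write_bytes(pickle.dumps(model))
(artifact_dir / "metadata.json").write_text(json.dumps(metadata))

# ...later, the serving side loads the same artifact back
loaded = pickle.loads((artifact_dir / "model.pkl").read_bytes())
print(loaded == model)
```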

Building the model factory with pipelines

The concept of a software pipeline is intuitive enough. If you have a series of steps chained together in your code, so that the next step consumes or uses the output of the previous step or steps, then you have a pipeline.
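In its barest form, the idea looks like this (the step functions are toy examples):

```python
# The pipeline idea in its simplest form: each step consumes the
# output of the previous one. The steps here are toy examples.
def clean(values):
    return [v for v in values if v is not None]

def scale(values):
    top = max(values)
    return [v / top for v in values]

def summarize(values):
    return sum(values) / len(values)

pipeline = [clean, scale, summarize]

data = [2.0, None, 4.0, 8.0]
for step in pipeline:
    data = step(data)  # each step feeds the next

print(data)
```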

In this section, when we refer to a pipeline, we will be specifically dealing with steps that contain processing or calculations that are appropriate to ML. For example, the following diagram shows how this concept may apply to some of the steps the marketing classifier mentioned in Chapter 1, Introduction to ML Engineering:

Figure 3.13: The main stages of any training pipeline and how this maps to a specific case from Chapter 1, Introduction to ML Engineering.

Let’s discuss some of the standard tools for building up your ML pipelines in code.

Scikit-learn pipelines

Our old friend Scikit-Learn comes packaged with some nice pipelining functionality. The API is extremely easy to use, as you...
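To give a flavor of that API, here is a minimal example chaining a scaler and a classifier into a single object (the tiny dataset is invented for illustration):

```python
# A minimal scikit-learn pipeline: feature scaling chained with a
# classifier, trained and applied through one object. The dataset is
# invented purely for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # step 1: standardize features
    ("clf", LogisticRegression()),  # step 2: fit the classifier
])

pipeline.fit(X, y)                  # runs both steps in order
predictions = pipeline.predict([[1.5], [11.5]])
print(list(predictions))
```

Because the scaler and the model live in one object, the same transformations are guaranteed to be applied at training time and at prediction time.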

Summary

In this chapter, we learned about the important topic of how to build up our solutions for training and staging the ML models that we want to run in production. We split the components of such a solution into pieces that tackled training the models, the persistence of the models, serving the models, and triggering retraining for the models. I termed this the “Model Factory.”

We got into the more technical details of some important concepts with a deep dive into what training an ML model really means, which we framed as understanding how ML models learn. Some time was then spent on the key concepts of feature engineering, that is, how you transform your data into something an ML model can understand during this process. This was followed by sections on how to think about the different modes your training system can run in, which I termed “train-persist” and “train-run.”

We then discussed how you can perform drift detection on...
