Chapter 8: Training Models with MLflow

In this chapter, you will learn about creating production-ready training jobs with MLflow. More broadly, we will focus on how to move from the notebook-based training jobs that we looked at in the early chapters to a standardized format and blueprint for creating training jobs.

Specifically, we will look at the following sections in this chapter:

  • Creating your training project with MLflow
  • Implementing the training job
  • Evaluating the model
  • Deploying the model in the Model Registry
  • Creating a Docker image for your training job

It's time to add training infrastructure to the pyStock machine learning (ML) platform, so that we can take the proof-of-concept models created in the workbench developed in Chapter 3, Your Data Science Workbench, to a production environment.

In this chapter, you will be developing a training project that runs periodically or when triggered by the arrival of a dataset...

Technical requirements

For this chapter, you will need the following prerequisites:

Creating your training project with MLflow

You receive a specification from a data scientist stating that the XGBoost model is ready to move from the proof-of-concept stage to the production phase.

We can review the original Jupyter notebook in which the data scientist initially registered the model; it is the natural starting point for creating an ML engineering pipeline. After initial prototyping and training in the notebook, the model is ready to move to production.

Some companies productionize the notebooks themselves directly, and this is definitely a possibility, but it becomes impractical for the following reasons:

  • It's hard to version notebooks.
  • It's hard to unit-test the code.
  • It's unreliable for long-running tests.

Instead, we will split the workflow into three distinct phases, mirroring the sections of this chapter: training the model, evaluating it, and registering it in the Model Registry. With these three distinct phases, we ensure the reproducibility of the training data-generation process, as well as visibility and a clear separation of the different steps of the process.

We will start by organizing our MLflow project...
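
To make the organization concrete, the following is a minimal sketch of an MLproject file for such a pipeline, with one entry point per phase. The entry-point names mirror the scripts used in this chapter; the exact file in the repository may differ:

    name: psystock_training
    conda_env: conda.yaml
    entry_points:
      main:
        command: "python main.py"
      train_model:
        command: "python train_model.py"
      evaluate_model:
        command: "python evaluate_model.py"
      register_model:
        command: "python register_model.py"

Structuring the project this way lets each phase run on its own (for example, mlflow run . -e evaluate_model) while the main entry point drives the full pipeline.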

Implementing the training job

We will use the training data produced in the previous chapter. The assumption here is that an independent job populates the data pipeline in a specific folder. In the book's GitHub repository, you can find the data at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter08/psystock-training/data/training/data.csv.

We will now create a train_model.py file that will be responsible for loading the training data, fitting a model, and producing it as output. Test predictions will be produced and persisted in the environment so that other steps of the workflow can use the data to evaluate the model.

The file produced in this section is available at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter08/psystock-training/train_model.py. Proceed as follows:

  1. We will start by importing the relevant packages. In this case, we will need pandas to handle the data, xgboost...
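
    The step is truncated here, but the following is a minimal sketch of what the complete script might look like. The file paths, the experiment name, and the assumption that the CSV holds feature columns followed by a binary label column are illustrative, not necessarily the repository's exact code:

    import os

    import pandas as pd
    import mlflow
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    if __name__ == "__main__":
        mlflow.set_experiment("psystock_training")  # assumed experiment name
        with mlflow.start_run(run_name="train_model") as run:
            mlflow.xgboost.autolog()
            # Assumed layout: feature columns followed by a binary label.
            df = pd.read_csv("data/training/data.csv")
            X, y = df.iloc[:, :-1], df.iloc[:, -1]
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.33, random_state=42)
            model = xgb.train(
                params={"objective": "binary:logistic"},
                dtrain=xgb.DMatrix(X_train, label=y_train))
            # Persist test predictions so the evaluation step can reuse them.
            y_pred = (model.predict(xgb.DMatrix(X_test)) > 0.5).astype(int)
            os.makedirs("data/predictions", exist_ok=True)
            pd.DataFrame({"y_pred": y_pred, "y_test": y_test.values}).to_csv(
                "data/predictions/test_predictions.csv", index=False)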

Evaluating the model

We will now move on to collecting evaluation metrics for our model, to be added to the model's metadata.

We will work on the evaluate_model.py file. You can follow along by working in an empty file or by going to https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter08/psystock-training/evaluate_model.py. Proceed as follows:

  1. Import the relevant packages—pandas and mlflow—for reading and running the steps, respectively. We will rely on importing a selection of model-evaluation metrics available in sklearn for classification algorithms, as follows:
    import pandas as pd
    import mlflow
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import  \
        classification_report, \
        confusion_matrix, \
        accuracy_score, \
        auc, \
        average_precision_score, \
       ...
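
    The snippet is truncated here; the following is a minimal sketch of how the rest of the script might proceed, assuming the training step persisted its test predictions to a CSV file (the path and the particular metrics logged are assumptions):

    if __name__ == "__main__":
        with mlflow.start_run(run_name="evaluate_model") as run:
            mlflow.set_tag("mlflow.runName", "evaluate_model")
            # Assumed: the training step wrote its test predictions here.
            df = pd.read_csv("data/predictions/test_predictions.csv")
            y_test, y_pred = df["y_test"], df["y_pred"]
            # Attach the evaluation metrics to the run's metadata.
            mlflow.log_metrics({
                "accuracy": accuracy_score(y_test, y_pred),
                "average_precision": average_precision_score(y_test, y_pred),
            })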

Deploying the model in the Model Registry

Next, you should add a register_model.py script to register the model in the Model Registry.

This is as simple as executing the mlflow.register_model method with the Uniform Resource Identifier (URI) of the model and the name of the model. Basically, a model will be created if it doesn't already exist. If it's already in the registry, a new version will be added, allowing the deployment tools to look at the models and trace the training jobs and metrics. It also allows a decision to be made as to whether to promote the model to production or not. The code you'll need is illustrated in the following snippet:

import mlflow
if __name__ == "__main__":
    
    with mlflow.start_run(run_name="register_model") as run:
        mlflow.set_tag("mlflow.runName", "register_model")
     ...
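
The snippet is truncated here; one plausible way to complete it, assuming the training run's ID is made available through the environment (the environment variable and the registered-model name below are illustrative placeholders):

    import os

    import mlflow

    if __name__ == "__main__":
        with mlflow.start_run(run_name="register_model") as run:
            mlflow.set_tag("mlflow.runName", "register_model")
            # Assumed: the training step's run ID was exported beforehand.
            model_uri = f"runs:/{os.environ['TRAINING_RUN_ID']}/model"
            result = mlflow.register_model(
                model_uri, "training-model-psystock")
            # A new version is created if the model already exists.
            print(f"Registered version {result.version} of {result.name}")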

Creating a Docker image for your training job

For a training job, a Docker image is, in many contexts, the most critical deliverable that a model developer hands over to a more specialized systems infrastructure team for production. The project is contained in the following folder of the repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter08/psystock-training-docker. In the following steps, we will produce a ready-to-deploy Docker image of the code produced:

  1. You need to set up a Dockerfile in the root folder of the project, as shown in the following code snippet:
    FROM continuumio/miniconda3:4.9.2
    RUN apt-get update && apt-get install build-essential -y
    RUN pip install \
        mlflow==1.18.0 \
        pymysql==1.0.2 \
        boto3
    COPY ./training_project /src
    WORKDIR /src
  2. Next, we will build the image by running the following command:
    docker build -t psystock_docker_training_image...
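
    Once the image is built, you can launch the training project inside it. The following is a hedged example; the completed build command and the mlflow run invocation are assumptions (--no-conda skips creating a nested conda environment, which suits the MLflow 1.18 image defined above):

    docker build -t psystock_docker_training_image .
    docker run -it psystock_docker_training_image \
        mlflow run . --no-conda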

Summary

In this chapter, we introduced the concepts and features involved in using MLflow to create production training processes.

We started by setting up the basic building blocks of the MLflow training project and then, in sequence, trained a model, evaluated the trained model, and registered it. We also delved into the creation of a ready-to-use image for your training job.

This was an important component of the architecture, and it will allow us to build an end-to-end production system for our ML system in production. In the next chapter, we will deploy different components and illustrate the deployment process of models.

Further reading

In order to further your knowledge, you can consult the official documentation at the following link:

https://www.mlflow.org/docs/latest/projects.html
