You're reading from Machine Learning Engineering with MLflow

Product type: Book

Published in: Aug 2021

Publisher: Packt

ISBN-13: 9781800560796

Edition: 1st Edition


Author (1)

Natu Lauchande

Exploring MLflow modules

MLflow delivers its core features through modules: extensible software components that group related functionality and support the different phases of the ML life cycle.

The following are the built-in modules in MLflow:

MLflow Tracking: Provides a mechanism and UI to handle metrics and artifacts generated by ML executions (training and inference)
MLflow Projects: A package format to standardize ML projects
MLflow Models: A mechanism for packaging models and deploying them to different types of environments, both on-premises and in the cloud
MLflow Model Registry: A module that manages models in MLflow and their life cycle, including state

In order to explore the different modules, we will install MLflow in your local environment using the following command:

pip install mlflow

Important note

To follow along, the technical requirements must be correctly installed on your local machine. If the installation fails because of access rights, run the pip command with the required permissions.

Exploring MLflow projects

An MLflow project represents the basic unit of organization of ML projects. There are three different environments supported by MLflow projects: the Conda environment, Docker, and the local system.

Important note

More details of the different parameters available in an MLproject file can be found in the official documentation at https://www.mlflow.org/docs/latest/projects.html#running-projects.

The following is an example of an MLproject file of a conda environment:

name: condapred
conda_env: conda.yaml
entry_points:
  main:
    command: "python mljob.py"

In the conda option, the assumption is that there is a conda.yaml file with the required dependencies. MLflow, when asked to run the project, will start the environment with the specified dependencies.

The system-based environment will look like the following; it's actually quite simple:

name: syspred
entry_points:
  main:
    command: "python mljob.py"

The preceding system variant relies on the local environment, assuming that the underlying operating system already contains all the dependencies. This approach is particularly prone to library conflicts with the underlying operating system, but it can be valuable where an existing operating system environment already fits the project.

The following is a Docker environment-based MLproject file:

name: syspred
docker_env:
  image: stockpred-docker
entry_points:
  main:
    command: "python mljob.py"

Once you have your environment, the main file that defines how your project should look is the MLproject file. MLflow uses this file to understand how it should run your project.
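To make the difference between the three variants concrete, the sketch below represents each MLproject file as a plain Python dictionary and reports which environment it selects. The helper `detect_environment` is hypothetical: it only mirrors the rule described in this section (a `docker_env` key means Docker, a `conda_env` key means conda, neither means the local system) and is not MLflow's own loader.

```python
# Hypothetical helper: not part of MLflow, just an illustration of the
# three environment variants described in this section.

def detect_environment(project: dict) -> str:
    """Return 'docker', 'conda', or 'system' for a parsed MLproject mapping."""
    if "docker_env" in project:
        return "docker"
    if "conda_env" in project:
        return "conda"
    # No environment key: rely on whatever is installed locally.
    return "system"


# Dictionaries loosely mirroring the three MLproject examples shown earlier.
entry_points = {"main": {"command": "python mljob.py"}}
conda_project = {
    "name": "condapred",
    "conda_env": "conda.yaml",
    "entry_points": entry_points,
}
system_project = {"name": "syspred", "entry_points": entry_points}
docker_project = {
    "name": "syspred",
    "docker_env": {"image": "stockpred-docker"},
    "entry_points": entry_points,
}

for project in (conda_project, system_project, docker_project):
    print(project["name"], "->", detect_environment(project))
```

The entry point is identical in all three cases; only the environment key changes, which is what lets the same training command move between a laptop, a conda environment, and a container.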

Developing your first end-to-end pipeline in MLflow

We will prototype a simple stock prediction project in this section with MLflow and will document the different files and phases of the solution. You will develop it on your local system using locally installed MLflow and Docker.

Important note

In this section, we are assuming that MLflow and Docker are installed locally, as the steps in this section will be executed in your local environment.

The task in this illustrative project is to create a basic MLflow project and produce a working baseline ML model to predict, based on market signals over a certain number of days, whether the stock market will go up or down.

In this section, we will use a Yahoo Finance dataset quoting the BTC-USD pair, available at https://finance.yahoo.com/quote/BTC-USD/, over a period of 3 months. We will train a model to predict whether the quote will go up on a given day. A REST API will be made available for predictions through MLflow.

We will illustrate, step by step, the creation of an MLflow project to train a classifier on stock data, retrieving financial information from the Yahoo API with the pandas_datareader package:

Add your MLproject file:
```
name: stockpred
docker_env:
  image: stockpred-docker
entry_points:
  main:
    command: "python train.py"
```
The preceding MLproject file specifies that dependencies will be managed in Docker with a specific image name. MLflow will look for the image in the Docker installation on your system. If the image is not found locally, Docker will try to retrieve it from Docker Hub. For the goals of this chapter, it is completely fine to have MLflow running on your local machine.
The second configuration that we add to our project is the main entry point command. This command runs the train.py Python file, which contains the code of our project, inside the Docker environment.
Add a Dockerfile to the project.
Additionally, you can specify the Docker registry URL of your image. The advantage of running Docker is that your project is not bound to the Python language, as we will see in the advanced section of this book. The MLflow API is available through a REST interface alongside the official clients for Python, Java, and R:
```
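# Base image: a minimal conda/Python installation
FROM continuumio/miniconda:4.5.4
# The last specifier is quoted so the shell does not treat ">" as a redirect
RUN pip install mlflow==1.11.0 \
    && pip install numpy==1.14.3 \
    && pip install scipy \
    && pip install pandas==0.22.0 \
    && pip install scikit-learn==0.20.4 \
    && pip install cloudpickle \
    && pip install "pandas_datareader>=0.8.0"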
```
The preceding Dockerfile is based on the open source Miniconda image. Miniconda is a free minimal installer that ships with a small set of packages, which lets us control exactly which packages go into our environment.
We specify the version of MLflow (our ML platform), and install numpy and scipy for numerical calculations and scikit-learn for the classifier. Cloudpickle allows us to easily serialize objects. We will use pandas to manage data frames, and pandas_datareader to easily retrieve the data from public sources.
Import the packages required for the project.
In the following listing, we explicitly import all the libraries that we will use during the execution of the training script: the library to read the data, and the different sklearn modules related to the chosen initial ML model:
```
import numpy as np
import datetime
import pandas_datareader.data as web
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import mlflow
import mlflow.sklearn
```
For the stock market movement detection problem, we explicitly chose a RandomForestClassifier because it is an extremely versatile and widely accepted baseline model for classification problems.
Acquire your training data.
The code that acquires the Yahoo Finance dataset is intentionally small: we choose a fixed interval of 3 months to train our classifier.
The acquire_training_data method returns a pandas data frame with the relevant dataset:
```
def acquire_training_data():
    start = datetime.datetime(2019, 7, 1)
    end = datetime.datetime(2019, 9, 30)
    df = web.DataReader("BTC-USD", 'yahoo', start, end)
    return df
```
The data follows the classic format for financial securities in exchange APIs. For every day of the period, we retrieve the highest, lowest, opening, and closing values of the stock, as well as the volume. The final column holds the adjusted close, that is, the closing value adjusted for dividends and splits:
Figure 1.1 – Excerpt from the acquired data
Figure 1.2 illustrates the target variable that we want to produce with the current data preparation process:
Figure 1.2 – Excerpt from the acquired data with the prediction column

Make the data usable by scikit-learn.

The data acquired in the preceding step is clearly not directly usable by RandomForestClassifier, which works well with categorical features. To make it usable, we will transform the raw data into a feature vector using the rolling window technique.

Basically, the feature vector for each day is built from the market movements of the days in the preceding window, where each movement is binarized (1 for a stock going up, 0 otherwise):

def digitize(n):
    # 1 if the value is positive (an up day), 0 otherwise
    if n > 0:
        return 1
    return 0


def rolling_window(a, window):
    """
        Takes np.array 'a' and size 'window' as parameters
        Outputs an np.array with all the ordered sequences of values of 'a' of size 'window'
        e.g. Input: ( np.array([1, 2, 3, 4, 5, 6]), 4 )
             Output:
                     array([[1, 2, 3, 4],
                           [2, 3, 4, 5],
                           [3, 4, 5, 6]])
    """
    # Build a strided view over 'a' instead of copying each window
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def prepare_training_data(data):
    # Daily movement: close minus open
    data['Delta'] = data['Close'] - data['Open']
    # Binary target: 1 for an up day, 0 otherwise
    data['to_predict'] = data['Delta'].apply(digitize)
    return data

The following example is illustrative of the data frame output produced with the binarized ups and downs of the previous days:

Figure 1.3 – Feature vector with binarized market ups and downs
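The alignment between feature windows and labels is easy to get wrong, so here is a small worked example. It is a sketch that uses plain Python lists instead of NumPy strides, with a made-up sequence of daily movements and a window of 3 rather than the 14 used in the project. The names `windows`, `make_dataset`, and `movements` are illustrative.

```python
def windows(values, size):
    """All consecutive runs of length `size`, in order."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]


def make_dataset(movements, size):
    """Pair each window of past movements with the movement of the following day."""
    features = windows(movements, size)[:-1]  # the last window has no following day
    targets = movements[size:]                # the day right after each window
    return features, targets


# Made-up daily movements: 1 = close above open, 0 = otherwise.
movements = [1, 0, 0, 1, 1, 0]
X, y = make_dataset(movements, size=3)

for row, label in zip(X, y):
    print(row, "->", label)
# [1, 0, 0] -> 1
# [0, 0, 1] -> 1
# [0, 1, 1] -> 0
```

Dropping the last window and skipping the first `size` labels keeps both lists the same length, so every row of features is matched with the day that immediately follows it.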

Train and store your model in MLflow.

The following code listing calls the data preparation methods declared previously and runs the training process.

The main execution also explicitly logs the ML model trained in the current execution in the MLflow environment.

if __name__ == "__main__":
    with mlflow.start_run():
        training_data = acquire_training_data()
        prepared_training_data_df = prepare_training_data(training_data)
        # .values is used instead of as_matrix(), which was removed in pandas 1.0
        btc_mat = prepared_training_data_df.values
        WINDOW_SIZE = 14
        # Column 7 is expected to be 'to_predict'; each row of X holds the
        # movements of the WINDOW_SIZE days before the day being predicted
        X = rolling_window(btc_mat[:, 7], WINDOW_SIZE)[:-1, :]
        Y = prepared_training_data_df['to_predict'].values[WINDOW_SIZE:]
        X_train, X_test, y_train, y_test = train_test_split(
            X, Y, test_size=0.25, random_state=4284, stratify=Y)
        clf = RandomForestClassifier(
            bootstrap=True, criterion='gini', min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50,
            random_state=4284, verbose=0)
        clf.fit(X_train, y_train)
        predicted = clf.predict(X_test)
        # Print the per-label report that appears in the run output
        print(classification_report(y_test, predicted))
        mlflow.sklearn.log_model(clf, "model_random_forest")
        mlflow.log_metric("precision_label_0", precision_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("recall_label_0", recall_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("f1score_label_0", f1_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("precision_label_1", precision_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("recall_label_1", recall_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("f1score_label_1", f1_score(y_test, predicted, pos_label=1))

The mlflow.sklearn.log_model(clf, "model_random_forest") method takes care of persisting the model upon training. In contrast to the previous example, we are explicitly asking MLflow to log the model and the metrics that we find relevant. This flexibility in the items to log allows one program to log multiple models into MLflow.

In the end, your project layout should look like the following, based on the files created previously:

├── Dockerfile
├── MLproject
├── README.md
└── train.py

Build your project's Docker image.
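To build your Docker image, run the following command from the project directory:
```
docker build -t stockpred -f Dockerfile .
```
This will build the image specified previously with the stockpred tag. The tag must match the image name under docker_env in your MLproject file, so if you kept image: stockpred-docker from the earlier listing, use that name as the tag instead. After the build, the image is available in your local Docker image store, and MLflow can use it in the subsequent steps.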
Following execution of this command, you should expect a successful Docker build:
```
---> 268cb080fed2
Successfully built 268cb080fed2
Successfully tagged stockpred:latest
```

Run your project.

In order to run your project, you can now run the MLflow project:

mlflow run .

Your output should look similar to the excerpt presented here:

MLFLOW_EXPERIMENT_ID=0 stockpred:3451a1f python train.py' in run with ID '442275f18d354564b6259a0188a12575' ===
              precision    recall  f1-score   support
           0       0.61      1.00      0.76        11
           1       1.00      0.22      0.36         9
    accuracy                           0.65        20
   macro avg       0.81      0.61      0.56        20
weighted avg       0.79      0.65      0.58        20
2020/10/15 19:19:39 INFO mlflow.projects: === Run (ID '442275f18d354564b6259a0188a12575') succeeded ===

The output contains the ID of your experiment, the ID of the run, and the classification report with the metrics captured during the current run.
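The per-label numbers in the report follow directly from the confusion counts. As a sketch, the label lists below are reconstructed to be consistent with the report above (11 test days labeled 0, 9 labeled 1, with only 2 of the 9 up days predicted as up). They are not the actual test rows, and the helper `scores` is illustrative rather than scikit-learn's implementation.

```python
def scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one label, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Reconstructed labels: every down day is predicted correctly,
# but only 2 of the 9 up days are predicted as up.
y_true = [0] * 11 + [1] * 9
y_pred = [0] * 11 + [1] * 2 + [0] * 7

for label in (0, 1):
    precision, recall, f1 = scores(y_true, y_pred, label)
    print(f"label {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}")
```

Read this way, the baseline mostly predicts "down": it catches every down day but misses most up days, which is why recall for label 1 is so low even though its precision is perfect.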

At this stage, you have a simple, reproducible baseline of a stock predictor pipeline using MLflow that you can improve on and easily share with others.

Re-running experiments

Another extremely useful feature of MLflow is the ability to re-run a specific experiment with the same parameters as it was run with originally.

For instance, you should be able to run your previous project by specifying the GitHub URL of the project:

mlflow run https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter01/stockpred

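With the previous command, MLflow clones the repository to a temporary directory and executes it according to the recipe in the MLproject file. When a project lives in a subdirectory of a repository, the form documented by MLflow is the repository URI followed by # and the relative path to the project.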

The ID of the experiment (or the name) allows you to run the project with the original parameters, thereby enabling complete reproducibility of the project.

The MLflow projects feature allows your project to run in advanced cloud environments such as Kubernetes and Databricks. Scaling your ML job seamlessly is one of the main selling points of a platform such as MLflow.

As you have seen from the current section, the MLflow project module allows the execution of a reproducible ML job that is treated as a self-contained project.

Exploring MLflow tracking

The MLflow tracking component is responsible for observability. The main features of this module are the logging of metrics, artifacts, and parameters of an MLflow execution. It also provides visualizations and artifact management features.

In a production setting, it is used as a centralized tracking server implemented in Python that can be shared by a group of ML practitioners in an organization. This enables improvements in ML models to be shared within the organization.

In Figure 1.4, you can see an interface that lists all the runs of your model and lets you log your experiment's observables (metrics, files, models, and artifacts). For each run, you can inspect and compare the different metrics and parameters of your model.

It addresses common pain points when model developers are comparing different iterations of their models on different parameters and settings.

The following screenshot presents the different metrics for our last run of the previous model:

Figure 1.4 – Sample of the MLflow interface/UI

MLflow allows the inspection of arbitrary artifacts associated with each model and its associated metadata, allowing metrics of different runs to be compared. You can see the RUN IDs and the Git hash of the code that generated the specific run of your experiment:

Figure 1.5 – Inspecting logged model artifacts

In your current directory of stockpred, you can run the following command to have access to the results of your runs:

mlflow ui

Running the MLflow UI locally will make it available at the following URL: http://127.0.0.1:5000/.

In the particular case of the runs shown in the following screenshot, we have a named experiment where the parameter of the size of the window in the previous example was tweaked. Clear differences can be seen between the performance of the algorithms in terms of F1 score:

Figure 1.6 – Listing of MLflow runs
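Changing the window size also changes how much data the classifier sees. The sketch below estimates the number of samples for a few window sizes. It assumes one row per calendar day between the start and end dates used in acquire_training_data (the real download may differ by a row) and the 25% test split from the training script. The helper name `sample_counts` is illustrative.

```python
import datetime
import math


def sample_counts(n_days, window, test_size=0.25):
    """Return (total, train, test) sample counts for a given window size."""
    total = n_days - window              # one sample per day with a full window before it
    test = math.ceil(test_size * total)  # train_test_split rounds the test share up
    return total, total - test, test


start = datetime.date(2019, 7, 1)
end = datetime.date(2019, 9, 30)
n_days = (end - start).days + 1  # both endpoints included

for window in (7, 14, 21):
    total, train, test = sample_counts(n_days, window)
    print(f"window={window:2d} samples={total} train={train} test={test}")
```

Under these assumptions, a window of 14 gives 78 samples and 20 test rows, which agrees with the total support of 20 in the classification report shown earlier. Larger windows give each sample more history but leave fewer samples to learn from.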

Another very useful feature of MLflow tracking is the ability to compare different runs of jobs:

Figure 1.7 – Comparison of F1 metrics of job runs

The preceding visualization helps a practitioner decide which model to use in production or whether to iterate further.

Exploring MLflow Models

MLflow Models is the core component that handles the different model flavors supported in MLflow and mediates deployment into different execution environments.

We will now delve into the different models supported in the latest version of MLflow.

As shown in the Getting started with MLflow section, MLflow models have a specific serialization approach for when the model is persisted in its internal format. For example, the serialized folder of the model implemented on the stockpred project would look like the following:

├── MLmodel
├── conda.yaml
└── model.pkl

Internally, MLflow persists sklearn models as a pickled model, as logged by the source code, together with a conda file describing the dependencies at the time of the run. The MLmodel file ties these together:

artifact_path: model_random_forest
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.6
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.2
run_id: 22c91480dc2641b88131c50209073113
utc_time_created: '2020-10-15 20:16:26.619071'

For this model, MLflow records two flavors, namely python_function and sklearn. A flavor is basically a format that tools or serving environments use to load and run the model.
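One way to picture flavors: each tool looks in the MLmodel metadata for a flavor it knows how to load. The sketch below copies the flavors section of the file above into a Python dictionary and picks the first flavor a tool supports. `pick_flavor` and the preference lists are hypothetical and do not reproduce MLflow's loading code.

```python
# Flavors section of the MLmodel file shown above, as a Python dictionary.
mlmodel = {
    "artifact_path": "model_random_forest",
    "flavors": {
        "python_function": {
            "env": "conda.yaml",
            "loader_module": "mlflow.sklearn",
            "model_path": "model.pkl",
            "python_version": "3.7.6",
        },
        "sklearn": {
            "pickled_model": "model.pkl",
            "serialization_format": "cloudpickle",
            "sklearn_version": "0.23.2",
        },
    },
}


def pick_flavor(metadata, supported):
    """Return the first flavor name in `supported` that the model provides, else None."""
    for name in supported:
        if name in metadata["flavors"]:
            return name
    return None


# A generic serving tool only needs python_function; a scikit-learn-aware
# tool might prefer the native sklearn flavor (hypothetical preferences).
print(pick_flavor(mlmodel, ["python_function"]))
print(pick_flavor(mlmodel, ["sklearn", "python_function"]))
print(pick_flavor(mlmodel, ["tensorflow"]))
```

The point of the generic python_function flavor is that a tool does not need to know which library trained the model in order to serve it.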

A good example of using the preceding is being able to serve your model without any extra code by executing the following command:

mlflow models serve -m ./mlruns/0/b9ee36e80a934cef9cac3a0513db515c/artifacts/model_random_forest/

You have access to a very simple web server that can run your model. Your model prediction interface can be executed by running the following command:

curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{"data":[[1,1,1,1,0,1,1,1,0,1,1,1,0,0]]}'

The server responds with the following:

[1]

The response to the API call to our model was 1; as defined in our predicted variable, this means that in the next reading, the stock will move up.
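The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the server started by mlflow models serve is listening on http://127.0.0.1:5000 and accepts the same {"data": ...} payload as the curl call above (other MLflow versions may expect a different JSON layout). The function names are illustrative.

```python
import json
import urllib.request

INVOCATIONS_URL = "http://127.0.0.1:5000/invocations"  # assumed local server address


def build_request(window, url=INVOCATIONS_URL):
    """Build a POST request carrying one feature window as JSON."""
    body = json.dumps({"data": [window]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(window, url=INVOCATIONS_URL):
    """Send the window to the model server and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(window, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# The 14 most recent daily movements, as in the curl example.
last_14_days = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
request = build_request(last_14_days)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the model server running, calling `predict(last_14_days)` should return the same prediction as the curl command. The request is built separately from the network call so the payload can be inspected without a server.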

The final few steps outline how powerful MLflow is as an end-to-end tool for model development, including for the prototyping of REST-based APIs for ML services.

The MLflow Models component allows the creation of custom-made Python modules that will have the same benefits as the built-in models, as long as a prediction interface is followed.

Some of the notable model types supported will be explored in upcoming chapters, including the following:

XGBoost model format
R functions
H2O model
Keras
PyTorch
Sklearn
Spark MLlib
TensorFlow
Fastai

Support for the most prevalent ML types of models, combined with its built-in capability for on-premises and cloud deployment, is one of the strongest features of MLflow Models. We will explore this in more detail in the deployment-related chapters.

Exploring MLflow Model Registry

The model registry component in MLflow gives the ML developer an abstraction for model life cycle management. It is a centralized store for an organization or function that allows models in the organization to be shared, created, and archived collaboratively.

Models can be managed through the different MLflow APIs and through the UI. Figure 1.8 demonstrates the Artifacts UI in the tracking server that can be used to register a model:

Figure 1.8 – Registering a model as an artifact

Upon registering the model, you can annotate the registered model with the relevant metadata and manage its life cycle. One example is to have models in a staging pre-production environment and manage the life cycle by sending the model to production:

Figure 1.9 – Managing different model versions and stages

The model registry module will be explored further in the book, with details on how to set up a centralized server and manage ML model life cycles, from conception through to phasing out a model.


Author (1)

Natu Lauchande

Natu Lauchande is a principal data engineer in the fintech space currently tackling problems at the intersection of machine learning, data engineering, and distributed systems. He has worked in diverse industries, including biomedical/pharma research, cloud, fintech, and e-commerce/mobile. Along the way, he had the opportunity to be granted a patent (as co-inventor) in distributed systems, publish in a top academic journal, and contribute to open source software. He has also been very active as a speaker at machine learning/tech conferences and meetups.