Machine Learning Engineering with MLflow


Product type Book
Published in Aug 2021
Publisher Packt
ISBN-13 9781800560796
Pages 248 pages
Edition 1st Edition
Author: Natu Lauchande


Getting started with MLflow

Next, we will install MLflow on your machine and prepare it for use in this chapter. You will have two options when it comes to installing MLflow. The first option is through a Docker container-based recipe provided in the repository of the book: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git.

To install it, follow these instructions:

  1. Use the following commands to clone the repository and move into the folder for this chapter:
    $ git clone https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git
    $ cd Machine-Learning-Engineering-with-Mlflow
    $ cd Chapter01
  2. The Docker image is very simple at this stage: it simply contains MLflow and sklearn, the main tools to be used in this chapter of the book. For illustrative purposes, you can look at the content of the Dockerfile:
    FROM jupyter/scipy-notebook
    RUN pip install mlflow
    RUN pip install scikit-learn
  3. To build the image, you should now run the following command:
    docker build -t chapter_1_homlflow .
  4. Right after building the image, you can run the ./run.sh command:
    ./run.sh

    Important note

    It is important to ensure that you have the latest version of Docker installed on your machine.

  5. Open your browser to http://localhost:8888 and you should be able to navigate to the Chapter01 folder.

In the following section, we will be developing our first model with MLflow in the Jupyter environment created in the previous set of steps.

Developing your first model with MLflow

To keep things simple, in this section we will use the sample datasets built into sklearn, the ML library that we will use initially to explore MLflow features. We will choose the famous Iris dataset to train a multi-class classifier using MLflow.

The Iris dataset (one of sklearn's built-in datasets available from https://scikit-learn.org/stable/datasets/toy_dataset.html) contains the following elements as features: sepal length, sepal width, petal length, and petal width. The target variable is the class of the iris: Iris Setosa, Iris Versicolor, or Iris Virginica:

  1. Load the sample dataset:
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    dataset = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4)
  2. Next, let's train your model.

    Training a simple machine learning model with a framework such as scikit-learn involves instantiating an estimator such as LogisticRegression and calling its fit method to execute training over the Iris dataset built into scikit-learn:

    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    The preceding lines of code are just a small portion of the ML Engineering process. As will be demonstrated, a non-trivial amount of code needs to be created in order to productionize and make sure that the preceding training code is usable and reliable. One of the main objectives of MLflow is to aid in the process of setting up ML systems and projects. In the following sections, we will demonstrate how MLflow can be used to make your solutions robust and reliable.

  3. Then, we will add MLflow.

    With a few more lines of code, you should be able to start your first MLflow interaction. In the following code listing, we start by importing the mlflow module, followed by the LogisticRegression class in scikit-learn. You can use the accompanying Jupyter notebook to run the next section:

    import mlflow
    from sklearn.linear_model import LogisticRegression
    mlflow.sklearn.autolog()
    with mlflow.start_run():
        clf = LogisticRegression()
        clf.fit(X_train, y_train)

    The mlflow.sklearn.autolog() instruction enables you to automatically log the experiment in the local directory. It captures the metrics produced by the underlying ML library in use. MLflow Tracking is the module responsible for handling metrics and logs. By default, the metadata of an MLflow run is stored in the local filesystem.

  4. After running the preceding excerpt in the accompanying notebook, list the contents of your home directory with the following command. You should now see the following files:
    $ ls -l 
    total 24
    -rw-r--r-- 1 jovyan users 12970 Oct 14 16:30 chapther_01_introducing_ml_flow.ipynb
    -rw-r--r-- 1 jovyan users    53 Sep 30 20:41 Dockerfile
    drwxr-xr-x 4 jovyan users   128 Oct 14 16:32 mlruns
    -rwxr-xr-x 1 jovyan users    97 Oct 14 13:20 run.sh

    The mlruns folder is generated alongside your notebook folder and contains all the experiments executed by your code in the current context.

    The mlruns folder will contain a folder with a sequential number identifying your experiment. The outline of the folder will appear as follows:

    ├── 46dc6db17fb5471a9a23d45407da680f
    │   ├── artifacts
    │   │   └── model
    │   │       ├── MLmodel
    │   │       ├── conda.yaml
    │   │       ├── input_example.json
    │   │       └── model.pkl
    │   ├── meta.yaml
    │   ├── metrics
    │   │   └── training_score
    │   ├── params
    │   │   ├── C
    │   │   …..
    │   └── tags
    │       ├── mlflow.source.type
    │       └── mlflow.user
    └── meta.yaml

    So, with very little effort, we have a lot of traceability available to us, and a good foundation to improve upon.
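To make the folder layout concrete, the following is a minimal sketch that lists the runs stored under a local mlruns folder using only the Python standard library. It assumes the layout shown in this section (mlruns/<experiment>/<run>/meta.yaml) and nothing more; the helper name list_runs is hypothetical, and the temporary folder only mimics the structure so that the example runs without MLflow installed.

```python
import tempfile
from pathlib import Path


def list_runs(mlruns_root):
    """Return (experiment_id, run_id) pairs found under a local mlruns folder.

    Assumes the layout shown in this section: mlruns/<experiment>/<run>/meta.yaml.
    """
    root = Path(mlruns_root)
    runs = []
    for experiment_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for run_dir in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
            # A run folder is recognised here by its own meta.yaml file.
            if (run_dir / "meta.yaml").is_file():
                runs.append((experiment_dir.name, run_dir.name))
    return runs


# Build a throwaway folder that mimics the layout, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "mlruns" / "0" / "46dc6db17fb5471a9a23d45407da680f"
    (run_dir / "artifacts" / "model").mkdir(parents=True)
    (run_dir / "meta.yaml").write_text("experiment_id: '0'\n")
    (run_dir.parent / "meta.yaml").write_text("experiment_id: '0'\n")
    found_runs = list_runs(Path(tmp) / "mlruns")

print(found_runs)
```

In the notebook environment, you could point list_runs at the real mlruns folder instead of the temporary one.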

The run in the preceding sample is identified by its UUID, 46dc6db17fb5471a9a23d45407da680f. Inside the run directory, there is a YAML file named meta.yaml, with content such as the following:

artifact_uri: file:///home/jovyan/mlruns/0/518d3162be7347298abe4c88567ca3e7/artifacts
end_time: 1602693152677
entry_point_name: ''
experiment_id: '0'
lifecycle_stage: active
name: ''
run_id: 518d3162be7347298abe4c88567ca3e7
run_uuid: 518d3162be7347298abe4c88567ca3e7
source_name: ''
source_type: 4
source_version: ''
start_time: 1602693152313
status: 3
tags: []
user_id: jovyan
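The start_time and end_time values in this file are epoch timestamps in milliseconds. As a small illustration, the following sketch converts the values copied from the listing into a readable UTC date and computes how long the run took. The helper name from_epoch_ms is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the meta.yaml listing.
start_time_ms = 1602693152313
end_time_ms = 1602693152677

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(value_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=value_ms)


started_at = from_epoch_ms(start_time_ms)
duration_ms = end_time_ms - start_time_ms

print(started_at.isoformat(timespec="seconds"))  # 2020-10-14T16:32:32+00:00
print(duration_ms)  # 364
```

The start time lines up with the Oct 14 16:32 timestamp of the mlruns folder in the earlier directory listing.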

The meta.yaml file holds the basic metadata of your run: the start time, end time, identification of the run (run_id and run_uuid), the life cycle stage, and the user who executed it. The values here come from a default run, but they still provide valuable and readable information about your experiment. The artifacts folder of the run is laid out as follows:

├── 46dc6db17fb5471a9a23d45407da680f
│   ├── artifacts
│   │   └── model
│   │       ├── MLmodel
│   │       ├── conda.yaml
│   │       ├── input_example.json
│   │       └── model.pkl

The model.pkl file contains a serialized version of the model. For a scikit-learn model, this is a binary (pickled) representation of the fitted Python object. With autologging, the metrics are taken from the underlying machine learning library in use. The default packaging strategy is based on a conda.yaml file, which lists the dependencies needed to load and run the serialized model.

The MLmodel file is the main definition of the model from MLflow's perspective, with information on how to run inference on the current model.

The metrics folder contains the training score value of this particular run of the training process, which can be used to benchmark the model with further model improvements down the line.

The params folder in the first folder listing contains the default parameters of the logistic regression model, each stored automatically in its own file.
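Because params and metrics are plain files, they can be inspected without MLflow. The following is a minimal sketch using only the standard library. It rests on two assumptions that are not stated in this chapter: each file in params holds the parameter value as text, and each line of a metric file has the form "<timestamp> <value> [<step>]". The helper names are hypothetical, and the metric value 0.95 is a placeholder, not a real result.

```python
import tempfile
from pathlib import Path


def read_params(run_dir):
    """Map each file in params/ to its text content (one parameter per file)."""
    params_dir = Path(run_dir) / "params"
    return {p.name: p.read_text().strip()
            for p in sorted(params_dir.iterdir()) if p.is_file()}


def read_latest_metrics(run_dir):
    """Map each file in metrics/ to the value recorded on its last line.

    Assumes each line looks like "<timestamp> <value> [<step>]".
    """
    latest = {}
    for path in sorted((Path(run_dir) / "metrics").iterdir()):
        if not path.is_file():
            continue
        lines = [line for line in path.read_text().splitlines() if line.strip()]
        if lines:
            # The value is the second whitespace-separated field.
            latest[path.name] = float(lines[-1].split()[1])
    return latest


# Mimic a run folder with placeholder content so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "46dc6db17fb5471a9a23d45407da680f"
    (run_dir / "params").mkdir(parents=True)
    (run_dir / "metrics").mkdir()
    (run_dir / "params" / "C").write_text("1.0")
    (run_dir / "metrics" / "training_score").write_text("1602693152500 0.95 0\n")
    params = read_params(run_dir)
    metrics = read_latest_metrics(run_dir)

print(params)
print(metrics)
```

For anything beyond a quick inspection, the MLflow Tracking API and UI are the intended way to read this data, since the on-disk format may differ between MLflow versions.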
