
You're reading from Practical Machine Learning on Databricks

Product type: Book
Published in: Nov 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781801812030
Edition: 1st
Author (1)
Debu Sinha

Debu is an experienced Data Science and Engineering leader with deep expertise in Software Engineering and Solutions Architecture. With over 10 years in the industry, Debu has a proven track record in designing scalable Software Applications, Big Data, and Machine Learning systems. As Lead ML Specialist on the Specialist Solutions Architect team at Databricks, Debu focuses on AI/ML use cases in the cloud and serves as an expert on LLMs, Machine Learning, and MLOps. With prior experience as a startup co-founder, Debu has demonstrated skills in team-building, scaling, and delivering impactful software solutions. An established thought leader, Debu has received multiple awards and regularly speaks at industry events.

Understanding MLflow Components on Databricks

In the previous chapter, we learned about Feature Store, what problem it solves, and how Databricks provides the built-in Feature Store as part of the Databricks machine learning (ML) workspace, which we can use to register our feature tables.

In this chapter, we will look at managing model training, tracking, and experimentation. In the software engineering world, code development and productionization follow established best practices; such practices, however, are not yet widely adopted in ML engineering and data science. While working with many Databricks customers, I observed that each data science team has its own way of managing its projects. This is where MLflow comes in.

MLflow is an umbrella project, originally developed by engineers at Databricks, that brings a standardized ML life cycle management tool to the Databricks platform. It is now an open source project with more than 500,000 daily downloads on average...

Technical requirements

All the code is available in this book's GitHub repository (https://github.com/PacktPublishing/Practical-Machine-Learning-on-Databricks) and is self-contained. To execute the notebooks, you can import the code repository directly into your Databricks workspace using Repos, which we discussed in previous chapters.

This chapter also assumes that you have a preliminary understanding of what user-defined functions are in Apache Spark. You can read more about them here: https://docs.databricks.com/en/udf/index.html.

Overview of MLflow

The ML life cycle is complex. It starts with ingesting raw data from various batch and streaming sources into the data lake or Delta lake. Data engineers create data pipelines using tools such as Apache Spark with Python, R, SQL, or Scala to process large amounts of data in a scalable, performant, and cost-effective manner.

The data scientists then utilize the various curated datasets in the data lake to generate feature tables to train their ML models. The data scientists prefer programming languages such as Python and R for feature engineering and libraries such as scikit-learn, pandas, NumPy, PyTorch, or any other popular ML or deep learning libraries for training and tuning ML models.

Once the models have been trained, they need to be deployed in production either as a representational state transfer (REST) application programming interface (API) for real-time inference, or as a user-defined function (UDF) for batch and stream inference on Apache...

MLflow Tracking

MLflow Tracking allows you to track the training of your ML models and improves the observability of the model-training process: you can log the generated metrics, artifacts, and the model itself as part of training. MLflow Tracking also keeps track of model lineage in the Databricks environment, where the exact version of the notebook responsible for generating the model is listed as its source.

MLflow also provides automatic logging (autolog) capabilities that automatically log many metrics, parameters, and artifacts while performing model training. We can also add our own set of metrics and artifacts to the log.

Using MLflow Tracking, we can chronologically track model training. Certain terms are specific to MLflow Tracking. Let’s take a look at them:

  • Experiments: Training and tuning the ML model for a business problem is an experiment. By default, each Python notebook...

MLflow Models

MLflow Models is a standard packaging format for ML models. It provides a standardized abstraction on top of the ML models created by data scientists. Each MLflow model is essentially a directory with an MLmodel file at its root that lists the flavors in which the model can be used.

Flavors represent a fundamental concept that empowers MLflow Models by providing a standardized approach for deployment tools to comprehend and interact with ML models. This innovation eliminates the need for each deployment tool to integrate with every ML library individually. MLflow introduces several “standard” flavors, universally supported by its built-in deployment tools. For instance, the “Python function” flavor outlines how to execute the model as a Python function. However, the versatility of flavors extends beyond these standards. Libraries have the flexibility to define and employ their own flavors. As an example...
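For illustration, an MLmodel file for a scikit-learn model might look roughly like the following; the exact fields and values vary by library and MLflow version, and the identifiers here are placeholders.

```yaml
# Illustrative MLmodel file (fields vary by MLflow version)
artifact_path: model
flavors:
  python_function:          # generic flavor: run the model as a Python function
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.10.12
  sklearn:                  # library-specific flavor
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.3.0
run_id: <run-id>
```

A deployment tool only needs to understand one of the listed flavors, which is what decouples serving infrastructure from the training library.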

MLflow Model Registry

MLflow Model Registry is a tool that collaboratively manages the life cycle of all the MLflow Models in a centralized manner across an organization. In Databricks, the integrated Model Registry provides granular access control over who can transition models from one stage to another.

MLflow Model Registry allows multiple versions of a model in a particular stage. It enables the transition of the best-suited model version between the Staging, Production, and Archived stages, either programmatically or through a human-in-the-loop deployment process. Choosing one strategy over the other depends on the use case and on how comfortable a team is with automating the entire process of ML model promotion and testing. We will take a deeper look at this in Chapter 6, Model Versioning and Webhooks.

Model Registry also logs model descriptions, lineage, and promotion activity from one stage to another, providing full traceability.

We will look into the...

Example code showing how to track ML model training in Databricks

Before proceeding, it’s important to ensure that you’ve already cloned the code repository that accompanies this book, as outlined in Chapter 3. Additionally, please verify that you have executed the associated notebook for Chapter 3. These preparatory steps are essential to fully engage with the content and exercises presented here:

  1. Go to Chapter 04 and click on the mlflow-without-featurestore notebook:
Figure 4.3 – The code that accompanies this chapter

Make sure you have a cluster up and running and that the cluster is attached to this notebook, as you did with the notebook from Chapter 3, Utilizing the Feature Store.

  2. Cmd 3 demonstrates the use of notebook-scoped libraries, which can be installed using the %pip magic command. As a best practice, keep the %pip command in one of the topmost cells of your notebook, as it restarts the Python interpreter...
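Such a cell might look like the following; the package pin is illustrative, not the version used by the book's notebook.

```
# Databricks notebook cell (notebook-scoped install; version illustrative)
%pip install mlflow==2.9.2
```

Because %pip restarts the Python interpreter, any variables defined in earlier cells are lost, which is why the install belongs at the top of the notebook.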

Summary

In this chapter, we covered the various components of MLflow and how they work together to make the end-to-end ML project life cycle easy to manage. We learned about MLflow Tracking, Projects, Models, and Model Registry, and the purpose each serves. Understanding these concepts is essential for effectively managing end-to-end ML projects in the Databricks environment.

In the next chapter, we will look at the AutoML capabilities of Databricks in detail and how we can utilize them to create our baseline models for ML projects.

