
You're reading from Practical Machine Learning on Databricks

Product type: Book
Published in: Nov 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781801812030
Edition: 1st
Author (1)
Debu Sinha

Debu is an experienced Data Science and Engineering leader with deep expertise in Software Engineering and Solutions Architecture. With over 10 years in the industry, Debu has a proven track record in designing scalable Software Applications, Big Data, and Machine Learning systems. As Lead ML Specialist on the Specialist Solutions Architect team at Databricks, Debu focuses on AI/ML use cases in the cloud and serves as an expert on LLMs, Machine Learning, and MLOps. With prior experience as a startup co-founder, Debu has demonstrated skills in team-building, scaling, and delivering impactful software solutions. An established thought leader, Debu has received multiple awards and regularly speaks at industry events.

Understanding MLflow Components on Databricks

In the previous chapter, we learned about Feature Store, what problem it solves, and how Databricks provides the built-in Feature Store as part of the Databricks machine learning (ML) workspace, which we can use to register our feature tables.

In this chapter, we will look at managing model training, tracking, and experimentation. In the software engineering world, code development and productionization follow established best practices; such practices, however, are not yet widely adopted in ML engineering and data science. While working with many Databricks customers, I observed that each data science team has its own way of managing its projects. This is where MLflow comes in.

MLflow is an umbrella project, originally developed by engineers at Databricks, that brings a standardized ML life cycle management tool to the Databricks platform. It is now an open source project with more than 500,000 daily downloads on average...

Technical requirements

All the code is available in this book's GitHub repository (https://github.com/PacktPublishing/Practical-Machine-Learning-on-Databricks) and is self-contained. To execute the notebooks, you can import the code repository directly into your Databricks workspace using Repos, which we discussed in previous chapters.

This chapter also assumes that you have a preliminary understanding of what user-defined functions are in Apache Spark. You can read more about them here: https://docs.databricks.com/en/udf/index.html.

Overview of MLflow

The ML life cycle is complex. It starts with ingesting raw data from various batch and streaming sources into the data lake or Delta lake. Data engineers create data pipelines using tools such as Apache Spark with Python, R, SQL, or Scala to process large amounts of data in a scalable, performant, and cost-effective manner.

The data scientists then utilize the various curated datasets in the data lake to generate feature tables to train their ML models. The data scientists prefer programming languages such as Python and R for feature engineering and libraries such as scikit-learn, pandas, NumPy, PyTorch, or any other popular ML or deep learning libraries for training and tuning ML models.

Once the models have been trained, they need to be deployed in production either as a representational state transfer (REST) application programming interface (API) for real-time inference, or as a user-defined function (UDF) for batch and stream inference on Apache...

MLflow Tracking

MLflow Tracking allows you to track the training of your ML models and improves the observability of the model-training process: you can log the generated metrics, artifacts, and the model itself as part of training. MLflow Tracking also keeps track of model lineage in the Databricks environment, where the exact version of the notebook responsible for generating the model is listed as its source.

MLflow also provides automatic logging (autolog) capabilities that automatically log many metrics, parameters, and artifacts while performing model training. We can also add our own set of metrics and artifacts to the log.

Using MLflow Tracking, we can chronologically track model training. Certain terms are specific to MLflow Tracking. Let’s take a look at them:

  • Experiments: Training and tuning the ML model for a business problem is an experiment. By default, each Python notebook...

MLflow Models

MLflow Models is a standard packaging format for ML models. It provides a standardized abstraction on top of the ML models created by data scientists. Each MLflow model is essentially a directory with an MLmodel file at its root that lists the flavors in which the model can be used.

Flavors represent a fundamental concept that empowers MLflow Models by providing a standardized approach for deployment tools to comprehend and interact with ML models. This innovation eliminates the need for each deployment tool to integrate with every ML library individually. MLflow introduces several “standard” flavors, universally supported by its built-in deployment tools. For instance, the “Python function” flavor outlines how to execute the model as a Python function. However, the versatility of flavors extends beyond these standards. Libraries have the flexibility to define and employ their own flavors. As an example...
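For illustration, an MLmodel file for a scikit-learn model might look roughly like the following; the exact fields and values vary by library and MLflow version, and the identifiers here are placeholders.

```yaml
# Illustrative MLmodel file (fields vary by MLflow version)
artifact_path: model
flavors:
  python_function:          # generic flavor: run the model as a Python function
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.10.12
  sklearn:                  # library-specific flavor
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.3.0
run_id: <run-id>
```

A deployment tool only needs to understand one of the listed flavors, which is what decouples serving infrastructure from the training library.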

MLflow Model Registry

MLflow Model Registry is a tool that collaboratively manages the life cycle of all the MLflow Models in a centralized manner across an organization. In Databricks, the integrated Model Registry provides granular access control over who can transition models from one stage to another.

MLflow Model Registry allows multiple versions of a model in a particular stage. It enables the transition of the best-suited model version between the Staging, Production, and Archived stages, either programmatically or through a human-in-the-loop deployment process. Choosing one strategy over the other depends on the use case and on how comfortable a team is with automating the entire process of ML model promotion and testing. We will take a deeper look at this in Chapter 6, Model Versioning and Webhooks.

Model Registry also logs model descriptions, lineage, and promotion activity from one stage to another, providing full traceability.

We will look into the...

Example code showing how to track ML model training in Databricks

Before proceeding, it’s important to ensure that you’ve already cloned the code repository that accompanies this book, as outlined in Chapter 3. Additionally, please verify that you have executed the associated notebook for Chapter 3. These preparatory steps are essential to fully engage with the content and exercises presented here:

  1. Go to Chapter 04 and click on the mlflow-without-featurestore notebook:
Figure 4.3 – The code that accompanies this chapter

Make sure you have a cluster up and running and that the cluster is attached to this notebook, as you did with the notebook from Chapter 3, Utilizing the Feature Store.

  2. Cmd 3 demonstrates the use of notebook-scoped libraries, which can be installed using the %pip magic command. As a best practice, keep the %pip command in one of the topmost cells of your notebook, as it restarts the Python interpreter...
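Such a cell might look like the following; the package pin is illustrative, not the version used by the book's notebook.

```
# Databricks notebook cell (notebook-scoped install; version illustrative)
%pip install mlflow==2.9.2
```

Because %pip restarts the Python interpreter, any variables defined in earlier cells are lost, which is why the install belongs at the top of the notebook.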

Summary

In this chapter, we covered the various components of MLflow and how they work together to make the end-to-end ML project life cycle easy to manage. We learned about MLflow Tracking, Projects, Models, and Model Registry, and the purpose each serves. Understanding these concepts is essential for effectively managing end-to-end ML projects in the Databricks environment.

In the next chapter, we will look at the AutoML capabilities of Databricks in detail and how we can utilize them to create our baseline models for ML projects.

