You're reading from Distributed Data Systems with Azure Databricks

Product type: Book
Published in: May 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781838647216
Edition: 1st

Author: Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, where he helps the company create a scalable data pipeline. Alan graduated with a mechanical engineering degree from the National University of Tucuman in 2015, participated as a founder in startups, and later earned a master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

Chapter 10: Model Tracking and Tuning in Azure Databricks

In the previous chapter, we learned how to create machine learning and deep learning models, as well as how to load datasets during distributed training in Azure Databricks. Finding the right algorithm to solve a problem with machine learning is one thing, but finding the best hyperparameters for it is an equally or even more complex task. In this chapter, we will focus on model tuning, deployment, and version control by using MLflow as a model repository. We will also use Hyperopt to search for the best set of hyperparameters for our models. We will apply these libraries to models built with the scikit-learn Python library.

More concretely, we will learn how to track training runs to find the optimal set of hyperparameters, deploy and manage version control for our models using MLflow, and use Hyperopt as one of the...

Technical requirements

To work on the examples in this chapter, you must have the following:

  • An Azure Databricks subscription.
  • An Azure Databricks notebook attached to a running cluster with Databricks Runtime ML version 7.0 or higher.

Tuning hyperparameters with AutoML

In machine learning and deep learning, hyperparameter tuning is the process of selecting the set of optimal hyperparameters that will be used by our learning algorithm. Here, hyperparameters are values that control the learning process, whereas other parameters are learned from the data. In this sense, the term hyperparameter follows its statistical meaning: a parameter of a prior distribution that captures our belief before we start to learn from the data.

In machine learning and deep learning, it is also common to use the term hyperparameter for any parameter that is set before we start to train our model. These parameters control the training process. Some examples of hyperparameters used in deep learning are as follows:

  • Learning rate
  • Number of epochs
  • Hidden layers
  • Hidden units
  • Activation functions

These parameters will directly influence the performance...
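The distinction between hyperparameters and learned parameters can be sketched with scikit-learn. This is a minimal illustration under assumed choices (a toy dataset and an `SGDClassifier`), not an example from the chapter itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A small synthetic dataset, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters are fixed *before* training and control the learning
# process: here, the learning rate (eta0) and the number of passes over
# the data (max_iter).
clf = SGDClassifier(learning_rate="constant", eta0=0.01, max_iter=20,
                    tol=None, random_state=0)
clf.fit(X, y)

# Parameters, in contrast, are *learned from the data* during training.
print(clf.coef_.shape)  # one weight per feature: (1, 5)
```

Changing `eta0` or `max_iter` changes how training proceeds; the values in `coef_` are what training produces.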

Automating model tracking with MLflow

As we mentioned previously, MLflow is an open-source platform for managing machine and deep learning model life cycles, which allows us to perform experiments, ensure reproducibility, and support easy model deployment. It also provides us with a centralized model registry. As a general overview, the components of MLflow are as follows:

  • MLflow Tracking: Records all the data associated with an experiment, such as its code, data, configuration, and results.
  • MLflow Projects: Packages code in a format that ensures results can be reproduced between runs, regardless of the platform.
  • MLflow Models: Provides a deployment platform for our machine learning and deep learning models.
  • Model Registry: The central repository for our machine learning and deep learning models.

In this section, we will focus on MLflow Tracking, which is the component that allows us to log and register the code, properties, hyperparameters...

Hyperparameter tuning with Hyperopt

The Azure Databricks Runtime for Machine Learning includes Hyperopt, a Python library intended for use on distributed computing systems to facilitate the search for an optimal set of hyperparameters. At its core, it receives a function that we need to either minimize or maximize, and a set of parameters that define the search space. With this information, Hyperopt explores the search space to find the optimal set of hyperparameters. Hyperopt uses adaptive search algorithms, such as the Tree-structured Parzen Estimator (TPE), which are typically much more efficient than exhaustive approaches such as grid search or naive random search.

Hyperopt is optimized for use in distributed computing environments and provides support for libraries such as PySpark MLlib and Horovod, the latter of which is a library for distributed deep learning training that we will focus on later. It can also be applied in single-machine environments...

Optimizing model selection with scikit-learn, Hyperopt, and MLflow

As we saw in the previous sections, Hyperopt is a Python library that allows us to track optimization runs and can be used for hyperparameter tuning in distributed computing environments such as Azure Databricks. In this section, we will go through an example of training a scikit-learn model. We will use Hyperopt to track the tuning process and log the results to MLflow, the model life cycle management platform.

In the Azure Databricks Runtime for Machine Learning, we have an optimized version of Hyperopt at our disposal that supports MLflow tracking. Here, we can use the SparkTrials object to log the results of tuning single-machine models during parallel executions. We will use these tools to find the best set of hyperparameters for several scikit-learn models.

We will do the following:

  • Prepare the training dataset.
  • Use Hyperopt to define the objective function to be minimized.
  • ...

Summary

In this chapter, we learned about some of the valuable features of Azure Databricks that allow us to track training runs and find the optimal set of hyperparameters for machine learning models using MLflow. We also learned how to optimize the way we scan the search space of hyperparameters using Hyperopt. This is a great set of tools because we can fine-tune models while keeping complete tracking of the hyperparameters used for training. We also explored a defined search space of hyperparameters using adaptive search strategies, which are much more efficient than the common grid and random search strategies.

In the next chapter, we will explore how to use the MLflow Model Registry, which is integrated into Azure Databricks. MLflow makes it easier to keep track of the entire life cycle of a machine learning model and all the associated parameters and artifacts used in the training process, but it also allows us to deploy these models...
