
You're reading from Practical Machine Learning on Databricks

Product type: Book
Published in: Nov 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781801812030
Edition: 1st Edition
Author: Debu Sinha

Debu is an experienced Data Science and Engineering leader with deep expertise in Software Engineering and Solutions Architecture. With over 10 years in the industry, Debu has a proven track record in designing scalable Software Applications, Big Data, and Machine Learning systems. As Lead ML Specialist on the Specialist Solutions Architect team at Databricks, Debu focuses on AI/ML use cases in the cloud and serves as an expert on LLMs, Machine Learning, and MLOps. With prior experience as a startup co-founder, Debu has demonstrated skills in team-building, scaling, and delivering impactful software solutions. An established thought leader, Debu has received multiple awards and regularly speaks at industry events.


Create a Baseline Model Using Databricks AutoML

In the last chapter, we explored MLflow and its components. After running the notebook from Chapter 4, Understanding MLflow Components on Databricks, you may have noticed how easy it is to start tracking your ML model training in Databricks using the integrated MLflow tracking server. In this chapter, we will cover another unique feature of Databricks: AutoML.

Databricks AutoML, like all the other features that are part of the Databricks workspace, is fully integrated with MLflow features and the Feature Store.

At the time of writing, Databricks AutoML supports classification, regression, and forecasting use cases using traditional ML algorithms, not deep learning. You can find a list of supported algorithms in the second section of this chapter.

You can use AutoML with a table registered in Databricks’ Hive metastore, feature tables, or even upload a new file using the...

Technical requirements

To work through this chapter, you'll need the following:

  • The notebooks accompanying Chapter 3, which ingest raw data from a CSV file into a Delta table and register a new feature table, must already have been executed.

Understanding the need for AutoML

If you have never worked with any AutoML framework before, you might be wondering what AutoML is and when and how it can be useful.

AutoML simplifies the machine learning model development process by automating various tasks. It automatically generates baseline models tailored to your dataset and provides preconfigured notebooks to kickstart your projects. This saves valuable time in the initial stages of model development: instead of manually crafting models from scratch, AutoML offers a quick and efficient way to obtain baselines, making it valuable to beginners and experienced data scientists alike.
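The kind of baseline sweep AutoML automates can be sketched in plain scikit-learn: fit a few candidate models on a training split, score them on a validation split, and keep the best. This is our own minimal illustration (the synthetic dataset and the candidate list are assumptions for the sketch, not what Databricks AutoML actually uses internally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real training table
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Candidate models a baseline sweep might evaluate
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score each candidate on the held-out split and keep the best
scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

AutoML does the same search at scale, with broader search spaces, automatic preprocessing, and every result logged for later comparison.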

AutoML makes machine learning accessible not only to experienced practitioners but also to citizen data scientists and business subject matter experts. While undoubtedly a powerful tool, AutoML also grapples with significant limitations. One notable challenge is its...

Understanding AutoML in Databricks

Databricks AutoML uses a glass-box approach to AutoML. When you use Databricks AutoML, either through the UI or through the supported Python API, it logs every combination of model and hyperparameters (a trial) as an MLflow run and generates a Python notebook with the source code for each trial. The results of all these trials are logged to the MLflow tracking server, and each trial can be compared and reproduced. Since you have access to the source code, data scientists can easily rerun a trial after modifying the code. We will look at this in more detail when we go over the example.

Databricks AutoML also prepares the dataset for training and then performs model training and hyperparameter tuning on the Databricks cluster. One important thing to keep in mind here is that Databricks AutoML spreads hyperparameter tuning trials across the cluster. A trial is a unique configuration of hyperparameters associated with the...
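To make the notion of a trial concrete, here is a small sketch of how a set of trials can be enumerated from a hyperparameter search space. The parameter names and values are hypothetical, chosen only for illustration; Databricks AutoML defines its own search space per algorithm:

```python
from itertools import product

# Hypothetical search space -- Databricks AutoML builds its own per algorithm
search_space = {
    "max_depth": [3, 5, 10],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 200],
}

# Each trial is one unique combination of hyperparameter values;
# AutoML distributes these trials across the worker nodes of the cluster.
keys = list(search_space)
trials = [dict(zip(keys, values)) for values in product(*search_space.values())]

print(len(trials))  # 3 * 2 * 2 = 12 distinct trials
```

Because each trial is independent of the others, the cluster can evaluate many of them in parallel, which is what makes distributing the tuning work straightforward.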

Running AutoML on our churn prediction dataset

Let’s take a look at how to use Databricks AutoML with our bank customer churn prediction dataset.

If you executed the notebooks from Chapter 3, Utilizing the Feature Store, you will have the raw data available as a Delta table named raw_data in your Hive metastore. In the Chapter 3 code, we read a CSV file containing the raw data from our Git repository, wrote it out as a Delta table, and registered it in the integrated metastore; take a look at cmd 15 in your notebook. In your environment, the dataset may come from another data pipeline or be uploaded directly to the Databricks workspace using the Upload file functionality.
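Besides the UI flow we walk through next, an AutoML experiment can also be launched from the Python API. The sketch below is illustrative rather than taken from the chapter's notebooks: the table name follows the chapter (raw_data), while the target column, metric, and timeout are assumptions you should adjust to your own schema. It only runs inside a Databricks notebook, where `spark` is predefined:

```python
from databricks import automl

# Load the raw churn table registered in Chapter 3
df = spark.table("raw_data")  # `spark` is predefined in Databricks notebooks

# Launch an AutoML classification experiment; every trial is logged to MLflow
summary = automl.classify(
    dataset=df,
    target_col="Exited",      # assumed churn label column -- match your schema
    primary_metric="f1",
    timeout_minutes=30,
)

# Inspect the best trial produced by the experiment
print(summary.best_trial.model_path)
```

The returned summary links each trial to its MLflow run and generated notebook, which is the glass-box behavior described earlier.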

To view the tables, you need to have your cluster up and running.

Figure 5.1 – The location of the raw dataset


Let’s create our first Databricks AutoML experiment.

Important note

Make sure that before following the next steps, you have a cluster up and running...

Summary

In this chapter, we covered the importance of AutoML and how it can help data scientists get started and become productive with the problem at hand. We then covered the Databricks AutoML glass-box approach, which makes it easy to interpret model results and automatically capture lineage. We also learned how Databricks AutoML integrates with the MLflow tracking server within the Databricks workspace.

In the next chapters, we will go over managing your ML model’s life cycle using the MLflow model registry and Webhooks in more detail.

Further reading

