Overview of ML on Databricks

This chapter will give you a fundamental understanding of how to get started with ML on Databricks. The ML workspace is data scientist-friendly and enables rapid ML development by providing out-of-the-box support for popular ML libraries such as TensorFlow and PyTorch.

We will cover setting up a trial Databricks account and tour the various ML-specific features the Databricks workspace puts at ML practitioners' fingertips. You will also learn how to start a cluster on Databricks and create a new notebook.

In this chapter, we will cover these main topics:

  • Setting up a Databricks trial account
  • Introduction to the ML workspace on Databricks
  • Exploring the workspace
  • Exploring clusters
  • Exploring notebooks
  • Exploring data
  • Exploring experiments
  • Discovering the feature store
  • Discovering the model registry
  • Libraries

These topics will cover the essential features to perform effective...

Technical requirements

For this chapter, you'll need access to a Databricks workspace with cluster creation privileges. By default, the owner of the workspace has permission to create clusters. We will cover clusters in more detail in the Exploring clusters section. You can read more about the various cluster access control options here: https://docs.databricks.com/security/access-control/cluster-acl.html.

Setting up a Databricks trial account

At the time of writing, Databricks is available on all the major cloud platforms, namely Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS).

Databricks provides an easy way to either create a free Community Edition account or start a 14-day trial with all the enterprise features available in the workspace.

To fully leverage the code examples provided in this book and explore the enterprise features we’ll cover, I highly recommend taking advantage of the 14-day trial option. This trial will grant you access to all the necessary functionalities, ensuring a seamless experience throughout your learning journey.

Use the following link to sign up for a trial account: https://www.databricks.com/try-databricks?itm_data=PricingPage-Trial#account

After filling out the introductory form, you will be redirected to a page with options to start a trial deployment on any of the three...

Exploring the workspace

Each user of the Databricks ML environment has a workspace. Within it, users can create notebooks and develop code in isolation or collaborate with teammates through granular access controls. You will spend most of your time in the Databricks environment working within the workspace or repos. We will learn more about repos in the Repos section:

Figure 2.3 – The Workspace tab

It's important to note that the Workspace area is primarily intended for Databricks notebooks. While workspace notebooks do support version control through Git providers, this legacy capability is no longer recommended; repos are now the preferred way to version-control your code.

Version control, in the context of software development, is a system that helps track changes made to files over time. It allows you to maintain...

Exploring clusters

Clusters are the primary computing units that will do the heavy lifting when you’re training your ML models. The VMs associated with a cluster are provisioned in Databricks users’ cloud subscriptions; however, the Databricks UI provides an interface to control the cluster type and its settings.

Clusters are ephemeral compute resources. No data is stored on clusters:

Figure 2.6 – The Clusters tab

The Pools feature allows end users to create pools of Databricks VMs. One of the benefits of working in a cloud environment is that you can request new compute resources on demand, pay by the second, and release the compute once the load on the cluster drops. This is great; however, requesting a VM from the cloud provider, booting it, and adding it to a cluster still takes some time. Using pools, you can pre-provision VMs and keep them in a standby state. If a cluster requests new nodes and has access...
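
While clusters are usually created through the UI shown above, the same settings can be scripted against the Databricks Clusters REST API. The following is a minimal sketch, assuming authentication with a personal access token; the workspace URL, token, node type, and runtime version are placeholders you would substitute, and the commented-out `instance_pool_id` field shows how a cluster would draw its nodes from a pre-provisioned pool:

```python
# A sketch of creating a cluster via the Databricks Clusters REST API
# (POST /api/2.0/clusters/create). All values below are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

cluster_spec = {
    "cluster_name": "ml-dev-cluster",
    # An ML runtime version; list the versions available in your workspace.
    "spark_version": "13.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",  # AWS example; node types vary by cloud
    "autoscale": {"min_workers": 1, "max_workers": 4},
    # "instance_pool_id": "<pool-id>",  # uncomment to draw nodes from a pool
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # contains the new cluster_id
```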

Exploring notebooks

If you are familiar with Jupyter and IPython notebooks, then Databricks notebooks will look very familiar. A Databricks notebook development environment consists of cells where end users can interactively write code in R, Python, Scala, or SQL.

Databricks notebooks also have additional functionality such as integration with the Spark UI, powerful built-in visualizations, version control, and an integrated MLflow tracking server. We can also parameterize a notebook and pass parameters to it at execution time, as the following sketch shows.
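
Parameterization is done with notebook widgets. Here is a minimal sketch using the `dbutils.widgets` API available inside Databricks notebooks; the parameter name and default value are illustrative:

```python
# Declare a text widget; when the notebook is run as a job, a value for
# "input_date" can be supplied at execution time and overrides the default.
dbutils.widgets.text("input_date", "2023-01-01", "Input date")

# Read the widget's current value inside the notebook.
input_date = dbutils.widgets.get("input_date")
print(f"Processing data for {input_date}")
```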

We will cover notebooks in more detail, as the code examples presented in this book use the Databricks notebook environment. Additional details about notebooks can be found at https://docs.databricks.com/notebooks/index.html:

Figure 2.9 – Databricks notebooks

Next, let's take a look at the Data tab, which exposes the Databricks metastore.

Exploring data

By default, when a new workspace is deployed, it comes with a managed Hive metastore. A metastore allows you to register datasets in various formats, such as Comma-Separated Values (CSV), Parquet, Delta, text, or JavaScript Object Notation (JSON), as external tables (https://docs.databricks.com/data/metastores/index.html). We won't go into too much detail about the metastore here:

Figure 2.10 – The Data tab

It's all right if you are not familiar with the term metastore. In simple terms, it is similar to a relational database: relational databases organize data into databases, tables, and schemas, and the end user interacts with that data using SQL. Similarly, in Databricks, end users can register datasets stored in cloud storage so that they're available as tables. You can learn more here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/index...
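
As an illustration, the following sketch reads a CSV file from cloud storage and registers it as a table in the metastore; the file path, database, and table names are hypothetical, and `spark` and `display` are available by default in Databricks notebooks:

```python
# Create a database to hold the table (names here are hypothetical).
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Read a CSV file from cloud storage into a Spark DataFrame.
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/FileStore/raw/customers.csv")  # hypothetical path
)

# Register the DataFrame as a managed table in the metastore; it then
# appears on the Data tab and is queryable with SQL.
df.write.mode("overwrite").saveAsTable("demo_db.customers")

display(spark.sql("SELECT * FROM demo_db.customers LIMIT 5"))
```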

Exploring experiments

As the name suggests, experiments are the central location from which all the model training runs pertinent to a business problem can be accessed. Users can define their own name for an experiment or use a default system-generated one, and log their different ML model training runs under it. Experiments in the Databricks UI come from the integration of MLflow into the platform. We will dive deeper into MLflow in the coming chapters; for now, it's important to get a sense of what MLflow is and some MLflow-specific terminology.

MLflow is an open source platform for managing the end-to-end ML life cycle. Here are the key components of MLflow:

  • Tracking: This allows you to track experiments to record and compare parameters and results (see the sketch that follows this list).
  • Models: This component helps manage and deploy models from various ML libraries to a variety of model serving and inference platforms.
  • Projects: This allows you to package ML code in a reusable...
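
To make the Tracking component concrete, here is a minimal sketch of logging a run with the MLflow API; on Databricks, runs are recorded to the workspace tracking server automatically, and the experiment path, parameter, and metric below are all illustrative:

```python
import mlflow

# Hypothetical experiment path in the workspace; on Databricks, a
# notebook-scoped experiment is used by default if none is set.
mlflow.set_experiment("/Users/your.name@example.com/churn-experiment")

with mlflow.start_run(run_name="baseline"):
    # Illustrative parameter and metric values.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("accuracy", 0.87)
```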

Discovering the feature store

The feature store is a relatively new but stable addition to the Databricks ML workspace. Many organizations with mature ML processes, such as Uber, Facebook, and DoorDash, have implemented their own feature stores internally.

ML life cycle management and workflows are complex. Forbes reported on a survey of data scientists (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says) that found preparing and managing data to be the most expensive and time-consuming part of their day-to-day work.

Data scientists spend a lot of time finding data, cleaning it, performing exploratory data analysis (EDA), and then engineering features to train their ML models. This is an iterative process, and making it repeatable takes enormous effort. This is where feature stores come in.

Databricks Feature Store is standardized on the...
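
As a preview of the API we will use in the next chapter, here is a minimal sketch of registering a feature table with the Databricks `FeatureStoreClient`; the table name, primary key, and the `features_df` DataFrame are hypothetical:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Hypothetical feature DataFrame with a "customer_id" column that
# uniquely identifies each row; reusing the table from the Data section.
features_df = spark.table("demo_db.customers")

fs.create_table(
    name="ml_features.customer_features",  # hypothetical database.table
    primary_keys=["customer_id"],
    df=features_df,
    description="Per-customer features for churn prediction",
)
```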

Discovering the model registry

The Models tab is a fully managed, integrated MLflow model registry available in every deployed Databricks ML workspace. The registry has its own set of APIs and a UI that let data scientists across the organization collaborate on and fully manage MLflow models. Data scientists and ML engineers can develop models in any of the supported ML frameworks (https://mlflow.org/docs/latest/models.html#built-in-model-flavors) and package them in the generic MLflow model format:

Figure 2.15 – The Models tab

The model registry provides features to manage versioning, tagging, and stage transitions between environments (moving models from Staging to Production to Archived):

Figure 2.16 – The Registered Models tab
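
These operations can also be scripted. The following is a minimal sketch that registers a model logged by a previous training run and transitions it to the Staging stage; the run ID and model name are illustrative placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-from-a-tracked-training-run>"  # placeholder

# Register the model artifact logged at the "model" path of that run.
model_version = mlflow.register_model(f"runs:/{run_id}/model", "churn-classifier")

# Transition the new version to Staging (other stages: Production, Archived).
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Staging",
)
```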

Before we move on, there is another important feature that we need to understand: the Libraries feature of Databricks. This feature allows users to utilize third-party or custom...

Libraries

Libraries are fundamental building blocks of any programming ecosystem. They are akin to toolboxes, comprising pre-compiled routines that offer enhanced functionality and assist in optimizing code efficiency. In Databricks, libraries are used to make third-party or custom code available to notebooks and jobs running on clusters. These libraries can be written in various languages, including Python, Java, Scala, and R.
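
For Python packages, the most common pattern is the `%pip` magic command, which installs a notebook-scoped library on the attached cluster; the package and version here are illustrative:

```python
# Installs a notebook-scoped Python library on the attached cluster;
# the installation applies to this notebook's session only.
%pip install xgboost==1.7.6
```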

Storing libraries

When it comes to storage, libraries uploaded using the library UI are stored in the Databricks File System (DBFS) root. However, all workspace users can modify data and files stored in the DBFS root. If a more secure storage option is desired, you can opt to store libraries in cloud object storage, use library package repositories, or upload libraries to workspace files.

Managing libraries

Library management in Databricks can be handled via three different interfaces: the workspace UI, the command-line interface (CLI), or the Libraries...

Summary

In this chapter, we got a brief overview of all the components of the Databricks ML workspace. This will enable us to use these components in a hands-on fashion as we efficiently train and deploy ML models for various problems in the Databricks environment.

In the next chapter, we will start working on a customer churn prediction problem and register our first feature tables in the Databricks feature store.

Further reading

To learn more about the topics that were covered in this chapter, take a look at the following resources...
