Overview of ML on Databricks

This chapter will give you a fundamental understanding of how to get started with ML on Databricks. The ML workspace is data scientist-friendly and enables rapid ML development by providing out-of-the-box support for popular ML libraries such as TensorFlow and PyTorch.

We will cover setting up a trial Databricks account and tour the various ML-specific features the Databricks workspace puts at ML practitioners' fingertips. You will also learn how to start a cluster on Databricks and create a new notebook.

In this chapter, we will cover these main topics:

  • Setting up a Databricks trial account
  • Introduction to the ML workspace on Databricks
  • Exploring the workspace
  • Exploring clusters
  • Exploring notebooks
  • Exploring data
  • Exploring experiments
  • Discovering the feature store
  • Discovering the model registry
  • Libraries

These topics will cover the essential features to perform effective...

Technical requirements

For this chapter, you'll need access to a Databricks workspace with cluster creation privileges. By default, the owner of the workspace has permission to create clusters. We will cover clusters in more detail in the Exploring clusters section. You can read more about the various cluster access control options here: https://docs.databricks.com/security/access-control/cluster-acl.html.

Setting up a Databricks trial account

At the time of writing, Databricks is available on all the major cloud platforms, namely Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS).

Databricks provides an easy way to either create a free Community Edition account or start a 14-day trial with all the enterprise features available in the workspace.

To fully leverage the code examples provided in this book and explore the enterprise features we’ll cover, I highly recommend taking advantage of the 14-day trial option. This trial will grant you access to all the necessary functionalities, ensuring a seamless experience throughout your learning journey.

Use the following link to sign up for a trial account: https://www.databricks.com/try-databricks?itm_data=PricingPage-Trial#account

After filling out the introductory form, you will be redirected to a page with options to start a trial deployment on any of the three...

Exploring the workspace

Each user of the Databricks ML environment has a workspace. Within it, users can create notebooks and develop code in isolation or collaborate with teammates through granular access controls. You will spend most of your time in the Databricks environment working within the workspace or repos. We will learn more about repos in the Repos section:

Figure 2.3 – The Workspace tab

It's important to note that the Workspace area is primarily intended for Databricks notebooks. While workspace notebooks do support version control through Git providers, this legacy capability is no longer recommended; repos are now the preferred way to version-control your code.

Version control, in the context of software development, is a system that helps track changes made to files over time. It allows you to maintain...

Exploring clusters

Clusters are the primary computing units that will do the heavy lifting when you’re training your ML models. The VMs associated with a cluster are provisioned in Databricks users’ cloud subscriptions; however, the Databricks UI provides an interface to control the cluster type and its settings.

Clusters are ephemeral compute resources. No data is stored on clusters:

Figure 2.6 – The Clusters tab

The Pools feature allows end users to create pools of Databricks VMs. One of the benefits of working in a cloud environment is that you can request new compute resources on demand, pay by the second, and release the compute once the load on the cluster drops. This is great; however, requesting a VM from the cloud provider, booting it, and adding it to a cluster still takes some time. Using pools, you can pre-provision VMs and keep them in a standby state. If a cluster requests new nodes and has access...
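
While clusters are usually created through the UI shown above, the same settings can be scripted against the Databricks Clusters REST API. The following is a minimal sketch, assuming authentication with a personal access token; the workspace URL, token, node type, and runtime version are placeholders you would substitute, and the commented-out `instance_pool_id` field shows how a cluster would draw its nodes from a pre-provisioned pool:

```python
# A sketch of creating a cluster via the Databricks Clusters REST API
# (POST /api/2.0/clusters/create). All values below are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

cluster_spec = {
    "cluster_name": "ml-dev-cluster",
    # An ML runtime version; list the versions available in your workspace.
    "spark_version": "13.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",  # AWS example; node types vary by cloud
    "autoscale": {"min_workers": 1, "max_workers": 4},
    # "instance_pool_id": "<pool-id>",  # uncomment to draw nodes from a pool
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # contains the new cluster_id
```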

Exploring notebooks

If you are familiar with Jupyter and IPython notebooks, then Databricks notebooks will look very familiar. A Databricks notebook development environment consists of cells where end users can interactively write code in R, Python, Scala, or SQL.

Databricks notebooks also have additional functionality such as integration with the Spark UI, powerful built-in visualizations, version control, and an integrated MLflow tracking server. We can also parameterize a notebook and pass parameters to it at execution time, as the following sketch shows.
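
Parameterization is done with notebook widgets. Here is a minimal sketch using the `dbutils.widgets` API available inside Databricks notebooks; the parameter name and default value are illustrative:

```python
# Declare a text widget; when the notebook is run as a job, a value for
# "input_date" can be supplied at execution time and overrides the default.
dbutils.widgets.text("input_date", "2023-01-01", "Input date")

# Read the widget's current value inside the notebook.
input_date = dbutils.widgets.get("input_date")
print(f"Processing data for {input_date}")
```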

We will cover notebooks in more detail, as the code examples presented in this book use the Databricks notebook environment. Additional details about notebooks can be found at https://docs.databricks.com/notebooks/index.html:

Figure 2.9 – Databricks notebooks

Next, let's take a look at the Data tab, which exposes the Databricks metastore.

Exploring data

By default, when a new workspace is deployed, it comes with a managed Hive metastore. A metastore allows you to register datasets in various formats, such as Comma-Separated Values (CSV), Parquet, Delta, text, or JavaScript Object Notation (JSON), as external tables (https://docs.databricks.com/data/metastores/index.html). We won't go into too much detail about the metastore here:

Figure 2.10 – The Data tab

It's all right if you are not familiar with the term metastore. In simple terms, it is similar to a relational database: relational databases organize data into databases, tables, and schemas, and the end user interacts with that data using SQL. Similarly, in Databricks, end users can register datasets stored in cloud storage so that they're available as tables. You can learn more here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/index...
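
As an illustration, the following sketch reads a CSV file from cloud storage and registers it as a table in the metastore; the file path, database, and table names are hypothetical, and `spark` and `display` are available by default in Databricks notebooks:

```python
# Create a database to hold the table (names here are hypothetical).
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Read a CSV file from cloud storage into a Spark DataFrame.
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/FileStore/raw/customers.csv")  # hypothetical path
)

# Register the DataFrame as a managed table in the metastore; it then
# appears on the Data tab and is queryable with SQL.
df.write.mode("overwrite").saveAsTable("demo_db.customers")

display(spark.sql("SELECT * FROM demo_db.customers LIMIT 5"))
```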

Exploring experiments

As the name suggests, experiments are the central location from which all the model training runs pertinent to a business problem can be accessed. Users can define their own name for an experiment or use a default system-generated one, and log their different ML model training runs under it. Experiments in the Databricks UI come from the integration of MLflow into the platform. We will dive deeper into MLflow in the coming chapters; for now, it's important to get a sense of what MLflow is and some MLflow-specific terminology.

MLflow is an open source platform for managing the end-to-end ML life cycle. Here are the key components of MLflow:

  • Tracking: This allows you to track experiments to record and compare parameters and results (see the sketch that follows this list).
  • Models: This component helps manage and deploy models from various ML libraries to a variety of model serving and inference platforms.
  • Projects: This allows you to package ML code in a reusable...
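
To make the Tracking component concrete, here is a minimal sketch of logging a run with the MLflow API; on Databricks, runs are recorded to the workspace tracking server automatically, and the experiment path, parameter, and metric below are all illustrative:

```python
import mlflow

# Hypothetical experiment path in the workspace; on Databricks, a
# notebook-scoped experiment is used by default if none is set.
mlflow.set_experiment("/Users/your.name@example.com/churn-experiment")

with mlflow.start_run(run_name="baseline"):
    # Illustrative parameter and metric values.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("accuracy", 0.87)
```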

Discovering the feature store

The feature store is a relatively new but stable addition to the Databricks ML workspace. Many organizations with mature ML processes, such as Uber, Facebook, and DoorDash, have implemented their own feature stores internally.

ML life cycle management and workflows are complex. Forbes reported on a survey of data scientists (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says) that found preparing and managing data to be the most expensive and time-consuming part of their day-to-day work.

Data scientists spend a lot of time finding data, cleaning it, performing exploratory data analysis (EDA), and then engineering features to train their ML models. This is an iterative process, and making it repeatable takes enormous effort. This is where feature stores come in.

Databricks Feature Store is standardized on the...
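
As a preview of the API we will use in the next chapter, here is a minimal sketch of registering a feature table with the Databricks `FeatureStoreClient`; the table name, primary key, and the `features_df` DataFrame are hypothetical:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Hypothetical feature DataFrame with a "customer_id" column that
# uniquely identifies each row; reusing the table from the Data section.
features_df = spark.table("demo_db.customers")

fs.create_table(
    name="ml_features.customer_features",  # hypothetical database.table
    primary_keys=["customer_id"],
    df=features_df,
    description="Per-customer features for churn prediction",
)
```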

Discovering the model registry

The Models tab is a fully managed, integrated MLflow model registry available in every deployed Databricks ML workspace. The registry has its own set of APIs and a UI that let data scientists across the organization collaborate on and fully manage MLflow models. Data scientists and ML engineers can develop models in any of the supported ML frameworks (https://mlflow.org/docs/latest/models.html#built-in-model-flavors) and package them in the generic MLflow model format:

Figure 2.15 – The Models tab

The model registry provides features to manage versioning, tagging, and stage transitions between environments (moving models from Staging to Production to Archived):

Figure 2.16 – The Registered Models tab
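
These operations can also be scripted. The following is a minimal sketch that registers a model logged by a previous training run and transitions it to the Staging stage; the run ID and model name are illustrative placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-from-a-tracked-training-run>"  # placeholder

# Register the model artifact logged at the "model" path of that run.
model_version = mlflow.register_model(f"runs:/{run_id}/model", "churn-classifier")

# Transition the new version to Staging (other stages: Production, Archived).
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Staging",
)
```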

Before we move on, there is another important feature that we need to understand: the Libraries feature of Databricks. This feature allows users to utilize third-party or custom...

Libraries

Libraries are fundamental building blocks of any programming ecosystem. They are akin to toolboxes, comprising pre-compiled routines that offer enhanced functionality and assist in optimizing code efficiency. In Databricks, libraries are used to make third-party or custom code available to notebooks and jobs running on clusters. These libraries can be written in various languages, including Python, Java, Scala, and R.
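
For Python packages, the most common pattern is the `%pip` magic command, which installs a notebook-scoped library on the attached cluster; the package and version here are illustrative:

```python
# Installs a notebook-scoped Python library on the attached cluster;
# the installation applies to this notebook's session only.
%pip install xgboost==1.7.6
```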

Storing libraries

When it comes to storage, libraries uploaded using the library UI are stored in the Databricks File System (DBFS) root. However, all workspace users can modify data and files stored in the DBFS root. If a more secure storage option is desired, you can opt to store libraries in cloud object storage, use library package repositories, or upload libraries to workspace files.

Managing libraries

Library management in Databricks can be handled via three different interfaces: the workspace UI, the command-line interface (CLI), or the Libraries...

Summary

In this chapter, we got a brief overview of all the components of the Databricks ML workspace. This will enable us to use these components in a hands-on fashion as we efficiently train and deploy ML models for various problems in the Databricks environment.

In the next chapter, we will start working on a customer churn prediction problem and register our first feature tables in the Databricks feature store.

Further reading

To learn more about the topics that were covered in this chapter, take a look at the following resources...
