
You're reading from Optimizing Databricks Workloads

Product type: Book
Published in: Dec 2021
Publisher: Packt
ISBN-13: 9781801819077
Edition: 1st
Authors (3):
Anirudh Kala

Anirudh Kala is an expert in machine learning techniques, artificial intelligence, and natural language processing. He has helped multiple organizations to run their large-scale data warehouses with quantitative research, natural language generation, data science exploration, and big data implementation. He has worked in every aspect of data analytics using the Azure data platform. Currently, he works as the director of Celebal Technologies, a data science boutique firm dedicated to large-scale analytics. Anirudh holds a computer engineering degree from the University of Rajasthan and his work history features the likes of IBM and ZS Associates.

Anshul Bhatnagar

Anshul Bhatnagar is an experienced, hands-on data architect involved in the architecture, design, and implementation of data platform architectures, and distributed systems. He has worked in the IT industry since 2015 in a range of roles such as Hadoop/Spark developer, data engineer, and data architect. He has also worked in many other sectors including energy, media, telecoms, and e-commerce. He is currently working for a data and AI boutique company, Celebal Technologies, in India. He is always keen to hear about new ideas and technologies in the areas of big data and AI, so look him up on LinkedIn to ask questions or just to say hi.

Sarthak Sarbahi

Sarthak Sarbahi is a certified data engineer and analyst with a wide technical breadth and a deep understanding of Databricks. His background has led him to a variety of cloud data services with an eye toward data warehousing, big data analytics, robust data engineering, data science, and business intelligence. Sarthak graduated with a degree in mechanical engineering.


Chapter 4: Managing Spark Clusters

A Spark cluster is arguably the most important entity in Azure Databricks. Although the underlying infrastructure is managed for us, we must still understand how to choose the right cluster configuration for a given environment.

In this chapter, we will learn about the best practices for managing Spark clusters so that we can optimize our workloads. We will also learn about the Databricks managed resource group, which will help us understand how Azure Databricks is provisioned.

We will then learn how to optimize Spark cluster costs using pools and spot instances. Finally, we will learn about the essential components of the Spark UI that can help us debug and optimize queries.

In this chapter, we will cover the following topics:

  • Designing Spark clusters
  • Learning about Databricks managed resource groups
  • Learning about Databricks Pools
  • Using spot instances
  • Following the Spark UI

Technical requirements

To follow the hands-on tutorials in this chapter, you will need the following:

  • An Azure subscription
  • Azure Databricks
  • Azure Databricks notebooks and a Spark cluster

The code samples for this chapter can be found at https://github.com/PacktPublishing/Optimizing-Databricks-Workload/tree/main/Chapter04.

Designing Spark clusters

Designing a Spark cluster essentially means choosing its configuration. Spark clusters in Databricks can be designed using the Compute section. Determining the right cluster configuration is very important for managing costs and performance across different types of workloads. For example, a cluster that's used concurrently by several data analysts might not be a good fit for structured streaming or machine learning workloads. Before we decide on a Spark cluster configuration, several questions need to be asked (see the configuration sketch after this list):

  • Who will be the primary user of the cluster? It could be a data engineer, data scientist, data analyst, or machine learning engineer.
  • What kind of workloads run on the cluster? It could be an Extract, Transform, and Load (ETL) process for a data engineer or exploratory data analysis for a data scientist. An ETL process could also be further divided into batch and streaming workloads.
  • What is the service-level agreement (SLA...
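To make these questions concrete, here is a minimal sketch, not taken from the book, of how two different sets of answers might translate into Databricks Clusters API payloads. The cluster names and sizes are illustrative assumptions, not recommendations:

# A shared, interactive analyst cluster: autoscaling absorbs bursty ad hoc queries
analyst_cluster = {
    "cluster_name": "shared-analyst-cluster",   # hypothetical name
    "spark_version": "8.3.x-scala2.12",         # Databricks Runtime 8.3
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30               # idle analyst clusters should not burn cost
}

# A streaming ETL cluster: fixed size for predictable latency, no auto-termination
streaming_cluster = {
    "cluster_name": "etl-streaming-cluster",    # hypothetical name
    "spark_version": "8.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,                           # fixed size; streaming jobs dislike rescaling churn
    "autotermination_minutes": 0                # 0 disables automatic termination
}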

Learning about Databricks managed resource groups

In this section, we will take a look at the managed resource group of a Databricks workspace. A managed resource group is a resource group that is automatically created by Azure when a Databricks workspace is created.

In the majority of cases, we do not need to do anything in a managed resource group. However, it is helpful to know the components that are created inside the managed resource group of an Azure Databricks workspace. This helps us understand how the Databricks workspace is functioning under the hood.

To start, let's create a new cluster with the following configuration:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1 and Scala 2.12)
  • Autoscaling: Disabled
  • Automatic termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Driver Type: Same as the worker

Let's wait for the cluster to spin up and don...
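For readers who prefer scripting over the UI, the same configuration can be submitted to the Clusters API. The following is a hedged sketch; the workspace URL, token, and cluster name are placeholders rather than values from the book:

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                 # placeholder

cluster_spec = {
    "cluster_name": "managed-rg-demo",         # hypothetical name
    "spark_version": "8.3.x-scala2.12",        # Runtime 8.3 (Spark 3.1.1, Scala 2.12)
    "node_type_id": "Standard_DS3_v2",         # worker type
    "driver_node_type_id": "Standard_DS3_v2",  # driver same as the worker
    "num_workers": 1,                          # autoscaling disabled
    "autotermination_minutes": 30              # terminate after 30 idle minutes
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec
)
print(response.json())  # returns {"cluster_id": "..."} on success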

Learning about Databricks Pools

In this section, we will dive deeper into Azure Databricks Pools. We will start by creating a pool and attaching a cluster to it, and then we will learn about the best practices for using Pools in Azure Databricks.

Creating a pool

To create a pool, head over to the Databricks workspace. Then, click on Compute, select Pools, and click on + Create Pool. This will open a page where we need to define the pool's configuration, as shown in the following screenshot:

Figure 4.7 – Creating a pool in Azure Databricks

Let's discuss the configurations one by one:

  • Name: We need to give the pool a suitable name.
  • Min Idle: This defines the minimum number of idle instances that the pool keeps ready at any given time. These instances do not auto-terminate; when a cluster consumes them, the pool provisions new idle instances to replace them.
  • Max Capacity: This defines the maximum number of instances that...
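These settings map directly onto the Instance Pools API. Below is a minimal, hypothetical payload (the name and numbers are illustrative assumptions) for POST /api/2.0/instance-pools/create:

pool_spec = {
    "instance_pool_name": "ds3v2-pool",           # hypothetical name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,                      # Min Idle: instances kept warm at all times
    "max_capacity": 10,                           # Max Capacity: hard ceiling on pool instances
    "idle_instance_autotermination_minutes": 15   # reclaim idle instances above Min Idle
}

A cluster is then attached to the pool by referencing the pool's ID (the instance_pool_id field) in its own configuration.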

Using spot instances

In this section, we will go through a quick tutorial on leveraging spot instances while creating our Databricks cluster. Let's begin by making a new cluster in the Databricks workspace. Create the following configuration for the cluster:

  • Cluster Name: DB_cluster_with_spot
  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1 and Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes
  • Worker Type: Standard_DS3_v2
  • Driver Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Enabled

It will take a few minutes for the cluster to start. The following screenshot shows that our cluster with spot instances has been initialized:

Figure 4.12 – Databricks cluster with spot instances

To confirm that we are using a cluster with spot instances, we can click on the JSON option to view the JSON structure of the cluster. We can find this option on the same...
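In that JSON, the spot-related settings appear under azure_attributes. The following is a hedged sketch of what this fragment typically looks like; the exact values depend on your workspace:

azure_attributes = {
    "first_on_demand": 1,                        # keep the driver on a regular on-demand VM
    "availability": "SPOT_WITH_FALLBACK_AZURE",  # use spot VMs; fall back to on-demand if evicted
    "spot_bid_max_price": -1                     # -1 means pay up to the current on-demand price
}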

Following the Spark UI

The Spark UI is a web-based user interface that's used to monitor Spark jobs and is very helpful for optimizing workloads. In this section, we will learn about the major components of the Spark UI. To begin with, let's create a new Databricks cluster with the following configuration:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Disabled
  • Driver Type: Standard_DS3_v2

Once the cluster has started, create a new Databricks Python notebook. Next, let's run the following code block in a new cell:

from pyspark.sql.functions import *
# Define the schema for reading the stream
schema = "time STRING, action STRING"
# Create a streaming DataFrame
stream_read = (spark
       ...
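The cell is truncated here, but one plausible completion, assuming the Databricks structured-streaming events sample dataset (its JSON files match the time/action schema above), looks like this. The path and options are assumptions rather than the book's exact code:

stream_read = (spark
    .readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)  # one file per micro-batch, producing a steady stream of jobs
    .json("/databricks-datasets/structured-streaming/events/")
)

# A small aggregation gives the Spark UI jobs, stages, and tasks to inspect
action_counts = stream_read.groupBy("action").count()

display(action_counts)  # display() renders streaming results in Databricks notebooks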

Summary

In this chapter, we learned about designing Spark clusters for different types of workloads. We also learned about Databricks Pools, spot instances, and the Spark UI. These features help reduce costs and make Spark jobs more efficient when applied to the right kind of workload. You should now be more confident in deciding on the correct cluster configuration for a given type of workload, informed by useful features such as spot instances, pools, and the Spark UI.

In the next chapter, we will dive into the Databricks optimization techniques concerning Spark DataFrames. We will learn about several techniques and their applications in different scenarios.

