
You're reading from Optimizing Databricks Workloads

Product type: Book
Published in: Dec 2021
Publisher: Packt
ISBN-13: 9781801819077
Edition: 1st
Authors (3):
Anirudh Kala

Anirudh Kala is an expert in machine learning techniques, artificial intelligence, and natural language processing. He has helped multiple organizations to run their large-scale data warehouses with quantitative research, natural language generation, data science exploration, and big data implementation. He has worked in every aspect of data analytics using the Azure data platform. Currently, he works as the director of Celebal Technologies, a data science boutique firm dedicated to large-scale analytics. Anirudh holds a computer engineering degree from the University of Rajasthan and his work history features the likes of IBM and ZS Associates.

Anshul Bhatnagar

Anshul Bhatnagar is an experienced, hands-on data architect involved in the architecture, design, and implementation of data platform architectures, and distributed systems. He has worked in the IT industry since 2015 in a range of roles such as Hadoop/Spark developer, data engineer, and data architect. He has also worked in many other sectors including energy, media, telecoms, and e-commerce. He is currently working for a data and AI boutique company, Celebal Technologies, in India. He is always keen to hear about new ideas and technologies in the areas of big data and AI, so look him up on LinkedIn to ask questions or just to say hi.

Sarthak Sarbahi

Sarthak Sarbahi is a certified data engineer and analyst with a wide technical breadth and a deep understanding of Databricks. His background has led him to a variety of cloud data services with an eye toward data warehousing, big data analytics, robust data engineering, data science, and business intelligence. Sarthak graduated with a degree in mechanical engineering.


Chapter 4: Managing Spark Clusters

A Spark cluster is arguably the most important entity in Azure Databricks. Although the underlying infrastructure is managed for us, we must still understand how to choose the right cluster configuration for a given environment.

In this chapter, we will learn about the best practices for managing Spark clusters so that we can optimize our workloads. We will also learn about the Databricks managed resource group, which will help us understand how Azure Databricks is provisioned.

We will then learn how to optimize Spark cluster costs using pools and spot instances. Finally, we will learn about the essential components of the Spark UI that can help us debug and optimize queries.

In this chapter, we will cover the following topics:

  • Designing Spark clusters
  • Learning about Databricks managed resource groups
  • Learning about Databricks Pools
  • Using spot instances
  • Following the Spark UI

Technical requirements

To follow the hands-on tutorials in this chapter, you will need the following:

  • An Azure subscription
  • Azure Databricks
  • Azure Databricks notebooks and a Spark cluster

The code samples for this chapter can be found at https://github.com/PacktPublishing/Optimizing-Databricks-Workload/tree/main/Chapter04.

Designing Spark clusters

Designing a Spark cluster essentially means choosing its configuration. Spark clusters in Databricks can be designed using the Compute section. Determining the right cluster configuration is very important for managing costs and performance across different types of workloads. For example, a cluster that's used concurrently by several data analysts might not be a good fit for structured streaming or machine learning workloads. Before we decide on a Spark cluster configuration, several questions need to be asked (see the configuration sketch after this list):

  • Who will be the primary user of the cluster? It could be a data engineer, data scientist, data analyst, or machine learning engineer.
  • What kind of workloads run on the cluster? It could be an Extract, Transform, and Load (ETL) process for a data engineer or exploratory data analysis for a data scientist. An ETL process could also be further divided into batch and streaming workloads.
  • What is the service-level agreement (SLA...
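To make these questions concrete, here is a minimal sketch, not taken from the book, of how two different sets of answers might translate into Databricks Clusters API payloads. The cluster names and sizes are illustrative assumptions, not recommendations:

# A shared, interactive analyst cluster: autoscaling absorbs bursty ad hoc queries
analyst_cluster = {
    "cluster_name": "shared-analyst-cluster",   # hypothetical name
    "spark_version": "8.3.x-scala2.12",         # Databricks Runtime 8.3
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30               # idle analyst clusters should not burn cost
}

# A streaming ETL cluster: fixed size for predictable latency, no auto-termination
streaming_cluster = {
    "cluster_name": "etl-streaming-cluster",    # hypothetical name
    "spark_version": "8.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,                           # fixed size; streaming jobs dislike rescaling churn
    "autotermination_minutes": 0                # 0 disables automatic termination
}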

Learning about Databricks managed resource groups

In this section, we will take a look at the managed resource group of a Databricks workspace. A managed resource group is a resource group that is automatically created by Azure when a Databricks workspace is created.

In the majority of cases, we do not need to do anything in a managed resource group. However, it is helpful to know the components that are created inside the managed resource group of an Azure Databricks workspace. This helps us understand how the Databricks workspace is functioning under the hood.

To start, let's create a new cluster with the following configuration:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1 and Scala 2.12)
  • Autoscaling: Disabled
  • Automatic termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Driver Type: Same as the worker

Let's wait for the cluster to spin up and don...
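For readers who prefer scripting over the UI, the same configuration can be submitted to the Clusters API. The following is a hedged sketch; the workspace URL, token, and cluster name are placeholders rather than values from the book:

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                 # placeholder

cluster_spec = {
    "cluster_name": "managed-rg-demo",         # hypothetical name
    "spark_version": "8.3.x-scala2.12",        # Runtime 8.3 (Spark 3.1.1, Scala 2.12)
    "node_type_id": "Standard_DS3_v2",         # worker type
    "driver_node_type_id": "Standard_DS3_v2",  # driver same as the worker
    "num_workers": 1,                          # autoscaling disabled
    "autotermination_minutes": 30              # terminate after 30 idle minutes
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec
)
print(response.json())  # returns {"cluster_id": "..."} on success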

Learning about Databricks Pools

In this section, we will dive deeper into Azure Databricks Pools. We will start by creating a pool and attaching a cluster to it, and then we will learn about the best practices for using Pools in Azure Databricks.

Creating a pool

To create a pool, head over to the Databricks workspace. Then, click on Compute, select Pools, and click on + Create Pool. This will open a page where we need to define the pool's configuration, as shown in the following screenshot:

Figure 4.7 – Creating a pool in Azure Databricks

Let's discuss the configurations one by one:

  • Name: We need to give the pool a suitable name.
  • Min Idle: This defines the minimum number of idle instances that the pool keeps ready at any given time. These instances do not auto-terminate; when a cluster consumes them, the pool provisions new idle instances to replace them.
  • Max Capacity: This defines the maximum number of instances that...
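These settings map directly onto the Instance Pools API. Below is a minimal, hypothetical payload (the name and numbers are illustrative assumptions) for POST /api/2.0/instance-pools/create:

pool_spec = {
    "instance_pool_name": "ds3v2-pool",           # hypothetical name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,                      # Min Idle: instances kept warm at all times
    "max_capacity": 10,                           # Max Capacity: hard ceiling on pool instances
    "idle_instance_autotermination_minutes": 15   # reclaim idle instances above Min Idle
}

A cluster is then attached to the pool by referencing the pool's ID (the instance_pool_id field) in its own configuration.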

Using spot instances

In this section, we will go through a quick tutorial on leveraging spot instances while creating our Databricks cluster. Let's begin by making a new cluster in the Databricks workspace. Create the following configuration for the cluster:

  • Cluster Name: DB_cluster_with_spot
  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1 and Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes
  • Worker Type: Standard_DS3_v2
  • Driver Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Enabled

It will take a few minutes for the cluster to start. The following screenshot shows that our cluster with spot instances has been initialized:

Figure 4.12 – Databricks cluster with spot instances

To confirm that we are using a cluster with spot instances, we can click on the JSON option to view the JSON structure of the cluster. We can find this option on the same...
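In that JSON, the spot-related settings appear under azure_attributes. The following is a hedged sketch of what this fragment typically looks like; the exact values depend on your workspace:

azure_attributes = {
    "first_on_demand": 1,                        # keep the driver on a regular on-demand VM
    "availability": "SPOT_WITH_FALLBACK_AZURE",  # use spot VMs; fall back to on-demand if evicted
    "spot_bid_max_price": -1                     # -1 means pay up to the current on-demand price
}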

Following the Spark UI

The Spark UI is a web-based user interface that's used to monitor Spark jobs and is very helpful for optimizing workloads. In this section, we will learn about the major components of the Spark UI. To begin with, let's create a new Databricks cluster with the following configuration:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Disabled
  • Driver Type: Standard_DS3_v2

Once the cluster has started, create a new Databricks Python notebook. Next, let's run the following code block in a new cell:

from pyspark.sql.functions import *
# Define the schema for reading the stream
schema = "time STRING, action STRING"
# Create a streaming DataFrame
stream_read = (spark
       ...
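The cell is truncated here, but one plausible completion, assuming the Databricks structured-streaming events sample dataset (its JSON files match the time/action schema above), looks like this. The path and options are assumptions rather than the book's exact code:

stream_read = (spark
    .readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)  # one file per micro-batch, producing a steady stream of jobs
    .json("/databricks-datasets/structured-streaming/events/")
)

# A small aggregation gives the Spark UI jobs, stages, and tasks to inspect
action_counts = stream_read.groupBy("action").count()

display(action_counts)  # display() renders streaming results in Databricks notebooks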

Summary

In this chapter, we learned about designing Spark clusters for different types of workloads. We also learned about Databricks Pools, spot instances, and the Spark UI. These features help reduce costs and make Spark jobs more efficient when applied to the right kind of workload. You should now be more confident in deciding on the correct cluster configuration for a given type of workload, informed by useful features such as spot instances, pools, and the Spark UI.

In the next chapter, we will dive into the Databricks optimization techniques concerning Spark DataFrames. We will learn about several techniques and their applications in different scenarios.

