You're reading from Azure Databricks Cookbook
Published in Sep 2021 by Packt | ISBN-13: 9781789809718 | 1st Edition

Authors (2): Phani Raj and Vinod Jaiswal

Phani Raj is an experienced data architect and product manager with 15 years of experience working with customers on building data platforms, both on-premises and in the cloud. He has designed and implemented large-scale big data solutions for customers across different verticals. His passion for continuous learning and adapting to the dynamic nature of technology underscores his role as a trusted advisor in the realm of data architecture, data science, and product management.

Vinod Jaiswal is an experienced data engineer who excels in transforming raw data into valuable insights. With over 8 years of experience with Databricks, he designs and implements data pipelines, optimizes workflows, and crafts scalable solutions for intricate data challenges. Collaborating seamlessly with diverse teams, Vinod empowers them with the tools and expertise to leverage data effectively. His dedication to staying up to date on the latest data engineering trends ensures cutting-edge, robust solutions. Apart from his technical prowess, Vinod is a proficient educator. Through presentations and mentoring, he shares his expertise, enabling others to harness the power of data within the Databricks ecosystem.

Chapter 3: Understanding Spark Query Execution

To write efficient Spark applications, we need some understanding of how Spark executes queries. A good grasp of query execution helps big data developers and engineers work efficiently with large volumes of data.

Query execution is a very broad subject, and, in this chapter, we will start by understanding jobs, stages, and tasks. Then, we will learn how Spark lazy evaluation works. Following this, we will learn how to check and understand the execution plan when working with DataFrames or SparkSQL. Later, we will learn how joins work in Spark and the different types of join algorithms Spark uses while joining two tables. Finally, we will learn about the input, output, and shuffle partitions and the storage benefits of using different file formats.

Knowing about the internals will help you troubleshoot and debug your Spark applications more efficiently. By the end of this chapter, you will know...

Technical requirements

To follow along with the examples in this chapter, you will need to do the following:

Introduction to jobs, stages, and tasks

In this recipe, you will learn how Spark breaks down an application into jobs, stages, and tasks. You will also learn how to view directed acyclic graphs (DAGs) and how pipelining works in Spark query execution.

By the end of this recipe, you will have learned how to inspect the DAG created for a query you have executed and how to look at the jobs, stages, and tasks associated with it.
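
To make this concrete, here is a minimal sketch (the mount path and column name are illustrative placeholders, not the notebook's exact code) that triggers a job you can then inspect in the Spark UI:

```python
# Placeholder path to CSV files on a mounted ADLS Gen-2 location.
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# groupBy() is a wide transformation, so the action below typically runs as
# one job with two stages: one that reads the files and computes partial
# counts, and one that aggregates the shuffled results. Each stage is split
# into one task per partition.
df.groupBy("C_MKTSEGMENT").count().show()
```

After running an action like this, the Spark UI shows the resulting job, its stages, their tasks, and the associated DAG.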

Getting ready

You can follow along by running the steps in the 3-1.Introduction to Jobs, Stages, and Tasks notebook. This can be found in your local cloned repository, in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03). Follow these steps before running the notebook:

  1. Mount your ADLS Gen-2 account by following the steps mentioned in the Mounting Azure Data Lake Storage (ADLS) Gen-2 and Azure Blob Storage to the Azure Databricks filesystem recipe of Chapter 2, Reading and...
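
   For reference, a minimal sketch of such a mount using a service principal is shown below; every value in angle brackets is a placeholder, and the exact steps (including creating the secret scope) are covered in the Chapter 2 recipe referenced above:

```python
# All values in angle brackets are placeholders for your own environment.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen-2 container so notebooks can read it via /mnt/rawdata.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/rawdata",
    extra_configs=configs,
)
```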

Checking the execution details of all the executed Spark queries via the Spark UI

In this recipe, you will learn how to view the status of all the running applications in your cluster. You will also learn what to look at in the Spark UI to identify whether there are any issues with a specific application or query. This is useful because it gives you a holistic view of how your cluster is being utilized in terms of task distribution and how your applications are running.

Getting ready

Execute the queries shown in the Introduction to jobs, stages, and tasks recipe of this chapter. You can use either a Spark 2.x (latest version) or a Spark 3.x cluster.

How to do it…

Follow these steps to learn about the running applications/queries in your cluster:

  1. When you are in your Databricks workspace, click on the Clusters option and then on the cluster that you are using. Then, click on the Spark UI tab, as shown in the following screenshot:

    Figure 3.10 – Spark UI screen...

Deep diving into schema inference

In this recipe, you will learn about the benefits of explicitly specifying a schema while reading any file format data from an ADLS Gen-2 or Azure Blob storage account.

By the end of this recipe, you will have learned how Spark executes a query when a schema is inferred versus explicitly specified.
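
As a preview, the contrast looks roughly like this (the path and column names below are illustrative placeholders, not the exact ones used in the notebook):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Inferred schema: Spark runs an extra job that scans the files just to
# guess the column types before the query itself executes.
df_inferred = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                             header=True, inferSchema=True)

# Explicit schema: the inference scan is skipped entirely, so the read is cheaper.
customer_schema = StructType([
    StructField("C_CUSTKEY", IntegerType(), True),
    StructField("C_NAME", StringType(), True),
    StructField("C_MKTSEGMENT", StringType(), True),
])
df_explicit = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                             header=True, schema=customer_schema)

df_explicit.printSchema()
```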

Getting ready

You need to ensure you have done the following before you start working on this recipe:

How to do it…

You can...

Looking into the query execution plan

It's important to understand the execution plan and how to view its different stages when the Spark optimizer executes a query using the DataFrame or SparkSQL API.

In this recipe, we will learn how the logical and physical plans are generated and the different stages involved. By the end of this recipe, you will have generated an execution plan using either the DataFrame API or the SparkSQL API and will have a fair understanding of the different stages involved.
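
A minimal sketch of generating the plans with both APIs looks like this (the path and column name are placeholders, not the notebook's exact code):

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# DataFrame API: prints the parsed and analyzed logical plans, the optimized
# logical plan, and the physical plan.
df.groupBy("C_MKTSEGMENT").count().explain(extended=True)

# SparkSQL API: EXPLAIN EXTENDED returns the same information for a SQL query.
df.createOrReplaceTempView("customer")
spark.sql("""
    EXPLAIN EXTENDED
    SELECT C_MKTSEGMENT, COUNT(*) AS cnt
    FROM customer
    GROUP BY C_MKTSEGMENT
""").show(truncate=False)
```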

Getting ready

You can follow along by running the steps in the 3-4.Query Execution Plan notebook in your local cloned repository, which can be found in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).

Upload the csvFiles folder in the Common/Customer folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Common/Customer/csvFiles) to your ADLS Gen-2 account. This can be found in the rawdata filesystem, inside...

How joins work in Spark 

In this recipe, you will learn how joins are executed by the Spark optimizer using different join algorithms, such as SortMerge and BroadcastHash joins. You will learn how to identify which algorithm has been used by looking at the DAG that Spark generates. You will also learn how to use hints in queries to influence the optimizer to use a specific join algorithm.
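
The following sketch shows the idea (the paths and key columns are placeholders); comparing the two explain() outputs reveals whether a SortMergeJoin or a BroadcastHashJoin node was chosen:

```python
from pyspark.sql.functions import broadcast

customer = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                          header=True, inferSchema=True)
orders = spark.read.csv("/mnt/rawdata/Orders/csvFiles",
                        header=True, inferSchema=True)

# With two large inputs, the optimizer usually chooses a sort-merge join.
orders.join(customer,
            orders["O_CUSTKEY"] == customer["C_CUSTKEY"]).explain()

# A broadcast hint tells the optimizer the customer side is small enough to
# copy to every executor, steering it toward a broadcast hash join.
orders.join(broadcast(customer),
            orders["O_CUSTKEY"] == customer["C_CUSTKEY"]).explain()
```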

Getting ready

To follow along with this recipe, run the cells in the 3-5.Joins notebook, which you can find in your local cloned repository, in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).

Upload the csvFiles folders, which can be found in the Common/Customer and Common/Orders folders in your local cloned repository, to the ADLS Gen-2 account in the rawdata filesystem. You will need to create two folders called Customer and Orders in the rawdata filesystem:

Figure...

Learning about input partitions 

Partitions are subsets of the data held in memory or storage. Spark relies on partitioning more heavily than Hive or traditional SQL databases; it uses partitions for parallel processing and to get the maximum performance out of a cluster.

Spark and Hive partitions are different: Spark partitions data in memory for processing, whereas Hive partitions data on storage. In this and the next two recipes, we will cover three kinds of partitions: input, shuffle, and output partitions.

Let's start by looking at input partitions.

Getting ready

Apache Spark has a layered architecture: the driver node communicates with the worker nodes to get the job done, and all the data processing happens on the worker nodes. When a job is submitted for processing, each data partition is sent to a specific executor, and each executor processes one partition at a time. Hence, the time it takes each executor to process data is directly proportional to the size and number of partitions. The more...
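
As a rough illustration (the path is a placeholder), the number of input partitions for a file-based read is driven mainly by the file sizes and the spark.sql.files.maxPartitionBytes setting:

```python
# Default is 128 MB per input partition; lowering it produces more, smaller
# partitions (and therefore more tasks) for the same files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)  # 32 MB

df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Each of these input partitions becomes a task on some executor.
print(df.rdd.getNumPartitions())
```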

Learning about output partitions 

Saving data partitioned on the right column can significantly boost performance when you're reading and retrieving it for further processing.

Reading the required partition limits the number of files and partitions that Spark reads while querying data. It also helps with dynamic partition pruning.

But too much optimization can sometimes make things worse. For example, if you have too many partitions, the data is scattered across many small files, so searching the data for particular conditions in the initial query can take time. Memory utilization is also higher while processing the partition metadata, as it tracks a large number of partitions.

While saving in-memory data to disk, you must consider partition sizes, as Spark produces one file per task. Let's consider a scenario: if the cluster configuration has more memory for processing the dataframe and saving it as larger partition sizes, then processing the same data...
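
As a simple sketch of the two levers discussed here (the paths and partition column are placeholders): repartition() controls how many files each write produces, while partitionBy() lays the data out in folders that later reads can prune:

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Spark writes roughly one file per partition per write task, so
# repartition(8) yields about eight output files.
(df.repartition(8)
   .write.mode("overwrite")
   .parquet("/mnt/rawdata/output/customer_parquet"))

# Partitioning the output by a column creates one folder per value, so a
# later query filtering on that column only reads the folders it needs.
(df.write.mode("overwrite")
   .partitionBy("C_MKTSEGMENT")
   .parquet("/mnt/rawdata/output/customer_by_segment"))
```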

Learning about shuffle partitions

In this recipe, you will learn how to set the spark.sql.shuffle.partitions parameter and see the impact it has on performance when there are fewer partitions.

Most of the time, in the case of wide transformations, where data is required from other partitions, Spark performs a data shuffle. Unfortunately, you can't avoid such transformations, but you can configure parameters to reduce their impact on performance.

Wide transformations use shuffle partitions to shuffle data. However, irrespective of the data's size or the number of executors, the number of shuffle partitions is set to 200 by default.

The data shuffle procedure is triggered by data transformations such as join(), groupByKey(), reduceByKey(), and so on. The spark.sql.shuffle.partitions configuration sets the number of partitions to use during this shuffling and, as mentioned, defaults to 200.
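
A minimal sketch of adjusting this setting (the path and column are placeholders):

```python
# The default of 200 shuffle partitions applies regardless of data size.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# For a small dataset, 200 partitions means hundreds of tiny tasks; lowering
# the value reduces that scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", 16)

df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)
agg = df.groupBy("C_MKTSEGMENT").count()   # wide transformation -> shuffle

# Roughly 16 partitions now back the shuffled result (adaptive query
# execution, if enabled, may coalesce them further).
print(agg.rdd.getNumPartitions())
```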

Getting ready

You can...

Storage benefits of different file types

Storage formats are a way to define how data is stored in a file. Hadoop doesn't have a default file format, but it supports multiple file formats for storing data. Some of the common storage formats for Hadoop are as follows:

  • Text files
  • Sequence files
  • Parquet files
  • Record-columnar (RC) files
  • Optimized row columnar (ORC) files
  • Avro files

Choosing a write file format will provide significant advantages, such as the following:

  • Optimized performance while reading and writing data
  • Schema evolution support (allows us to change the attributes in a dataset)
  • Higher compression, resulting in less storage space being required
  • Splittable files (files can be read in parts)

Let's focus on columnar storage formats as they are widely used in big data applications because of how they store data and can be queried by the SQL engine. The columnar format is very useful when a subset of data...
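
To see the difference in practice, a small sketch like the following (paths and columns are placeholders) writes the same data as CSV and as Parquet; the Parquet copy is typically much smaller on disk, and selecting a few columns from it only scans those columns:

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Row-oriented text output.
df.write.mode("overwrite").csv("/mnt/rawdata/formats/customer_csv", header=True)

# Columnar, compressed output.
df.write.mode("overwrite").parquet("/mnt/rawdata/formats/customer_parquet")

# Reading a subset of columns from Parquet only touches those column chunks.
spark.read.parquet("/mnt/rawdata/formats/customer_parquet") \
     .select("C_CUSTKEY", "C_MKTSEGMENT") \
     .show(5)
```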
