You're reading from Azure Databricks Cookbook
Published in Sep 2021 by Packt | ISBN-13: 9781789809718 | 1st Edition

Authors (2): Phani Raj and Vinod Jaiswal

Phani Raj is an experienced data architect and product manager with 15 years of experience working with customers on building data platforms, both on-premises and in the cloud. He has designed and implemented large-scale big data solutions for customers across different verticals. His passion for continuous learning and adapting to the dynamic nature of technology underscores his role as a trusted advisor in the realm of data architecture, data science, and product management.

Vinod Jaiswal is an experienced data engineer who excels in transforming raw data into valuable insights. With over 8 years of experience with Databricks, he designs and implements data pipelines, optimizes workflows, and crafts scalable solutions for intricate data challenges. Collaborating seamlessly with diverse teams, Vinod empowers them with the tools and expertise to leverage data effectively. His dedication to staying up to date on the latest data engineering trends ensures cutting-edge, robust solutions. Apart from his technical prowess, Vinod is a proficient educator. Through presentations and mentoring, he shares his expertise, enabling others to harness the power of data within the Databricks ecosystem.

Chapter 3: Understanding Spark Query Execution

To write efficient Spark applications, we need some understanding of how Spark executes queries. A good grasp of query execution helps big data developers and engineers work efficiently with large volumes of data.

Query execution is a very broad subject, and, in this chapter, we will start by understanding jobs, stages, and tasks. Then, we will learn how Spark lazy evaluation works. Following this, we will learn how to check and understand the execution plan when working with DataFrames or SparkSQL. Later, we will learn how joins work in Spark and the different types of join algorithms Spark uses while joining two tables. Finally, we will learn about the input, output, and shuffle partitions and the storage benefits of using different file formats.

Knowing about the internals will help you troubleshoot and debug your Spark applications more efficiently. By the end of this chapter, you will know...

Technical requirements

To follow along with the examples in this chapter, you will need to do the following:

Introduction to jobs, stages, and tasks

In this recipe, you will learn how Spark breaks down an application into jobs, stages, and tasks. You will also learn how to view directed acyclic graphs (DAGs) and how pipelining works in Spark query execution.

By the end of this recipe, you will have learned how to inspect the DAG created for a query you have executed and how to look at the jobs, stages, and tasks associated with it.
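
To make this concrete, here is a minimal sketch (the mount path and column name are illustrative placeholders, not the notebook's exact code) that triggers a job you can then inspect in the Spark UI:

```python
# Placeholder path to CSV files on a mounted ADLS Gen-2 location.
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# groupBy() is a wide transformation, so the action below typically runs as
# one job with two stages: one that reads the files and computes partial
# counts, and one that aggregates the shuffled results. Each stage is split
# into one task per partition.
df.groupBy("C_MKTSEGMENT").count().show()
```

After running an action like this, the Spark UI shows the resulting job, its stages, their tasks, and the associated DAG.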

Getting ready

You can follow along by running the steps in the 3-1.Introduction to Jobs, Stages, and Tasks notebook. This can be found in your local cloned repository, in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03). Follow these steps before running the notebook:

  1. Mount your ADLS Gen-2 account by following the steps mentioned in the Mounting Azure Data Lake Storage (ADLS) Gen-2 and Azure Blob Storage to the Azure Databricks filesystem recipe of Chapter 2, Reading and...
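
   For reference, a minimal sketch of such a mount using a service principal is shown below; every value in angle brackets is a placeholder, and the exact steps (including creating the secret scope) are covered in the Chapter 2 recipe referenced above:

```python
# All values in angle brackets are placeholders for your own environment.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen-2 container so notebooks can read it via /mnt/rawdata.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/rawdata",
    extra_configs=configs,
)
```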

Checking the execution details of all the executed Spark queries via the Spark UI

In this recipe, you will learn how to view the status of all the running applications in your cluster. You will also learn what to look at in the Spark UI to identify whether there are any issues with a specific application or query. This is useful because it gives you a holistic view of how your cluster is being utilized in terms of task distribution and how your applications are running.

Getting ready

Execute the queries shown in the Introduction to jobs, stages, and tasks recipe of this chapter. You can use either a Spark 2.x (latest version) or a Spark 3.x cluster.

How to do it…

Follow these steps to learn about the running applications/queries in your cluster:

  1. When you are in your Databricks workspace, click on the Clusters option and then on the cluster that you are using. Then, click on the Spark UI tab, as shown in the following screenshot:

    Figure 3.10 – Spark UI screen...

Deep diving into schema inference

In this recipe, you will learn about the benefits of explicitly specifying a schema while reading any file format data from an ADLS Gen-2 or Azure Blob storage account.

By the end of this recipe, you will have learned how Spark executes a query when a schema is inferred versus explicitly specified.
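
As a preview, the contrast looks roughly like this (the path and column names below are illustrative placeholders, not the exact ones used in the notebook):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Inferred schema: Spark runs an extra job that scans the files just to
# guess the column types before the query itself executes.
df_inferred = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                             header=True, inferSchema=True)

# Explicit schema: the inference scan is skipped entirely, so the read is cheaper.
customer_schema = StructType([
    StructField("C_CUSTKEY", IntegerType(), True),
    StructField("C_NAME", StringType(), True),
    StructField("C_MKTSEGMENT", StringType(), True),
])
df_explicit = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                             header=True, schema=customer_schema)

df_explicit.printSchema()
```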

Getting ready

You need to ensure you have done the following before you start working on this recipe:

How to do it…

You can...

Looking into the query execution plan

It's important to understand the execution plan and how to view its different stages when the Spark optimizer executes a query using the DataFrame or SparkSQL API.

In this recipe, we will learn how the logical and physical plans are generated and the different stages involved. By the end of this recipe, you will have generated an execution plan using either the DataFrame API or the SparkSQL API and will have a fair understanding of the different stages involved.
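
A minimal sketch of generating the plans with both APIs looks like this (the path and column name are placeholders, not the notebook's exact code):

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# DataFrame API: prints the parsed and analyzed logical plans, the optimized
# logical plan, and the physical plan.
df.groupBy("C_MKTSEGMENT").count().explain(extended=True)

# SparkSQL API: EXPLAIN EXTENDED returns the same information for a SQL query.
df.createOrReplaceTempView("customer")
spark.sql("""
    EXPLAIN EXTENDED
    SELECT C_MKTSEGMENT, COUNT(*) AS cnt
    FROM customer
    GROUP BY C_MKTSEGMENT
""").show(truncate=False)
```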

Getting ready

You can follow along by running the steps in the 3-4.Query Execution Plan notebook in your local cloned repository, which can be found in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).

Upload the csvFiles folder in the Common/Customer folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Common/Customer/csvFiles) to your ADLS Gen-2 account. This can be found in the rawdata filesystem, inside...

How joins work in Spark 

In this recipe, you will learn how joins are executed by the Spark optimizer using different join algorithms, such as SortMerge and BroadcastHash joins. You will learn how to identify which algorithm has been used by looking at the DAG that Spark generates. You will also learn how to use hints in queries to influence the optimizer to use a specific join algorithm.
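
The following sketch shows the idea (the paths and key columns are placeholders); comparing the two explain() outputs reveals whether a SortMergeJoin or a BroadcastHashJoin node was chosen:

```python
from pyspark.sql.functions import broadcast

customer = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                          header=True, inferSchema=True)
orders = spark.read.csv("/mnt/rawdata/Orders/csvFiles",
                        header=True, inferSchema=True)

# With two large inputs, the optimizer usually chooses a sort-merge join.
orders.join(customer,
            orders["O_CUSTKEY"] == customer["C_CUSTKEY"]).explain()

# A broadcast hint tells the optimizer the customer side is small enough to
# copy to every executor, steering it toward a broadcast hash join.
orders.join(broadcast(customer),
            orders["O_CUSTKEY"] == customer["C_CUSTKEY"]).explain()
```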

Getting ready

To follow along with this recipe, run the cells in the 3-5.Joins notebook, which you can find in your local cloned repository, in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).

Upload the csvFiles folders, which can be found in the Common/Customer and Common/Orders folders in your local cloned repository, to the ADLS Gen-2 account in the rawdata filesystem. You will need to create two folders called Customer and Orders in the rawdata filesystem:

Figure...

Learning about input partitions 

Partitions are subsets of the data held in memory or storage. Spark relies on partitioning more heavily than Hive or traditional SQL databases; it uses partitions for parallel processing and to get the maximum performance out of a cluster.

Spark and Hive partitions are different: Spark partitions data in memory for processing, whereas Hive partitions data on storage. In this and the next two recipes, we will cover three kinds of partitions: input, shuffle, and output partitions.

Let's start by looking at input partitions.

Getting ready

Apache Spark has a layered architecture: the driver node communicates with the worker nodes to get the job done, and all the data processing happens on the worker nodes. When a job is submitted for processing, each data partition is sent to a specific executor, and each executor processes one partition at a time. Hence, the time it takes each executor to process data is directly proportional to the size and number of partitions. The more...
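
As a rough illustration (the path is a placeholder), the number of input partitions for a file-based read is driven mainly by the file sizes and the spark.sql.files.maxPartitionBytes setting:

```python
# Default is 128 MB per input partition; lowering it produces more, smaller
# partitions (and therefore more tasks) for the same files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)  # 32 MB

df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Each of these input partitions becomes a task on some executor.
print(df.rdd.getNumPartitions())
```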

Learning about output partitions 

Saving data partitioned on the right column can significantly boost performance when you're reading and retrieving it for further processing.

Reading the required partition limits the number of files and partitions that Spark reads while querying data. It also helps with dynamic partition pruning.

But too much optimization can sometimes make things worse. For example, if you have too many partitions, the data is scattered across many small files, so searching the data for particular conditions in the initial query can take time. Memory utilization is also higher while processing the partition metadata, as it tracks a large number of partitions.

While saving in-memory data to disk, you must consider partition sizes, as Spark produces one file per task. Let's consider a scenario: if the cluster configuration has more memory for processing the dataframe and saving it as larger partition sizes, then processing the same data...
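
As a simple sketch of the two levers discussed here (the paths and partition column are placeholders): repartition() controls how many files each write produces, while partitionBy() lays the data out in folders that later reads can prune:

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Spark writes roughly one file per partition per write task, so
# repartition(8) yields about eight output files.
(df.repartition(8)
   .write.mode("overwrite")
   .parquet("/mnt/rawdata/output/customer_parquet"))

# Partitioning the output by a column creates one folder per value, so a
# later query filtering on that column only reads the folders it needs.
(df.write.mode("overwrite")
   .partitionBy("C_MKTSEGMENT")
   .parquet("/mnt/rawdata/output/customer_by_segment"))
```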

Learning about shuffle partitions

In this recipe, you will learn how to set the spark.sql.shuffle.partitions parameter and see the impact it has on performance when there are fewer partitions.

Most of the time, in the case of wide transformations, where data is required from other partitions, Spark performs a data shuffle. Unfortunately, you can't avoid such transformations, but you can configure parameters to reduce their impact on performance.

Wide transformations use shuffle partitions to shuffle data. However, irrespective of the data's size or the number of executors, the number of shuffle partitions is set to 200 by default.

The data shuffle procedure is triggered by data transformations such as join(), groupByKey(), reduceByKey(), and so on. The spark.sql.shuffle.partitions configuration sets the number of partitions to use during this shuffling and, as mentioned, defaults to 200.
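
A minimal sketch of adjusting this setting (the path and column are placeholders):

```python
# The default of 200 shuffle partitions applies regardless of data size.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# For a small dataset, 200 partitions means hundreds of tiny tasks; lowering
# the value reduces that scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", 16)

df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)
agg = df.groupBy("C_MKTSEGMENT").count()   # wide transformation -> shuffle

# Roughly 16 partitions now back the shuffled result (adaptive query
# execution, if enabled, may coalesce them further).
print(agg.rdd.getNumPartitions())
```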

Getting ready

You can...

Storage benefits of different file types

Storage formats are a way to define how data is stored in a file. Hadoop doesn't have a default file format, but it supports multiple file formats for storing data. Some of the common storage formats for Hadoop are as follows:

  • Text files
  • Sequence files
  • Parquet files
  • Record-columnar (RC) files
  • Optimized row columnar (ORC) files
  • Avro files

Choosing a write file format will provide significant advantages, such as the following:

  • Optimized performance while reading and writing data
  • Schema evolution support (allows us to change the attributes in a dataset)
  • Higher compression, resulting in less storage space being required
  • Splittable files (files can be read in parts)

Let's focus on columnar storage formats as they are widely used in big data applications because of how they store data and can be queried by the SQL engine. The columnar format is very useful when a subset of data...
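
To see the difference in practice, a small sketch like the following (paths and columns are placeholders) writes the same data as CSV and as Parquet; the Parquet copy is typically much smaller on disk, and selecting a few columns from it only scans those columns:

```python
df = spark.read.csv("/mnt/rawdata/Customer/csvFiles",
                    header=True, inferSchema=True)

# Row-oriented text output.
df.write.mode("overwrite").csv("/mnt/rawdata/formats/customer_csv", header=True)

# Columnar, compressed output.
df.write.mode("overwrite").parquet("/mnt/rawdata/formats/customer_parquet")

# Reading a subset of columns from Parquet only touches those column chunks.
spark.read.parquet("/mnt/rawdata/formats/customer_parquet") \
     .select("C_CUSTKEY", "C_MKTSEGMENT") \
     .show(5)
```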
