Chapter 5: Big Data Analytics

Optimizations in Apache Spark play a crucial role when building big data solutions. Knowledge and experience in tuning Spark-based workloads help organizations save cost and time when running these workloads in the cloud. In this chapter, we will learn about various optimization techniques concerning Spark DataFrames and big data analytics in general. We will learn about the limitations of the collect() method and of inferSchema when reading data. This will be followed by an overview of the best practices for working with CSV files, Parquet files, Pandas, and Koalas. We will also learn about some powerful optimization techniques, such as column predicate pushdown, column pruning, and partitioning strategies.

The topics covered in this chapter are as follows:

  • Understanding the collect() method
  • Understanding the use of inferSchema
  • Learning to differentiate between CSV and Parquet
  • Learning to differentiate between...

Technical requirements

To follow the hands-on tutorials in this chapter, you will need access to the following:

To start off, let's spin up a Spark cluster with the following configurations:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Disabled
  • Driver Type: Standard_DS3_v2

Now, create a new notebook and attach it to the newly created cluster to get started!

Understanding the collect() method

Spark's collect() function is an action, and it is used to retrieve all the elements of a Resilient Distributed Dataset (RDD) or DataFrame back to the driver. We will first take a look at an example of using the function. Run the following code block:

from pyspark.sql.functions import *

# Read every airline CSV file from the Databricks sample datasets,
# using the header row and letting Spark infer the schema
airlines_1987_to_2008 = (
  spark
  .read
  .option("header",True)
  .option("delimiter",",")
  .option("inferSchema",True)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)

# display() renders the first 1,000 records in the notebook
display(airlines_1987_to_2008)

The preceding code block creates a Spark DataFrame and displays the first 1,000 records. Now, let's run some code with the collect() function:

airlines_1987_to_2008.select('Year').distinct().collect()

The preceding line of code returns a list of row objects for the Year column values. A row object is a collection of fields that can be iterated...
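Since collect() brings every selected row back to the driver as a Python list of Row objects, it should only be used on small results. The following is a minimal sketch (using the airlines DataFrame created above) of working with the returned Row objects, along with bounded alternatives; the variable names are illustrative:

# collect() returns a Python list of Row objects on the driver
distinct_years = airlines_1987_to_2008.select('Year').distinct().collect()

# Each Row behaves like a named tuple, so fields can be accessed by name
year_values = [row.Year for row in distinct_years]
print(sorted(year_values))

# For large DataFrames, prefer bounded alternatives that do not pull every row to the driver
first_rows = airlines_1987_to_2008.take(5)            # at most 5 Row objects
preview = airlines_1987_to_2008.limit(5).toPandas()   # bounded conversion to pandas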

Understanding the use of inferSchema

The inferSchema option is very often used to make Spark infer the data types automatically. While this approach works well for smaller datasets, performance bottlenecks can develop as the size of the data being scanned increases. In order to better understand the challenges that come with using this option for big data, we will perform a couple of experiments.

Experiment 1

In this experiment, we will re-run the code block that we ran in the previous section:

airlines_1987_to_2008 = (
  spark
  .read
  .option("header",True)
  .option("delimiter",",")
  .option("inferSchema",True)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)
display(airlines_1987_to_2008)

The code block simply reads CSV files and creates a Spark DataFrame by automatically inferring the schema. Note the time it takes for the job to run. For us, it took...
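One way around the inference scan is to declare the schema up front. The following is a hedged sketch of the pattern: only the first few columns of the airlines files are listed here for brevity (Spark's default PERMISSIVE CSV mode drops the extra trailing columns), whereas a complete schema would list every column in file order:

from pyspark.sql.types import StructType, StructField, IntegerType

# Declaring the schema up front lets Spark skip the costly inference pass.
# Only the first four columns are shown here for illustration.
airlines_schema = StructType([
    StructField('Year', IntegerType(), True),
    StructField('Month', IntegerType(), True),
    StructField('DayofMonth', IntegerType(), True),
    StructField('DayOfWeek', IntegerType(), True),
])

airlines_with_schema = (
  spark
  .read
  .option("header", True)
  .option("delimiter", ",")
  .schema(airlines_schema)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)
display(airlines_with_schema)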

Learning to differentiate CSV and Parquet

In the majority of cases, data scientists are more used to CSV files than Parquet files. When they start using Databricks and Spark, they naturally tend to continue working with CSV files. Making the switch to Parquet might seem daunting at first, but in the long run, it reaps huge returns!

Let's first discuss the advantages and disadvantages of CSV and Parquet files (a short conversion sketch follows these lists):

Advantages of CSV files:

  • CSV is the most common file format among data scientists and users.
  • CSV files are human-readable, as the data is not encoded before storing. They are also easy to edit.
  • Parsing CSV files is very easy, and they can be read by almost any text editor.

Advantages of Parquet files:

  • Parquet files are compressed using various compression algorithms, which is why they consume less space.
  • Being a columnar storage type, Parquet files are very efficient when reading and querying data.
  • The file...
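As a short, hedged sketch of making the switch (the output path below is illustrative, not from the book), converting a CSV-backed DataFrame to Parquet is a single write, and Parquet reads need no schema inference because the schema is stored in the file metadata:

# Write the airlines DataFrame as compressed, columnar Parquet files
# (the target path is illustrative)
airlines_1987_to_2008.write.mode("overwrite").parquet("dbfs:/tmp/airlines_parquet")

# Reading Parquet back is fast: the schema comes from the file metadata
airlines_parquet = spark.read.parquet("dbfs:/tmp/airlines_parquet")
airlines_parquet.printSchema()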

Learning to differentiate Pandas and Koalas

The Pandas project is a very popular data transformation library in Python that is widely used for data analytics and data science purposes. Put simply, it's the bread and butter of data science for the majority of data scientists. But Pandas has some limitations: it is not built for working with big data or distributed datasets. Pandas code, when executed in Databricks, runs only on the driver. This creates a performance bottleneck as the data size increases.

On the other hand, when data analysts and data scientists start working with Spark, they need to learn PySpark as an alternative. To ease this transition, the creators of Databricks came up with another project and named it Koalas. This project has been built to allow data scientists working with Pandas to become productive with Apache Spark. It is nothing but a Pandas DataFrame API built on top of Apache Spark. Therefore, it leverages very...
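As a minimal sketch (assuming the databricks.koalas package is available in the runtime, as it is on recent Databricks runtimes, or after installing it with pip), the same pandas-style syntax runs distributed on Spark:

import databricks.koalas as ks

# Convert the existing Spark DataFrame into a Koalas DataFrame;
# the pandas-style API below is executed by Spark under the hood
airlines_koalas = airlines_1987_to_2008.to_koalas()

# Familiar pandas-style operations, distributed across the cluster
print(airlines_koalas.columns)
print(airlines_koalas["Year"].value_counts().head())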

Understanding built-in Spark functions

Spark gives us several built-in functions for working with DataFrames. These functions are built in such a way that they can be optimized by the Catalyst optimizer. The Catalyst optimizer is an essential component of Spark that optimizes our code using advanced programming constructs. It works very well with Spark DataFrames and built-in functions (including higher-order functions). However, the Catalyst optimizer treats a UDF as a black box, and as a result, we see performance bottlenecks.

To learn about all the built-in functions in PySpark, check out the official PySpark documentation.

In the following example, we are going to see performance differences between Spark higher-order functions and UDFs:

  1. Let's begin by creating a Spark DataFrame in a new cell:
    from pyspark.sql.types import *
    manual_schema = StructType([
      StructField('Year',IntegerType(),True),
      StructField(...
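The step above is truncated here. As a hedged, self-contained sketch of the contrast this example works toward, the following compares the built-in upper() function with an equivalent Python UDF on a small illustrative DataFrame (the data and names are not from the book):

from pyspark.sql.functions import upper, udf, col
from pyspark.sql.types import StringType

# A small illustrative DataFrame (not the airlines data)
carriers = spark.createDataFrame(
    [("aa",), ("dl",), ("ua",)], ["carrier_code"]
)

# Built-in function: Catalyst understands upper() and can optimize around it
builtin_result = carriers.select(upper(col("carrier_code")).alias("carrier_upper"))

# Python UDF: Catalyst treats the function body as a black box, and rows must be
# serialized to a Python worker and back, which is where the slowdown comes from
to_upper_udf = udf(lambda value: value.upper() if value else None, StringType())
udf_result = carriers.select(to_upper_udf(col("carrier_code")).alias("carrier_upper"))

builtin_result.show()
udf_result.show()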

Learning column predicate pushdown

Column predicate pushdown is an optimization technique where we push filters down to the data source level to reduce the amount of data getting scanned. This greatly improves job performance, as Spark only reads the data that is needed for its operations. For example, if we are reading from a Postgres database, we can push down a filter to the database to ensure that Spark only reads the required data. The same can be applied to Parquet and Delta files as well. While writing Parquet and Delta files to the storage account, we can partition them by one or more columns. And while reading, we can push down a filter to read only the required partitions.

In the following steps, we will look at an example of column predicate pushdown with Parquet files:

  1. To get started, we will re-create our airlines DataFrame in a new cell:
    from pyspark.sql.types import *
    manual_schema = StructType([
      StructField('Year',IntegerType(),True),
      StructField...
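The steps above are truncated here. As a hedged sketch of the pattern (the output path is illustrative, not from the book), the following writes the airlines data partitioned by Year and then reads it back with a filter on the partition column, so that only the matching partition directories are scanned:

from pyspark.sql.functions import col

# Write the airlines data partitioned by Year (illustrative path)
(
    airlines_1987_to_2008
    .write
    .mode("overwrite")
    .partitionBy("Year")
    .parquet("dbfs:/tmp/airlines_partitioned")
)

# A filter on the partition column is pushed down to the source:
# Spark only lists and reads the Year=2000 directory instead of scanning everything
airlines_2000 = (
    spark.read
    .parquet("dbfs:/tmp/airlines_partitioned")
    .filter(col("Year") == 2000)
)

# The physical plan shows the pushed-down filter and the selected partitions
airlines_2000.explain()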

Learning partitioning strategies in Spark

In this section, we will discuss some of the useful strategies for Spark partitions and Apache Hive partitions. Whenever Spark processes data in memory, it breaks that data down into partitions, and these partitions are processed in the cores of the executors. These are the Spark partitions. On the other hand, Hive partitions help to organize persisted tables into parts based on columns.

Understanding Spark partitions

Before we learn about the strategies to manage Spark partitions, we need to know the number of partitions for any given DataFrame:

  1. To check the number of Spark partitions of a given DataFrame, we use the following syntax: dataframe.rdd.getNumPartitions(). Also, remember that the total number of tasks doing work on a Spark DataFrame is equal to the total number of partitions of that DataFrame (see the sketch after these steps).
  2. Next, we will learn how to check the number of records in each Spark partition. We will begin with re-creating the airlines DataFrame...
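The steps above are truncated here. As a hedged sketch of both checks, using the airlines DataFrame created earlier and the spark_partition_id() built-in function:

from pyspark.sql.functions import spark_partition_id

# Number of Spark partitions backing the DataFrame
print(airlines_1987_to_2008.rdd.getNumPartitions())

# Record count per partition: tag each row with its partition ID, then group by it
(
    airlines_1987_to_2008
    .withColumn("partition_id", spark_partition_id())
    .groupBy("partition_id")
    .count()
    .orderBy("partition_id")
    .show()
)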

Understanding Spark SQL optimizations

In this section, we will learn about how to write efficient Spark SQL queries, along with tips to help optimize the existing SQL queries:

  • Avoid using NOT IN in SQL queries, as it is a very expensive operation.
  • Filter the data before performing join operations by applying the WHERE clause before joining the tables.
  • In the SELECT clause, specify only the columns required for the operation instead of using * to select all of them unnecessarily.
  • Avoid using LIKE in the WHERE clause, as it is another expensive operation.
  • Try not to join the same set of tables multiple times. Instead, write a common table expression (CTE) using the WITH clause to create a subquery, and reuse it wherever the join is needed (see the sketch after this list).
  • When joining the same table for different conditions, use CASE statements.
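As a hedged sketch of the CTE and CASE suggestions above (the view name and thresholds are illustrative; the column names are taken from the airlines dataset loaded earlier):

# Register the DataFrame as a temporary view so it can be queried with Spark SQL
airlines_1987_to_2008.createOrReplaceTempView("airlines")

delayed_summary = spark.sql("""
    WITH delayed_flights AS (                      -- CTE: filter once, reuse below
        SELECT Year, UniqueCarrier, CAST(ArrDelay AS INT) AS ArrDelay
        FROM airlines
        WHERE CAST(ArrDelay AS INT) > 0            -- filter before aggregating
    )
    SELECT
        Year,
        UniqueCarrier,
        COUNT(*) AS delayed_count,
        SUM(CASE WHEN ArrDelay > 60 THEN 1 ELSE 0 END) AS severely_delayed  -- CASE instead of re-joining
    FROM delayed_flights
    GROUP BY Year, UniqueCarrier
""")
delayed_summary.show()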

In the next and final section of this chapter, we will learn about...

Understanding bucketing in Spark

Bucketing is an optimization technique that helps to prevent the shuffling and sorting of data during compute-heavy operations such as joins. Based on the bucketing columns we specify, data is collected into a number of bins. Bucketing is similar to partitioning, but in the case of partitioning, we create directories for each partition. In bucketing, we create equal-sized buckets, and data is distributed across these buckets by hashing the values of the bucketing columns. Partitioning is helpful when filtering data, whereas bucketing is more helpful during joins.

It is often helpful to perform bucketing on dimension tables that contain primary keys for joining. Bucketing is also helpful when join operations are being performed between small and large tables. In this section, we will go through a quick example to understand how to implement bucketing on a Hive table:

  1. We will begin by creating a Spark DataFrame in a new cell. Run the following code block:
    from...
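The step above is truncated here. As a hedged sketch of bucketing (the table name, bucket count, and bucketing column are illustrative, not from the book), the following writes the airlines DataFrame as a bucketed Hive table:

# Write the airlines data as a bucketed table; the explicit parquet format is used
# because bucketing requires a Hive-style table
(
    airlines_1987_to_2008
    .write
    .mode("overwrite")
    .format("parquet")
    .bucketBy(8, "UniqueCarrier")      # hash rows into 8 buckets on the join key
    .sortBy("UniqueCarrier")           # optionally sort within each bucket
    .saveAsTable("airlines_bucketed")
)

# Joins or aggregations on UniqueCarrier against a table bucketed the same way
# can now avoid a full shuffle of this table
spark.table("airlines_bucketed").groupBy("UniqueCarrier").count().show()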

Summary

In this chapter, we learned about several useful techniques to optimize Spark jobs when working with Spark DataFrames. We started by learning about the collect() method and when to avoid using it, and ended with a discussion of SQL optimization best practices and bucketing. We also learned why Parquet files and Koalas should be adopted by data scientists using Databricks.

In the next chapter, we will learn about some of the most powerful optimization techniques with Delta Lake. We will develop a theoretical understanding of these optimizations, and we'll write code to understand their practical use in different scenarios.

Authors (3)

Anirudh Kala

Anirudh Kala is an expert in machine learning techniques, artificial intelligence, and natural language processing. He has helped multiple organizations to run their large-scale data warehouses with quantitative research, natural language generation, data science exploration, and big data implementation. He has worked in every aspect of data analytics using the Azure data platform. Currently, he works as the director of Celebal Technologies, a data science boutique firm dedicated to large-scale analytics. Anirudh holds a computer engineering degree from the University of Rajasthan and his work history features the likes of IBM and ZS Associates.

Anshul Bhatnagar

Anshul Bhatnagar is an experienced, hands-on data architect involved in the architecture, design, and implementation of data platform architectures, and distributed systems. He has worked in the IT industry since 2015 in a range of roles such as Hadoop/Spark developer, data engineer, and data architect. He has also worked in many other sectors including energy, media, telecoms, and e-commerce. He is currently working for a data and AI boutique company, Celebal Technologies, in India. He is always keen to hear about new ideas and technologies in the areas of big data and AI, so look him up on LinkedIn to ask questions or just to say hi.

Sarthak Sarbahi

Sarthak Sarbahi is a certified data engineer and analyst with a wide technical breadth and a deep understanding of Databricks. His background has led him to a variety of cloud data services with an eye toward data warehousing, big data analytics, robust data engineering, data science, and business intelligence. Sarthak graduated with a degree in mechanical engineering.