Chapter 5: Big Data Analytics

Optimizations in Apache Spark play a crucial role when building big data solutions. Knowledge and experience in tuning Spark-based workloads help organizations save cost and time when running these workloads in the cloud. In this chapter, we will learn about various optimization techniques concerning Spark DataFrames and big data analytics in general. We will learn about the limitations of the collect() method and of inferSchema when reading data. This will be followed by an overview of the best practices for working with CSV files, Parquet files, Pandas, and Koalas. We will also learn about some powerful optimization techniques, such as column predicate pushdown, column pruning, and partitioning strategies.

The topics covered in this chapter are as follows:

  • Understanding the collect() method
  • Understanding the use of inferSchema
  • Learning to differentiate between CSV and Parquet
  • Learning to differentiate between...

Technical requirements

To follow the hands-on tutorials in this chapter, you will need access to the following:

To start off, let's spin up a Spark cluster with the following configurations:

  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Disabled
  • Driver Type: Standard_DS3_v2

Now, create a new notebook and attach it to the newly created cluster to get started!

Understanding the collect() method

Spark's collect() function is an action, and it is used to retrieve all the elements of a Resilient Distributed Dataset (RDD) or DataFrame back to the driver. We will first take a look at an example of using the function. Run the following code block:

from pyspark.sql.functions import *

# Read every airline CSV file from the Databricks sample datasets,
# using the header row and letting Spark infer the schema
airlines_1987_to_2008 = (
  spark
  .read
  .option("header",True)
  .option("delimiter",",")
  .option("inferSchema",True)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)

# display() renders the first 1,000 records in the notebook
display(airlines_1987_to_2008)

The preceding code block creates a Spark DataFrame and displays the first 1,000 records. Now, let's run some code with the collect() function:

airlines_1987_to_2008.select('Year').distinct().collect()

The preceding line of code returns a list of row objects for the Year column values. A row object is a collection of fields that can be iterated...
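Since collect() brings every selected row back to the driver as a Python list of Row objects, it should only be used on small results. The following is a minimal sketch (using the airlines DataFrame created above) of working with the returned Row objects, along with bounded alternatives; the variable names are illustrative:

# collect() returns a Python list of Row objects on the driver
distinct_years = airlines_1987_to_2008.select('Year').distinct().collect()

# Each Row behaves like a named tuple, so fields can be accessed by name
year_values = [row.Year for row in distinct_years]
print(sorted(year_values))

# For large DataFrames, prefer bounded alternatives that do not pull every row to the driver
first_rows = airlines_1987_to_2008.take(5)            # at most 5 Row objects
preview = airlines_1987_to_2008.limit(5).toPandas()   # bounded conversion to pandas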

Understanding the use of inferSchema

The inferSchema option is very often used to make Spark infer the data types automatically. While this approach works well for smaller datasets, performance bottlenecks can develop as the size of the data being scanned increases. In order to better understand the challenges that come with using this option for big data, we will perform a couple of experiments.

Experiment 1

In this experiment, we will re-run the code block that we ran in the previous section:

airlines_1987_to_2008 = (
  spark
  .read
  .option("header",True)
  .option("delimiter",",")
  .option("inferSchema",True)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)
display(airlines_1987_to_2008)

The code block simply reads CSV files and creates a Spark DataFrame by automatically inferring the schema. Note the time it takes for the job to run. For us, it took...
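One way around the inference scan is to declare the schema up front. The following is a hedged sketch of the pattern: only the first few columns of the airlines files are listed here for brevity (Spark's default PERMISSIVE CSV mode drops the extra trailing columns), whereas a complete schema would list every column in file order:

from pyspark.sql.types import StructType, StructField, IntegerType

# Declaring the schema up front lets Spark skip the costly inference pass.
# Only the first four columns are shown here for illustration.
airlines_schema = StructType([
    StructField('Year', IntegerType(), True),
    StructField('Month', IntegerType(), True),
    StructField('DayofMonth', IntegerType(), True),
    StructField('DayOfWeek', IntegerType(), True),
])

airlines_with_schema = (
  spark
  .read
  .option("header", True)
  .option("delimiter", ",")
  .schema(airlines_schema)
  .csv("dbfs:/databricks-datasets/asa/airlines/*")
)
display(airlines_with_schema)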

Learning to differentiate CSV and Parquet

In the majority of cases, data scientists are more used to CSV files than Parquet files. When they start using Databricks and Spark, they naturally tend to continue working with CSV files. Making the switch to Parquet might seem daunting at first, but in the long run, it reaps huge returns!

Let's first discuss the advantages and disadvantages of CSV and Parquet files (a short conversion sketch follows these lists):

Advantages of CSV files:

  • CSV is the most common file format among data scientists and users.
  • CSV files are human-readable, as the data is not encoded before storing. They are also easy to edit.
  • Parsing CSV files is very easy, and they can be read by almost any text editor.

Advantages of Parquet files:

  • Parquet files are compressed using various compression algorithms, which is why they consume less space.
  • Being a columnar storage type, Parquet files are very efficient when reading and querying data.
  • The file...
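As a short, hedged sketch of making the switch (the output path below is illustrative, not from the book), converting a CSV-backed DataFrame to Parquet is a single write, and Parquet reads need no schema inference because the schema is stored in the file metadata:

# Write the airlines DataFrame as compressed, columnar Parquet files
# (the target path is illustrative)
airlines_1987_to_2008.write.mode("overwrite").parquet("dbfs:/tmp/airlines_parquet")

# Reading Parquet back is fast: the schema comes from the file metadata
airlines_parquet = spark.read.parquet("dbfs:/tmp/airlines_parquet")
airlines_parquet.printSchema()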

Learning to differentiate Pandas and Koalas

The Pandas project is a very popular data transformation library in Python that is widely used for data analytics and data science purposes. Put simply, it's the bread and butter of data science for the majority of data scientists. But Pandas has some limitations: it is not built for working with big data or distributed datasets. Pandas code, when executed in Databricks, runs only on the driver. This creates a performance bottleneck as the data size increases.

On the other hand, when data analysts and data scientists start working with Spark, they need to learn PySpark as an alternative. To ease this transition, the creators of Databricks came up with another project and named it Koalas. This project has been built to allow data scientists working with Pandas to become productive with Apache Spark. It is nothing but a Pandas DataFrame API built on top of Apache Spark. Therefore, it leverages very...
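As a minimal sketch (assuming the databricks.koalas package is available in the runtime, as it is on recent Databricks runtimes, or after installing it with pip), the same pandas-style syntax runs distributed on Spark:

import databricks.koalas as ks

# Convert the existing Spark DataFrame into a Koalas DataFrame;
# the pandas-style API below is executed by Spark under the hood
airlines_koalas = airlines_1987_to_2008.to_koalas()

# Familiar pandas-style operations, distributed across the cluster
print(airlines_koalas.columns)
print(airlines_koalas["Year"].value_counts().head())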

Understanding built-in Spark functions

Spark gives us several built-in functions for working with DataFrames. These functions are built in such a way that they can be optimized by the Catalyst optimizer. The Catalyst optimizer is an essential component of Spark that optimizes our code using advanced programming constructs. It works very well with Spark DataFrames and built-in functions (including higher-order functions). However, the Catalyst optimizer treats a UDF as a black box, and as a result, we see performance bottlenecks.

To learn about all the built-in functions in PySpark, check out the official PySpark documentation.

In the following example, we are going to see performance differences between Spark higher-order functions and UDFs:

  1. Let's begin by creating a Spark DataFrame in a new cell:
    from pyspark.sql.types import *
    manual_schema = StructType([
      StructField('Year',IntegerType(),True),
      StructField(...
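The step above is truncated here. As a hedged, self-contained sketch of the contrast this example works toward, the following compares the built-in upper() function with an equivalent Python UDF on a small illustrative DataFrame (the data and names are not from the book):

from pyspark.sql.functions import upper, udf, col
from pyspark.sql.types import StringType

# A small illustrative DataFrame (not the airlines data)
carriers = spark.createDataFrame(
    [("aa",), ("dl",), ("ua",)], ["carrier_code"]
)

# Built-in function: Catalyst understands upper() and can optimize around it
builtin_result = carriers.select(upper(col("carrier_code")).alias("carrier_upper"))

# Python UDF: Catalyst treats the function body as a black box, and rows must be
# serialized to a Python worker and back, which is where the slowdown comes from
to_upper_udf = udf(lambda value: value.upper() if value else None, StringType())
udf_result = carriers.select(to_upper_udf(col("carrier_code")).alias("carrier_upper"))

builtin_result.show()
udf_result.show()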

Learning column predicate pushdown

Column predicate pushdown is an optimization technique where we push filters down to the data source level to reduce the amount of data getting scanned. This greatly improves job performance, as Spark only reads the data that is needed for its operations. For example, if we are reading from a Postgres database, we can push down a filter to the database to ensure that Spark only reads the required data. The same can be applied to Parquet and Delta files as well. While writing Parquet and Delta files to the storage account, we can partition them by one or more columns. And while reading, we can push down a filter to read only the required partitions.

In the following steps, we will look at an example of column predicate pushdown with Parquet files:

  1. To get started, we will re-create our airlines DataFrame in a new cell:
    from pyspark.sql.types import *
    manual_schema = StructType([
      StructField('Year',IntegerType(),True),
      StructField...
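The steps above are truncated here. As a hedged sketch of the pattern (the output path is illustrative, not from the book), the following writes the airlines data partitioned by Year and then reads it back with a filter on the partition column, so that only the matching partition directories are scanned:

from pyspark.sql.functions import col

# Write the airlines data partitioned by Year (illustrative path)
(
    airlines_1987_to_2008
    .write
    .mode("overwrite")
    .partitionBy("Year")
    .parquet("dbfs:/tmp/airlines_partitioned")
)

# A filter on the partition column is pushed down to the source:
# Spark only lists and reads the Year=2000 directory instead of scanning everything
airlines_2000 = (
    spark.read
    .parquet("dbfs:/tmp/airlines_partitioned")
    .filter(col("Year") == 2000)
)

# The physical plan shows the pushed-down filter and the selected partitions
airlines_2000.explain()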

Learning partitioning strategies in Spark

In this section, we will discuss some of the useful strategies for Spark partitions and Apache Hive partitions. Whenever Spark processes data in memory, it breaks that data down into partitions, and these partitions are processed in the cores of the executors. These are the Spark partitions. On the other hand, Hive partitions help to organize persisted tables into parts based on columns.

Understanding Spark partitions

Before we learn about the strategies to manage Spark partitions, we need to know the number of partitions for any given DataFrame:

  1. To check the number of Spark partitions of a given DataFrame, we use the following syntax: dataframe.rdd.getNumPartitions(). Also, remember that the total number of tasks doing work on a Spark DataFrame is equal to the total number of partitions of that DataFrame (see the sketch after these steps).
  2. Next, we will learn how to check the number of records in each Spark partition. We will begin with re-creating the airlines DataFrame...
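The steps above are truncated here. As a hedged sketch of both checks, using the airlines DataFrame created earlier and the spark_partition_id() built-in function:

from pyspark.sql.functions import spark_partition_id

# Number of Spark partitions backing the DataFrame
print(airlines_1987_to_2008.rdd.getNumPartitions())

# Record count per partition: tag each row with its partition ID, then group by it
(
    airlines_1987_to_2008
    .withColumn("partition_id", spark_partition_id())
    .groupBy("partition_id")
    .count()
    .orderBy("partition_id")
    .show()
)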

Understanding Spark SQL optimizations

In this section, we will learn about how to write efficient Spark SQL queries, along with tips to help optimize the existing SQL queries:

  • Avoid using NOT IN in SQL queries, as it is a very expensive operation.
  • Filter the data before performing join operations by applying the WHERE clause before joining the tables.
  • In the SELECT clause, specify only the columns required for the operation instead of using * to select all of them unnecessarily.
  • Avoid using LIKE in the WHERE clause, as it is another expensive operation.
  • Try not to join the same set of tables multiple times. Instead, write a common table expression (CTE) using the WITH clause to create a subquery, and reuse it wherever the join is needed (see the sketch after this list).
  • When joining the same table for different conditions, use CASE statements.
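As a hedged sketch of the CTE and CASE suggestions above (the view name and thresholds are illustrative; the column names are taken from the airlines dataset loaded earlier):

# Register the DataFrame as a temporary view so it can be queried with Spark SQL
airlines_1987_to_2008.createOrReplaceTempView("airlines")

delayed_summary = spark.sql("""
    WITH delayed_flights AS (                      -- CTE: filter once, reuse below
        SELECT Year, UniqueCarrier, CAST(ArrDelay AS INT) AS ArrDelay
        FROM airlines
        WHERE CAST(ArrDelay AS INT) > 0            -- filter before aggregating
    )
    SELECT
        Year,
        UniqueCarrier,
        COUNT(*) AS delayed_count,
        SUM(CASE WHEN ArrDelay > 60 THEN 1 ELSE 0 END) AS severely_delayed  -- CASE instead of re-joining
    FROM delayed_flights
    GROUP BY Year, UniqueCarrier
""")
delayed_summary.show()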

In the next and final section of this chapter, we will learn about...

Understanding bucketing in Spark

Bucketing is an optimization technique that helps to prevent the shuffling and sorting of data during compute-heavy operations such as joins. Based on the bucketing columns we specify, data is collected into a number of bins. Bucketing is similar to partitioning, but in the case of partitioning, we create directories for each partition. In bucketing, we create equal-sized buckets, and data is distributed across these buckets by hashing the values of the bucketing columns. Partitioning is helpful when filtering data, whereas bucketing is more helpful during joins.

It is often helpful to perform bucketing on dimension tables that contain primary keys for joining. Bucketing is also helpful when join operations are being performed between small and large tables. In this section, we will go through a quick example to understand how to implement bucketing on a Hive table:

  1. We will begin by creating a Spark DataFrame in a new cell. Run the following code block:
    from...
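The step above is truncated here. As a hedged sketch of bucketing (the table name, bucket count, and bucketing column are illustrative, not from the book), the following writes the airlines DataFrame as a bucketed Hive table:

# Write the airlines data as a bucketed table; the explicit parquet format is used
# because bucketing requires a Hive-style table
(
    airlines_1987_to_2008
    .write
    .mode("overwrite")
    .format("parquet")
    .bucketBy(8, "UniqueCarrier")      # hash rows into 8 buckets on the join key
    .sortBy("UniqueCarrier")           # optionally sort within each bucket
    .saveAsTable("airlines_bucketed")
)

# Joins or aggregations on UniqueCarrier against a table bucketed the same way
# can now avoid a full shuffle of this table
spark.table("airlines_bucketed").groupBy("UniqueCarrier").count().show()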

Summary

In this chapter, we learned about several useful techniques to optimize Spark jobs when working with Spark DataFrames. We started by learning about the collect() method and when to avoid using it, and ended with a discussion of SQL optimization best practices and bucketing. We also learned why Parquet files and Koalas should be adopted by data scientists using Databricks.

In the next chapter, we will learn about some of the most powerful optimization techniques with Delta Lake. We will develop a theoretical understanding of these optimizations, and we'll write code to understand their practical use in different scenarios.

Authors (3)

Anirudh Kala

Anirudh Kala is an expert in machine learning techniques, artificial intelligence, and natural language processing. He has helped multiple organizations to run their large-scale data warehouses with quantitative research, natural language generation, data science exploration, and big data implementation. He has worked in every aspect of data analytics using the Azure data platform. Currently, he works as the director of Celebal Technologies, a data science boutique firm dedicated to large-scale analytics. Anirudh holds a computer engineering degree from the University of Rajasthan and his work history features the likes of IBM and ZS Associates.

Anshul Bhatnagar

Anshul Bhatnagar is an experienced, hands-on data architect involved in the architecture, design, and implementation of data platform architectures, and distributed systems. He has worked in the IT industry since 2015 in a range of roles such as Hadoop/Spark developer, data engineer, and data architect. He has also worked in many other sectors including energy, media, telecoms, and e-commerce. He is currently working for a data and AI boutique company, Celebal Technologies, in India. He is always keen to hear about new ideas and technologies in the areas of big data and AI, so look him up on LinkedIn to ask questions or just to say hi.

Sarthak Sarbahi

Sarthak Sarbahi is a certified data engineer and analyst with a wide technical breadth and a deep understanding of Databricks. His background has led him to a variety of cloud data services with an eye toward data warehousing, big data analytics, robust data engineering, data science, and business intelligence. Sarthak graduated with a degree in mechanical engineering.