You're reading from  Essential PySpark for Scalable Data Analytics

Product type: Book
Published in: Oct 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800568877
Edition: 1st Edition
Author (1)
Sreeram Nudurupati

Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.

Optimizing Spark SQL performance

In the previous section, you learned how the Catalyst optimizer runs user code through a series of optimization steps until it derives an optimal execution plan. To take advantage of the Catalyst optimizer, write Spark code that goes through the Spark SQL engine (that is, Spark SQL and the DataFrame APIs) and avoid RDD-based Spark code as much as possible. The Catalyst optimizer has no visibility into UDFs: it treats them as opaque, so it cannot optimize the logic inside them, and code that relies on UDFs can degrade performance. It is therefore recommended to use built-in functions instead of UDFs, or to define functions in Scala or Java and then call them from the SQL and Python APIs.

Though Spark SQL supports text-based file formats such as CSV and JSON, it is recommended to use binary serialized data formats such as Parquet, Avro, and ORC. Semi-structured formats such as CSV or JSON incur performance costs, firstly during the schema inference phase, as they...
