Chapter 12: Spark SQL Primer

In the previous chapter, you learned about data visualization as a powerful and key tool of data analytics. You also learned about various Python visualization libraries that can be used to visualize data in pandas DataFrames. An equally important and ubiquitous skill in any data analytics professional's repertoire is Structured Query Language, or SQL. SQL has existed for as long as the field of data analytics itself, and even with the advent of big data, data science, and machine learning (ML), it remains indispensable.

This chapter introduces you to the basics of SQL and looks at how SQL can be applied in a distributed computing setting via Spark SQL. You will learn about the various components that make up Spark SQL, including storage, the metastore, and the actual query execution engine. We will look at the differences between Hadoop Hive and Spark SQL, and finally end with some techniques for improving the performance...

Technical requirements

Here is what you'll need for this chapter:

Introduction to SQL

SQL is a declarative language for storing, manipulating, and querying data held in relational databases, which are managed by relational database management systems (RDBMSes). A relational database stores data in tables, which in turn contain rows and columns. In the real world, entities have relationships among themselves, and a relational database tries to mimic these real-world relationships as relationships between tables. Thus, in relational databases, individual tables contain data about individual entities, and these tables may be related to one another.

As a declarative language, SQL lets you specify which rows and columns you want to retrieve from a given table and which constraints should be used to filter the data. An RDBMS contains a query optimizer that turns a SQL statement into a query plan and executes it on the database engine. The query plan is finally translated into an execution plan for the database engine to read table rows and columns into...
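To make the declarative nature of SQL concrete, here is a minimal, hypothetical example using Python's built-in sqlite3 module; the table and data are made up for illustration. The query states which rows and columns are wanted, and the database's optimizer decides how to fetch them:

import sqlite3

# Create an in-memory relational database with a single table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "US"), (2, "Bob", "UK"), (3, "Chen", "US")])

# Declare WHICH rows and columns are wanted; the optimizer decides HOW
rows = conn.execute(
    "SELECT name FROM customers WHERE country = 'US'").fetchall()
print(rows)  # [('Alice',), ('Chen',)]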

Introduction to Spark SQL

Spark SQL brings native support for SQL to Apache Spark and unifies the process of querying data stored both in Spark DataFrames and in external data sources. It treats DataFrames and relational tables uniformly, making it easy for developers to intermix SQL commands with DataFrame operations and queries against external data for complex analytics. With the release of Apache Spark 1.3, Spark DataFrames, powered by Spark SQL, became Spark's de facto abstraction for expressing data processing code, while resilient distributed datasets (RDDs) remain Spark's core abstraction, as shown in the following diagram:

Figure 12.2 – Spark SQL architecture

As the previous diagram shows, most of Spark's components now leverage Spark SQL and DataFrames. Spark SQL provides more information about the structure of the data and the computation being performed, and the Spark SQL engine uses this extra information to perform additional...
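The following is a minimal sketch of how DataFrames and SQL can be intermixed on the same engine; the sales data and view name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small DataFrame standing in for data loaded from an external source
sales_df = spark.createDataFrame(
    [("2021-10-01", 120.0), ("2021-10-02", 87.5)],
    ["order_date", "amount"])

# Expose the DataFrame to the SQL engine as a temporary view
sales_df.createOrReplaceTempView("sales")

# SQL and DataFrame operations run on the same Spark SQL engine
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM sales GROUP BY order_date")
daily_totals.filter("total > 100").show()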

Spark SQL language reference

Being a part of the overarching Hadoop ecosystem, Spark has traditionally been Hive-compliant. While the Hive query language diverges greatly from the ANSI SQL standard, since Spark 3.0, Spark SQL can be made ANSI SQL-compliant using the spark.sql.ansi.enabled configuration. With this configuration enabled, Spark SQL uses an ANSI SQL-compliant dialect instead of a Hive dialect.

Even with ANSI SQL compliance enabled, Spark SQL may not entirely conform to the ANSI SQL dialect. In this section, we will explore some of the most prominent DDL and DML syntax of Spark SQL.
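As a rough sketch, the configuration can be toggled on an existing SparkSession (assumed here to be available as spark), either through the configuration API or with a SET statement:

# Switch Spark SQL from its Hive-derived dialect to ANSI-compliant behavior
spark.conf.set("spark.sql.ansi.enabled", "true")

# Equivalent SQL form
spark.sql("SET spark.sql.ansi.enabled = true")

# Under ANSI mode, an invalid cast raises an error instead of returning NULL
spark.sql("SELECT CAST('abc' AS INT)").show()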

Spark SQL DDL

The syntax for creating a database and a table using Spark SQL is presented as follows:

CREATE DATABASE IF NOT EXISTS feature_store;
CREATE TABLE IF NOT EXISTS feature_store.retail_features
USING DELTA
LOCATION '/FileStore/shared_uploads/delta/retail_features.delta';

In the previous code block, we do the following:

  • First, we create a database if it doesn't...
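If you prefer to issue the same DDL from a PySpark notebook or script, it can be wrapped in spark.sql() calls; this sketch assumes a SparkSession named spark and a Delta Lake-enabled environment:

spark.sql("CREATE DATABASE IF NOT EXISTS feature_store")
spark.sql("""
    CREATE TABLE IF NOT EXISTS feature_store.retail_features
    USING DELTA
    LOCATION '/FileStore/shared_uploads/delta/retail_features.delta'
""")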

Optimizing Spark SQL performance

In the previous section, you learned how the Catalyst optimizer optimizes user code by running it through a set of optimization steps until an optimal execution plan is derived. To take advantage of the Catalyst optimizer, it is recommended to use Spark code that leverages the Spark SQL engine (that is, the Spark SQL and DataFrame APIs) and to avoid RDD-based Spark code as much as possible. The Catalyst optimizer has no visibility into user-defined functions (UDFs), so users can end up writing sub-optimal code that degrades performance. It is therefore recommended to use built-in functions instead of UDFs, or to define functions in Scala or Java and then use them from the SQL and Python APIs.
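As an illustrative sketch (the column and data are made up, and a SparkSession named spark is assumed), the two statements below produce the same result, but only the built-in function is transparent to Catalyst; the Python UDF also pays the cost of serializing rows to and from a Python worker:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: a black box to Catalyst, plus Python serialization overhead
to_upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(to_upper_udf("name").alias("name_upper")).show()

# Built-in function: Catalyst can analyze and optimize this expression
df.select(F.upper("name").alias("name_upper")).show()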

Though Spark SQL supports file-based formats such as CSV and JSON, it is recommended to use serialized data formats such as Parquet, Avro, and ORC. Semi-structured formats such as CSV or JSON incur performance costs, firstly during the schema inference phase, as they...
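A minimal sketch of the difference (the file paths are hypothetical): reading CSV with schema inference forces an extra pass over the data, whereas Parquet stores the schema with the data itself:

# CSV: schema inference requires an additional scan over the raw text
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/tmp/retail_features.csv"))

# Parquet: columnar, compressed, and self-describing (schema in the footer)
csv_df.write.mode("overwrite").parquet("/tmp/retail_features.parquet")
parquet_df = spark.read.parquet("/tmp/retail_features.parquet")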

Summary

In this chapter, you learned about SQL as a declarative language that has been universally accepted as the language for structured data analysis because of its ease of use and expressiveness. You learned about the basic constructs of SQL, including its DDL and DML dialects. You were introduced to the Spark SQL engine as the unified distributed query engine that powers both the Spark SQL and DataFrame APIs. SQL optimizers in general were introduced, and Spark's own query optimizer, Catalyst, was presented along with how it takes a Spark SQL query and converts it into JVM bytecode. A Spark SQL language reference was also presented, covering the most important DDL and DML statements, with examples. Finally, a few performance optimization techniques were discussed to help you get the best out of Spark SQL for all your data analysis needs. In the next chapter, we will extend our Spark SQL knowledge and see how external...
