Chapter 12: Spark SQL Primer

In the previous chapter, you learned about data visualization as a powerful and key tool of data analytics. You also learned about various Python visualization libraries that can be used to visualize data in pandas DataFrames. An equally important and ubiquitous skill in any data analytics professional's repertoire is Structured Query Language, or SQL. SQL has existed for as long as the field of data analytics itself, and even with the advent of big data, data science, and machine learning (ML), it remains indispensable.

This chapter introduces you to the basics of SQL and looks at how SQL can be applied in a distributed computing setting via Spark SQL. You will learn about the various components that make up Spark SQL, including storage, the metastore, and the actual query execution engine. We will look at the differences between Hadoop Hive and Spark SQL, and finally end with some techniques for improving the performance...

Technical requirements

Here is what you'll need for this chapter:

Introduction to SQL

SQL is a declarative language for storing, manipulating, and querying data held in relational databases, which are managed by relational database management systems (RDBMSes). A relational database stores data in tables, which in turn contain rows and columns. In the real world, entities have relationships among themselves, and a relational database tries to mimic these real-world relationships as relationships between tables. Thus, in relational databases, individual tables contain data about individual entities, and these tables may be related to one another.

As a declarative language, SQL lets you specify which rows and columns you want to retrieve from a given table and which constraints should be used to filter the data. An RDBMS contains a query optimizer that turns a SQL statement into a query plan and executes it on the database engine. The query plan is finally translated into an execution plan for the database engine to read table rows and columns into...
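To make the declarative nature of SQL concrete, here is a minimal, hypothetical example using Python's built-in sqlite3 module; the table and data are made up for illustration. The query states which rows and columns are wanted, and the database's optimizer decides how to fetch them:

import sqlite3

# Create an in-memory relational database with a single table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "US"), (2, "Bob", "UK"), (3, "Chen", "US")])

# Declare WHICH rows and columns are wanted; the optimizer decides HOW
rows = conn.execute(
    "SELECT name FROM customers WHERE country = 'US'").fetchall()
print(rows)  # [('Alice',), ('Chen',)]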

Introduction to Spark SQL

Spark SQL brings native support for SQL to Apache Spark and unifies the process of querying data stored both in Spark DataFrames and in external data sources. It treats DataFrames and relational tables uniformly, making it easy for developers to intermix SQL commands with DataFrame operations and queries against external data for complex analytics. With the release of Apache Spark 1.3, Spark DataFrames, powered by Spark SQL, became Spark's de facto abstraction for expressing data processing code, while resilient distributed datasets (RDDs) remain Spark's core abstraction, as shown in the following diagram:

Figure 12.2 – Spark SQL architecture

As the previous diagram shows, most of Spark's components now leverage Spark SQL and DataFrames. Spark SQL provides more information about the structure of the data and the computation being performed, and the Spark SQL engine uses this extra information to perform additional...
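The following is a minimal sketch of how DataFrames and SQL can be intermixed on the same engine; the sales data and view name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small DataFrame standing in for data loaded from an external source
sales_df = spark.createDataFrame(
    [("2021-10-01", 120.0), ("2021-10-02", 87.5)],
    ["order_date", "amount"])

# Expose the DataFrame to the SQL engine as a temporary view
sales_df.createOrReplaceTempView("sales")

# SQL and DataFrame operations run on the same Spark SQL engine
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM sales GROUP BY order_date")
daily_totals.filter("total > 100").show()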

Spark SQL language reference

Being a part of the overarching Hadoop ecosystem, Spark has traditionally been Hive-compliant. While the Hive query language diverges greatly from the ANSI SQL standard, since Spark 3.0, Spark SQL can be made ANSI SQL-compliant using the spark.sql.ansi.enabled configuration. With this configuration enabled, Spark SQL uses an ANSI SQL-compliant dialect instead of a Hive dialect.

Even with ANSI SQL compliance enabled, Spark SQL may not entirely conform to the ANSI SQL dialect. In this section, we will explore some of the most prominent DDL and DML syntax of Spark SQL.
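As a rough sketch, the configuration can be toggled on an existing SparkSession (assumed here to be available as spark), either through the configuration API or with a SET statement:

# Switch Spark SQL from its Hive-derived dialect to ANSI-compliant behavior
spark.conf.set("spark.sql.ansi.enabled", "true")

# Equivalent SQL form
spark.sql("SET spark.sql.ansi.enabled = true")

# Under ANSI mode, an invalid cast raises an error instead of returning NULL
spark.sql("SELECT CAST('abc' AS INT)").show()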

Spark SQL DDL

The syntax for creating a database and a table using Spark SQL is presented as follows:

CREATE DATABASE IF NOT EXISTS feature_store;
CREATE TABLE IF NOT EXISTS feature_store.retail_features
USING DELTA
LOCATION '/FileStore/shared_uploads/delta/retail_features.delta';

In the previous code block, we do the following:

  • First, we create a database if it doesn't...
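If you prefer to issue the same DDL from a PySpark notebook or script, it can be wrapped in spark.sql() calls; this sketch assumes a SparkSession named spark and a Delta Lake-enabled environment:

spark.sql("CREATE DATABASE IF NOT EXISTS feature_store")
spark.sql("""
    CREATE TABLE IF NOT EXISTS feature_store.retail_features
    USING DELTA
    LOCATION '/FileStore/shared_uploads/delta/retail_features.delta'
""")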

Optimizing Spark SQL performance

In the previous section, you learned how the Catalyst optimizer optimizes user code by running it through a set of optimization steps until an optimal execution plan is derived. To take advantage of the Catalyst optimizer, it is recommended to use Spark code that leverages the Spark SQL engine (that is, the Spark SQL and DataFrame APIs) and to avoid RDD-based Spark code as much as possible. The Catalyst optimizer has no visibility into user-defined functions (UDFs), so users can end up writing sub-optimal code that degrades performance. It is therefore recommended to use built-in functions instead of UDFs, or to define functions in Scala or Java and then use them from the SQL and Python APIs.
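As an illustrative sketch (the column and data are made up, and a SparkSession named spark is assumed), the two statements below produce the same result, but only the built-in function is transparent to Catalyst; the Python UDF also pays the cost of serializing rows to and from a Python worker:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: a black box to Catalyst, plus Python serialization overhead
to_upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(to_upper_udf("name").alias("name_upper")).show()

# Built-in function: Catalyst can analyze and optimize this expression
df.select(F.upper("name").alias("name_upper")).show()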

Though Spark SQL supports file-based formats such as CSV and JSON, it is recommended to use serialized data formats such as Parquet, Avro, and ORC. Semi-structured formats such as CSV or JSON incur performance costs, firstly during the schema inference phase, as they...
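A minimal sketch of the difference (the file paths are hypothetical): reading CSV with schema inference forces an extra pass over the data, whereas Parquet stores the schema with the data itself:

# CSV: schema inference requires an additional scan over the raw text
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/tmp/retail_features.csv"))

# Parquet: columnar, compressed, and self-describing (schema in the footer)
csv_df.write.mode("overwrite").parquet("/tmp/retail_features.parquet")
parquet_df = spark.read.parquet("/tmp/retail_features.parquet")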

Summary

In this chapter, you learned about SQL as a declarative language that has been universally accepted as the language for structured data analysis because of its ease of use and expressiveness. You learned about the basic constructs of SQL, including its DDL and DML dialects. You were introduced to the Spark SQL engine as the unified distributed query engine that powers both the Spark SQL and DataFrame APIs. SQL optimizers in general were introduced, and Spark's own query optimizer, Catalyst, was presented along with how it takes a Spark SQL query and converts it into JVM bytecode. A Spark SQL language reference was also presented, covering the most important DDL and DML statements, with examples. Finally, a few performance optimization techniques were discussed to help you get the best out of Spark SQL for all your data analysis needs. In the next chapter, we will extend our Spark SQL knowledge and see how external...
