Understanding performance improvements using whole-stage code generation
In this section, we first present a high-level overview of whole-stage code generation in Spark SQL, followed by a set of examples showing the improvements it brings to various JOINs through Catalyst's code generation feature.
After we have an optimized query plan, it needs to be converted into a DAG of RDDs for execution on the cluster. We use the following example to explain the basic concepts of Spark SQL whole-stage code generation:
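The query assumes a Parquet-backed orders table with customer_id and good_id columns. The minimal sketch below creates such a table for experimentation; the path, schema, and row values are illustrative assumptions rather than the dataset used in this chapter. With the temporary view registered, the example query and its extended plan follow:

scala> import spark.implicits._   // already in scope in spark-shell
scala> case class Order(customer_id: Long, good_id: Long)
scala> // Write a tiny Parquet dataset and expose it as a temporary view named "orders"
scala> Seq(Order(26333955L, 101L), Order(11111111L, 102L)).toDF().write.mode("overwrite").parquet("/tmp/orders")
scala> spark.read.parquet("/tmp/orders").createOrReplaceTempView("orders")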
scala> sql("select count(*) from orders where customer_id = 26333955").explain(true)
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#45L]
+- Project
   +- Filter (isnotnull(customer_id#42L) && (customer_id#42L = 26333955))
      +- Relation[customer_id#42L,good_id#43L] parquet

The preceding optimized logical plan can be viewed as a sequence of Scan, Filter, Project, and Aggregate operations, as shown in the following figure:

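To see where whole-stage code generation actually applies to this query, we can look at the physical plan, where operators fused into a single generated function are prefixed with an asterisk (*), and we can dump the generated Java source with Spark's debug helpers. The following is a minimal sketch; the exact output varies with the Spark version and the data:

scala> sql("select count(*) from orders where customer_id = 26333955").explain()
// Physical-plan operators prefixed with '*' run inside one generated function

scala> import org.apache.spark.sql.execution.debug._
scala> sql("select count(*) from orders where customer_id = 26333955").debugCodegen()
// Prints the Java code generated for each whole-stage code generation subtree

For comparison, whole-stage code generation can also be turned off by setting the spark.sql.codegen.wholeStage configuration property to false.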
Traditional databases will typically execute the preceding...