Understanding performance improvements using whole-stage code generation
In this section, we first present a high-level overview of whole-stage code generation in Spark SQL, followed by a set of examples showing the improvements it brings to various JOINs through Catalyst's code generation feature.
After we have an optimized query plan, it needs to be converted into a DAG of RDDs for execution on the cluster. We use the following example to explain the basic concepts of Spark SQL whole-stage code generation:
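The query assumes a Parquet-backed orders table with customer_id and good_id columns. The minimal sketch below creates such a table for experimentation; the path, schema, and row values are illustrative assumptions rather than the dataset used in this chapter. With the temporary view registered, the example query and its extended plan follow:

scala> import spark.implicits._   // already in scope in spark-shell
scala> case class Order(customer_id: Long, good_id: Long)
scala> // Write a tiny Parquet dataset and expose it as a temporary view named "orders"
scala> Seq(Order(26333955L, 101L), Order(11111111L, 102L)).toDF().write.mode("overwrite").parquet("/tmp/orders")
scala> spark.read.parquet("/tmp/orders").createOrReplaceTempView("orders")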
scala> sql("select count(*) from orders where customer_id = 26333955").explain(true)
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#45L]
+- Project
   +- Filter (isnotnull(customer_id#42L) && (customer_id#42L = 26333955))
      +- Relation[customer_id#42L,good_id#43L] parquet

The preceding optimized logical plan can be viewed as a sequence of Scan, Filter, Project, and Aggregate operations, as shown in the following figure:

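To see where whole-stage code generation actually applies to this query, we can look at the physical plan, where operators fused into a single generated function are prefixed with an asterisk (*), and we can dump the generated Java source with Spark's debug helpers. The following is a minimal sketch; the exact output varies with the Spark version and the data:

scala> sql("select count(*) from orders where customer_id = 26333955").explain()
// Physical-plan operators prefixed with '*' run inside one generated function

scala> import org.apache.spark.sql.execution.debug._
scala> sql("select count(*) from orders where customer_id = 26333955").debugCodegen()
// Prints the Java code generated for each whole-stage code generation subtree

For comparison, whole-stage code generation can also be turned off by setting the spark.sql.codegen.wholeStage configuration property to false.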
Traditional databases will typically execute the preceding...