Chapter 5. SparkSQL, DataFrames, and DataSets
In this chapter, we'll spend some time talking about SparkSQL. This is becoming an increasingly important part of Spark; it basically lets you deal with structured data. Instead of RDDs that contain arbitrary information in every row, we're going to give the rows some structure. This will let us do a lot of different things, such as treat our data like little database tables. So, we're going to call these structured objects DataFrames and DataSets from now on, and you can actually perform SQL queries and SQL-like operations on them, which can be pretty powerful.
What is structured data? Basically, it means that when we extend the concept of an RDD to a DataFrame object, we provide the data in the RDD with some structure.
One way to think of a DataFrame is that it's fundamentally an RDD of row objects. Because we have distinct columns in these rows, we can construct SQL queries and issue commands in a SQL-like style, which we'll see shortly. And because we have an actual schema associated with the DataFrame, Spark can do even more optimization than it normally would: it can do query optimization, just like a SQL database does, when it tries to figure out the optimal plan for executing your Spark script. Another nice thing is that you can directly read from and write to JSON files or JDBC-compliant databases. This means that if you do have source data that's already in a structured format, for example, inside a relational database or inside...
Executing SQL commands and SQL-style functions on a DataFrame
Alright, open up the sparksql.py file that's included in the download files for this book. Let's take a look at it as a real-world example of using SparkSQL in Spark 2.0. You should see the following code in your editor:
Notice that we're importing a few things here: the SparkSession object and the Row object. The SparkSession object is basically Spark 2.0's way of creating a context to work with SparkSQL. We'll also import collections here:
from pyspark.sql import SparkSession
from pyspark.sql import Row
import collections
Earlier, we used to create SparkContext objects, but now we'll create a SparkSession object:
# Create a SparkSession (Note, the config section is only for Windows!)
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("SparkSQL").getOrCreate()
So what we're doing here is creating something called spark that's going to be a SparkSession object. We'll use...
Using DataFrames instead of RDDs
Just to drive home how you can actually use DataFrames instead of RDDs, let's go back to one of our earlier RDD exercises and do it with DataFrames instead. This will illustrate how using DataFrames can make things simpler. We'll revisit the example where we figured out the most popular movies based on the ratings information in the MovieLens dataset. If you want to open the file, you'll find it in the download package as popular-movies-dataframe.py, or you can just follow along, typing it in as you go. This is what your file should look like if you open it in your IDE:
Let's go through this in detail. First comes our import statements:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions
We start by importing SparkSession, which again is our new API in Spark 2.0 for doing DataFrame and DataSet operations. We'll also import the Row class and functions, so we can do SQL functions...
It is interesting how you can actually use these high-level APIs using SparkSQL to save on coding. For example, just look at this one line of code:
topMovieIDs = movieDataset.groupBy("movieID").count().orderBy("count", ascending=False).cache()
Remember that to do the same thing earlier, we had to kind of jump through some hoops and create key/value RDDs, reduce the RDD, and do all sorts of things that weren't very intuitive. Using SparkSQL and DataSets, however, you can do these exercises in a much more intuitive manner. At the same time, you allow Spark the opportunity to represent its data more compactly and optimize those queries in a more efficient manner.
Again, DataFrames are the way of the future with Spark. If you do have the choice between using an RDD and a DataFrame to do the same problem, opt for a DataFrame. It is not only more efficient, but it will also give you more interoperability with more components within Spark going forward. So there you have it: Spark SQL DataFrames...