Chapter 5. SparkSQL, DataFrames, and DataSets
In this chapter, we'll spend some time talking about SparkSQL. This is becoming an increasingly important part of Spark; it basically lets you deal with structured data. Instead of RDDs that contain arbitrary information in every row, we're going to give the rows some structure. This will let us do a lot of different things, such as treat our data like little database tables. So, we're going to call these structured objects DataFrames and DataSets from now on, and you can actually perform SQL queries and SQL-like operations on them, which can be pretty powerful.
What is structured data? Basically, it means that when we extend the concept of an RDD to a DataFrame object, we provide the data in the RDD with some structure.
One way to think of a DataFrame is that it's fundamentally an RDD of row objects. Because we have distinct columns in these rows, we can construct SQL queries and issue commands in a SQL-like style, which we'll see shortly. And because we have an actual schema associated with the DataFrame, Spark can do even more optimization than it normally would: it can do query optimization, just like a SQL database does, when it tries to figure out the optimal plan for executing your Spark script. Another nice thing is that you can directly read from and write to JSON files or JDBC-compliant databases. This means that if you do have source data that's already in a structured format, for example, inside a relational database or inside...
Executing SQL commands and SQL-style functions on a DataFrame
Alright, open up the sparksql.py file that's included in the download files for this book. Let's take a look at it as a real-world example of using SparkSQL in Spark 2.0. You should see the following code in your editor:
Notice that we're importing a few things here: the SparkSession object and the Row object. The SparkSession object is basically Spark 2.0's way of creating a context to work with SparkSQL. We'll also import collections here:
from pyspark.sql import SparkSession
from pyspark.sql import Row
import collections
Earlier, we used to create SparkContext objects, but now we'll create a SparkSession object:
# Create a SparkSession (Note, the config section is only for Windows!)
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("SparkSQL").getOrCreate()
So what we're doing here is creating something called spark that's going to be a SparkSession object. We'll use...
Using DataFrames instead of RDDs
Just to drive home how you can actually use DataFrames instead of RDDs, let's go back to one of our earlier RDD exercises and do it with DataFrames instead. This will illustrate how using DataFrames can make things simpler. We'll revisit the example where we figured out the most popular movies based on the ratings information in the MovieLens dataset. If you want to open the file, you'll find it in the download package as popular-movies-dataframe.py, or you can just follow along, typing it in as you go. This is what your file should look like if you open it in your IDE:
Let's go through this in detail. First comes our import statements:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions
We start by importing SparkSession, which again is our new API in Spark 2.0 for doing DataFrame and DataSet operations. We'll also import the Row class and functions, so we can do SQL functions...
It is interesting how you can actually use these high-level APIs using SparkSQL to save on coding. For example, just look at this one line of code:
topMovieIDs = movieDataset.groupBy("movieID").count().orderBy("count", ascending=False).cache()
Remember that to do the same thing earlier, we had to kind of jump through some hoops and create key/value RDDs, reduce the RDD, and do all sorts of things that weren't very intuitive. Using SparkSQL and DataSets, however, you can do these exercises in a much more intuitive manner. At the same time, you allow Spark the opportunity to represent its data more compactly and optimize those queries in a more efficient manner.
Again, DataFrames are the way of the future with Spark. If you do have the choice between using an RDD and a DataFrame to do the same problem, opt for a DataFrame. It is not only more efficient, but it will also give you more interoperability with more components within Spark going forward. So there you have it: Spark SQL DataFrames...