
Chapter 8. Working with Spark SQL

This chapter will introduce Spark SQL and related concepts, such as the dataframe and the dataset. Schemas and advanced SQL functions will be discussed from the Apache Spark perspective, and writing custom user-defined functions (UDFs) and working with various data sources will also be touched upon.

This chapter uses the Java APIs to create SQLContext/SparkSession and to build dataframes/datasets from a Java RDD, both for raw data, such as CSV, and for structured data, such as JSON.

SQLContext and HiveContext


Prior to Spark 2.0, SparkContext was the entry point for Spark applications, and SQLContext and HiveContext were the entry points for running Spark SQL. HiveContext is a superset of SQLContext. An SQLContext needs to be created to run Spark SQL on an RDD.

The SQLContext provides connectivity to various data sources. Data can be read from those data sources, and Spark SQL can be executed to transform the data as required. An SQLContext can be created from a JavaSparkContext as follows:

// conf is an existing SparkConf instance for the application
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(javaSparkContext);

The SQLContext is a wrapper over the SparkContext that provides SQL functionality and functions for working with structured data. It comes with a basic set of SQL functions.
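
For example, here is a minimal sketch of reading a data source and querying it with SQL through the SQLContext (the people.json file and its columns are hypothetical, and the snippet assumes the org.apache.spark.sql.Dataset and org.apache.spark.sql.Row imports; in Spark 2.x, the SQLContext read APIs return a Dataset<Row>):

// Read a hypothetical JSON file and expose it to SQL as a temporary view
Dataset<Row> people = sqlContext.read().json("people.json");
people.createOrReplaceTempView("people");
// Run a SQL query against the view and print the result
Dataset<Row> adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18");
adults.show();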

The HiveContext, being a superset of SQLContext, provides many more functions. The HiveContext lets you write queries using the HiveQL parser, which means all of the Hive functions can be...
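
As a minimal sketch (assuming the spark-hive dependency is on the classpath and reusing the javaSparkContext created above), a HiveContext can be instantiated in much the same way:

// HiveContext (org.apache.spark.sql.hive.HiveContext) wraps the same JavaSparkContext
HiveContext hiveContext = new HiveContext(javaSparkContext);
// Queries are parsed with the HiveQL parser
hiveContext.sql("SHOW TABLES").show();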

Dataframe and dataset


Dataframes were introduced in Spark 1.3. The dataframe builds on the concept of providing a schema over the data. An RDD basically consists of raw data; although it provides various functions to process that data, it is a collection of Java objects and therefore incurs the overhead of garbage collection and serialization. Also, Spark SQL concepts can only be leveraged if the data carries a schema. So, earlier versions of Spark provided another flavor of RDD called the SchemaRDD.

SchemaRDD

As its name suggests, it is an RDD with a schema. Since it contains a schema, relational queries can be run on the data alongside the basic RDD functions. A SchemaRDD can be registered as a table so that SQL queries can be executed on it using Spark SQL. It was available in earlier versions of Spark; however, with Spark version 1.3, the SchemaRDD was deprecated and the dataframe was introduced.

Dataframe

Despite being an evolved version of the SchemaRDD, the dataframe comes with big differences compared to RDDs. It was introduced...
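
For illustration only (the sparkSession instance and the employees.json file are assumptions, not taken from the chapter), a dataframe, which Spark 2.x represents as Dataset<Row>, can be created from a structured source and queried with relational-style operations:

// Read a hypothetical JSON file; the schema is inferred from the data
Dataset<Row> empDf = sparkSession.read().json("employees.json");
empDf.printSchema();
// Column-based filtering, one of the relational operations a dataframe adds over an RDD
empDf.filter(empDf.col("salary").gt(50000)).show();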

Spark SQL operations


Working in Spark SQL primarily happens in three stages: the creation of a dataset, the application of SQL operations, and finally the persistence of the dataset. We have so far been able to create a dataset from an RDD and other data sources (refer to Chapter 5, Working with Data and Storage) and also to persist the dataset, as discussed in the previous section. Now let's look at some of the ways in which SQL operations can be applied to a dataset.
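
The following is a hedged sketch of the three stages put together (the employees.json source, the view name, the column names, and the output path are hypothetical; SaveMode comes from org.apache.spark.sql.SaveMode):

// 1. Create the dataset from a data source
Dataset<Row> empDs = sparkSession.read().json("employees.json");
empDs.createOrReplaceTempView("employee");
// 2. Apply SQL operations on the temporary view
Dataset<Row> avgSalary = sparkSession.sql("SELECT deptId, avg(salary) AS avg_sal FROM employee GROUP BY deptId");
// 3. Persist the resulting dataset
avgSalary.write().mode(SaveMode.Overwrite).parquet("avg_salary_by_dept");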

Untyped dataset operation

Once we have created the dataset, Spark provides a couple of handy functions that perform basic SQL operations and analysis, such as the following:

  • show(): This displays the top 20 rows of the dataset in a tabular form. Strings of more than 20 characters will be truncated, and all cells will be aligned right:
emp_ds.show();

Another variant of the show() function allows the user to enable or disable the 20-character limit by passing a Boolean; passing false disables truncation of the strings:

emp_ds.show(false...

Hive integration


Spark integrates really well with Hive, though it does not bundle most of Hive's dependencies and expects them to be available on its classpath. The following steps explain how to integrate Spark with Hive:

  1. Place hive-site.xml, core-site.xml, and hdfs-site.xml files in the SPARK_HOME/conf folder.
  2. Instantiate SparkSession with Hive support. If hive-site.xml is not configured, the context automatically creates metastore_db in the current directory and creates a warehouse directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse:
SparkSession sparkSession = SparkSession
  .builder()
  .master("local")
  .config("spark.sql.warehouse.dir","Path of Warehouse")
  .appName("DatasetOperations")
  .enableHiveSupport()
  .getOrCreate();
  3. Once we have created a SparkSession with Hive support enabled, we can proceed to use it with the added benefits of query support from Hive. One way to identify the difference between Hive query function support...
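
To round off these steps, here is a hedged sketch of using the Hive-enabled SparkSession (the table name and the local input file are hypothetical):

// Create a Hive table, load data into it, and query it through the same SparkSession
sparkSession.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING)");
sparkSession.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee");
sparkSession.sql("SELECT * FROM employee").show();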

Summary


In this chapter, we discussed SparkSession, which is the single entry point for Spark in the 2.x versions. We talked about the unification of the dataset and dataframe APIs. Then, we created a dataset from an RDD and discussed various dataset operations with examples. We also learnt how to execute Spark SQL operations on a dataset by creating temporary views. Last but not least, we learnt how to create UDFs in Spark SQL with examples.

In the next chapter, we will learn how to process real-time streams with Spark.
