Chapter 4. Creating a SparkSession Object

This chapter covers how to create a SparkSession object for your cluster. A SparkSession object represents the connection to a Spark cluster (local or remote) and provides the entry point for interacting with Spark. We need to create a SparkSession object so that we can interact with Spark and distribute our jobs. In Chapter 2, Using the Spark Shell, we interacted with Spark through the Spark shell, which created a SparkSession object and a SparkContext object for us. With those in hand, you can create RDDs, broadcast variables, and accumulators, and actually do fun things with your data. The Spark shell serves as an example of how to interact with the Spark cluster through the SparkSession and SparkContext objects.

For a client to establish a connection to the Spark cluster, the SparkSession object needs some basic information, which is given here:

  • Master URL: This URL can be local[n] for local mode, spark://[sparkip] for a standalone Spark cluster, or mesos://path for a Mesos cluster (a quick sketch of these options follows this list)

  • ...
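As a quick, minimal sketch of how these master URLs are supplied in practice (the application name, host names, and ports here are placeholder assumptions, not values from the book):

import org.apache.spark.sql.SparkSession

// Local mode with four worker threads (hypothetical application name).
val localSession = SparkSession.builder
  .master("local[4]")
  .appName("master-url-example")
  .getOrCreate()

// For a standalone Spark cluster, you would instead pass something like:
//   .master("spark://spark-master-host:7077")
// and for a Mesos cluster:
//   .master("mesos://mesos-master-host:5050")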

SparkSession versus SparkContext


You may have noticed that we are using both SparkSession and SparkContext, and this is not an error. Let's revisit the annals of Spark history for perspective. It is important to understand where we came from, as you will hear about these connection objects for some time to come.

Prior to Spark 2.0.0, the three main connection objects were SparkContext, SQLContext, and HiveContext. The SparkContext object was the connection to a Spark execution environment and was used to create RDDs and other constructs, SQLContext worked with Spark SQL on top of a SparkContext, and HiveContext interacted with Hive stores.

Spark 2.0.0 introduced Datasets/DataFrames as the main distributed data abstraction interface and the SparkSession object as the entry point to a Spark execution environment. Appropriately, the SparkSession class is found in the org.apache.spark.sql.SparkSession namespace (Scala) or pyspark.sql.SparkSession (Python). A few points to note are as follows:

  • In Scala and Java...

Building a SparkSession object


In Scala and Python programs, you build a SparkSession object with the following builder pattern:

val sparkSession = SparkSession.builder.master(master_path).appName("application name").config("optional.configuration.key", "value").getOrCreate()

Tip

While you can hardcode all these values, it's better to read them from the environment with reasonable defaults. This approach provides maximum flexibility to run the code in a changing environment without having to recompile. Using local as the default value for the master makes it easy to launch your application in a test environment locally. By carefully selecting the defaults, you can avoid having to overspecify this.
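A minimal sketch of this pattern, assuming an environment variable (SPARK_MASTER here is a name chosen for illustration, not something Spark defines) with local[*] as the fallback:

import org.apache.spark.sql.SparkSession

// Read the master URL from the environment; default to local mode for tests.
val masterPath = sys.env.getOrElse("SPARK_MASTER", "local[*]")

val sparkSession = SparkSession.builder
  .master(masterPath)
  .appName(sys.env.getOrElse("SPARK_APP_NAME", "fdps-example"))
  .getOrCreate()

With local[*] as the default, the same code runs unchanged on a laptop and, with the environment variable set, against a real cluster.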

The spark-shell/pyspark shells create the SparkSession object automatically and assign it to the spark variable.

The SparkSession object has the SparkContext object, which you can access with spark.sparkContext.

As we will see later, the SparkSession object unifies more than the context; it also unifies...

SparkContext - metadata


The SparkContext object has a set of metadata that I found useful. The version number, application name, and memory available are useful pieces of information. At the start of a Spark program, I usually display/log the version number.

  • appName: This value is the application name. If you have established a naming convention, this field can be useful at runtime.

  • getConf: This value returns the configuration information.

  • getExecutorMemoryStatus: This value retrieves the memory details of the executors. It can be useful if you want to check memory usage; as Spark is distributed, the values do not necessarily mean that you are out of memory.

  • master: This value is the URL of the master.

  • version: I found this value very useful, especially while testing with different versions.
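As a small sketch of how these values might be logged at the start of a program (the session settings and println format are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("metadata-example")
  .getOrCreate()
val sc = spark.sparkContext

// Display the version and other metadata at program start.
println(s"Spark version    : ${sc.version}")
println(s"Application name : ${sc.appName}")
println(s"Master           : ${sc.master}")
println(s"Executor memory  : ${sc.getExecutorMemoryStatus}")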

Execute the following command from the shell:

cd ~/Downloads/spark-2.0.0    # or wherever you have Spark installed
bin/spark-shell

Then, in the Spark shell, check the version:

scala> spark.version
res0: String = 2.0.0

Shared Java and Scala APIs


Once you have a SparkSession object created, it will serve as your main entry point. In the next chapter, you will learn how to use the SparkSession object to load and save data. You can also use SparkSession.sparkContext to launch more Spark jobs and add or remove dependencies. Some of the non-data-driven methods you can use on the SparkSession.sparkContext object are shown here:

  • addJar(path): This method adds the JAR file for all the future jobs that will run through the SparkContext object.

  • addFile(path): This method distributes the file so that it is downloaded on all the nodes of the cluster.

  • listFiles/listJars: These methods list all the currently added files/JARs.

  • stop(): This method shuts down the SparkContext.

  • clearFiles(): This method removes the files so that new nodes will not download them.

  • clearJars(): This method removes the JARs from being required for future jobs.
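A minimal, self-contained sketch of a few of these calls (the file and JAR paths are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("context-api-example")
  .getOrCreate()
val sc = spark.sparkContext

// Ship a data file and a JAR to the nodes for subsequent jobs
// (both paths are placeholders; point them at real files before running).
sc.addFile("/tmp/lookup-table.csv")
sc.addJar("/tmp/extra-library.jar")

// Inspect what has been distributed so far.
sc.listFiles().foreach(println)
sc.listJars().foreach(println)

// Shut down the context when the application is done.
spark.stop()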

Python


The Python SparkSession object behaves in the same way as its Scala counterpart. We can run almost the same commands as in the previous section, within the constraints of language semantics:

bin/pyspark

You will see a session similar to the following:

>>> spark.version
u'2.0.0'
>>> sc.version
u'2.0.0'
>>> sc.appName
u'PySparkShell'
>>> sc.master
u'local[*]'
>>> sc.getMemoryStatus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'getMemoryStatus'
>>> from pyspark.conf import SparkConf
>>> conf = SparkConf()
>>> conf.toDebugString()
u'spark.app.name=PySparkShell\nspark.master=local[*]\nspark.submit.deployMode=client'
>>> 
>>> exit()  # to exit the pyspark shell

The PySpark instance does not have the getExecutorMemoryStatus call yet, but we can get some information with the toDebugString() call on a SparkConf object.

iPython


Finally, let's fire up iPython and interact with the SparkContext object. As mentioned in Chapter 3, Building and Running a Spark Application, refer to the iPython site (http://jupyter.readthedocs.org/en/latest/install.html) for instructions on installing Jupyter and iPython.

First, change the directory to fdps-v3, where you would have downloaded the code and data for this book:

cd ~/fdps-v3

The command to start iPython is as follows:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/Downloads/spark-2.0.0/bin/pyspark

The iPython notebook will be launched in the web browser, and you will see a list of iPython notebooks.

Click on the 000-PreFlightCheck.ipynb notebook:

Run the first cell using Shift + Enter. You will see the results, including the Python version, Spark version, and so on. The notebook has more cells, which we will cover in the next few chapters.

Now that you are able to create a...

Summary


In this chapter, we covered how to connect to our Spark cluster using the SparkSession and SparkContext objects. We saw how the APIs are uniform across languages such as Scala and Python. We also learned a bit about the interactive shell and iPython. Using this knowledge, we will look at the different data sources we can use to load data into Spark in the next chapter.
