Chapter 4. Creating a SparkSession Object

This chapter covers how to create a SparkSession object for your cluster. A SparkSession object represents the connection to a Spark cluster (local or remote) and provides the entry point for interacting with Spark. We need to create a SparkSession object so that we can interact with Spark and distribute our jobs. In Chapter 2, Using the Spark Shell, we interacted with Spark through the Spark shell, which created a SparkSession object and a SparkContext object for us. With those in hand, you can create RDDs, broadcast variables, and accumulators, and actually do fun things with your data. The Spark shell serves as an example of how to interact with the Spark cluster through the SparkSession and SparkContext objects.

For a client to establish a connection to the Spark cluster, the SparkSession object needs some basic information, which is given here:

  • Master URL: This URL can be local[n] for local mode, spark://[sparkip] for a standalone Spark cluster, or mesos://path for a Mesos cluster (a quick sketch of these options follows this list)

  • ...
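As a quick, minimal sketch of how these master URLs are supplied in practice (the application name, host names, and ports here are placeholder assumptions, not values from the book):

import org.apache.spark.sql.SparkSession

// Local mode with four worker threads (hypothetical application name).
val localSession = SparkSession.builder
  .master("local[4]")
  .appName("master-url-example")
  .getOrCreate()

// For a standalone Spark cluster, you would instead pass something like:
//   .master("spark://spark-master-host:7077")
// and for a Mesos cluster:
//   .master("mesos://mesos-master-host:5050")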

SparkSession versus SparkContext


You may have noticed that we are using both SparkSession and SparkContext, and this is not an error. Let's revisit the annals of Spark history for perspective. It is important to understand where we came from, as you will hear about these connection objects for some time to come.

Prior to Spark 2.0.0, the three main connection objects were SparkContext, SQLContext, and HiveContext. The SparkContext object was the connection to a Spark execution environment and was used to create RDDs and other constructs, SQLContext worked with Spark SQL on top of a SparkContext, and HiveContext interacted with Hive stores.

Spark 2.0.0 introduced Datasets/DataFrames as the main distributed data abstraction interface and the SparkSession object as the entry point to a Spark execution environment. Appropriately, the SparkSession class is found in the org.apache.spark.sql.SparkSession namespace (Scala) or pyspark.sql.SparkSession (Python). A few points to note are as follows:

  • In Scala and Java...

Building a SparkSession object


In Scala and Python programs, you build a SparkSession object with the following builder pattern:

val sparkSession = SparkSession.builder.master(master_path).appName("application name").config("optional.configuration.key", "value").getOrCreate()

Tip

While you can hardcode all these values, it's better to read them from the environment with reasonable defaults. This approach provides maximum flexibility to run the code in a changing environment without having to recompile. Using local as the default value for the master makes it easy to launch your application in a test environment locally. By carefully selecting the defaults, you can avoid having to overspecify this.
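A minimal sketch of this pattern, assuming an environment variable (SPARK_MASTER here is a name chosen for illustration, not something Spark defines) with local[*] as the fallback:

import org.apache.spark.sql.SparkSession

// Read the master URL from the environment; default to local mode for tests.
val masterPath = sys.env.getOrElse("SPARK_MASTER", "local[*]")

val sparkSession = SparkSession.builder
  .master(masterPath)
  .appName(sys.env.getOrElse("SPARK_APP_NAME", "fdps-example"))
  .getOrCreate()

With local[*] as the default, the same code runs unchanged on a laptop and, with the environment variable set, against a real cluster.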

The spark-shell/pyspark shells create the SparkSession object automatically and assign it to the spark variable.

The SparkSession object has the SparkContext object, which you can access with spark.sparkContext.

As we will see later, the SparkSession object unifies more than the context; it also unifies...

SparkContext - metadata


The SparkContext object has a set of metadata that I found useful. The version number, application name, and memory available are useful pieces of information. At the start of a Spark program, I usually display/log the version number.

  • appName: This value is the application name. If you have established a naming convention, this field can be useful at runtime.

  • getConf: This value returns the configuration information.

  • getExecutorMemoryStatus: This value retrieves the memory details of the executors. It can be useful if you want to check memory usage; as Spark is distributed, the values do not necessarily mean that you are out of memory.

  • master: This value is the URL of the master.

  • version: I found this value very useful, especially while testing with different versions.
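As a small sketch of how these values might be logged at the start of a program (the session settings and println format are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("metadata-example")
  .getOrCreate()
val sc = spark.sparkContext

// Display the version and other metadata at program start.
println(s"Spark version    : ${sc.version}")
println(s"Application name : ${sc.appName}")
println(s"Master           : ${sc.master}")
println(s"Executor memory  : ${sc.getExecutorMemoryStatus}")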

Execute the following command from the shell:

cd ~/Downloads/spark-2.0.0    # or wherever you have Spark installed
bin/spark-shell

Then, in the Spark shell, check the version:

scala> spark.version
res0: String = 2.0.0

Shared Java and Scala APIs


Once you have a SparkSession object created, it will serve as your main entry point. In the next chapter, you will learn how to use the SparkSession object to load and save data. You can also use SparkSession.sparkContext to launch more Spark jobs and add or remove dependencies. Some of the non-data-driven methods you can use on the SparkSession.sparkContext object are shown here:

  • addJar(path): This method adds the JAR file for all the future jobs that will run through the SparkContext object.

  • addFile(path): This method distributes the file so that it is downloaded on all the nodes of the cluster.

  • listFiles/listJars: These methods list all the currently added files/JARs.

  • stop(): This method shuts down the SparkContext.

  • clearFiles(): This method removes the files so that new nodes will not download them.

  • clearJars(): This method removes the JARs from being required for future jobs.
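A minimal, self-contained sketch of a few of these calls (the file and JAR paths are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("context-api-example")
  .getOrCreate()
val sc = spark.sparkContext

// Ship a data file and a JAR to the nodes for subsequent jobs
// (both paths are placeholders; point them at real files before running).
sc.addFile("/tmp/lookup-table.csv")
sc.addJar("/tmp/extra-library.jar")

// Inspect what has been distributed so far.
sc.listFiles().foreach(println)
sc.listJars().foreach(println)

// Shut down the context when the application is done.
spark.stop()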

Python


The Python SparkSession object behaves in the same way as its Scala counterpart. We can run almost the same commands as in the previous section, within the constraints of language semantics:

bin/pyspark

You will see a session similar to the following:

>>> spark.version
u'2.0.0'
>>> sc.version
u'2.0.0'
>>> sc.appName
u'PySparkShell'
>>> sc.master
u'local[*]'
>>> sc.getMemoryStatus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'getMemoryStatus'
>>> from pyspark.conf import SparkConf
>>> conf = SparkConf()
>>> conf.toDebugString()
u'spark.app.name=PySparkShell\nspark.master=local[*]\nspark.submit.deployMode=client'
>>> 
>>> exit()  # to exit the pyspark shell

The PySpark instance does not have the getExecutorMemoryStatus call yet, but we can get some information with the toDebugString() call on a SparkConf object.

iPython


Finally, let's fire up iPython and interact with the SparkContext object. As mentioned in Chapter 3, Building and Running a Spark Application, refer to the iPython site (http://jupyter.readthedocs.org/en/latest/install.html) for instructions on installing Jupyter and iPython.

First, change the directory to fdps-v3, where you would have downloaded the code and data for this book:

cd ~/fdps-v3

The command to start iPython is as follows:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/Downloads/spark-2.0.0/bin/pyspark

The iPython notebook will be launched in the web browser, and you will see a list of iPython notebooks.

Click on the 000-PreFlightCheck.ipynb notebook:

Run the first cell using Shift + Enter. You will see the results, including the Python version, Spark version, and so on. The notebook has more cells, which we will cover in the next few chapters.

Now that you are able to create a...

Summary


In this chapter, we covered how to connect to our Spark cluster using the SparkSession and SparkContext objects. We saw how the APIs are uniform across languages such as Scala and Python. We also learned a bit about the interactive shell and iPython. Using this knowledge, we will look at the different data sources we can use to load data into Spark in the next chapter.
