The SparkContext object carries a set of metadata that I have found useful: the version number, the application name, and the memory available. At the start of a Spark program, I usually display or log the version number.
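As a small illustrative sketch: when pyspark is pip-installed, recent releases expose a package-level version string, which you can log before a SparkContext even exists (the sketch falls back gracefully when pyspark or the attribute is absent):

```python
# Hedged sketch: read the Spark version at program startup.
# Assumes pyspark is pip-installed; falls back to None otherwise.
try:
    import pyspark
    spark_version = getattr(pyspark, "__version__", None)
except ImportError:
    spark_version = None

print("Spark version:", spark_version)
```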
Execute the following command from the shell:
cd ~/Downloads/spark-2.0.0 (or wherever you have Spark installed)
bin/spark-shell
Refer to the following screenshot:
scala> spark.version
res0: String...
Shared Java and Scala APIs
Once you have a SparkSession object created, it serves as your main entry point. In the next chapter, you will learn how to use the SparkSession object to load and save data. You can also use SparkSession.sparkContext to launch more Spark jobs and to add or remove dependencies. Some of the non-data-driven methods you can use on the SparkSession.sparkContext object are shown here:
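A minimal Python sketch of a few such metadata calls (the Scala calls are analogous). This assumes pyspark and a local Java runtime are installed, and the application name "MetadataDemo" is illustrative:

```python
# Hedged sketch: non-data-driven metadata calls on the SparkContext.
# Assumes pyspark and a local Java runtime are available; "MetadataDemo"
# is an illustrative app name. Falls back to an empty dict otherwise.
try:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("MetadataDemo").getOrCreate()
    sc = spark.sparkContext
    info = {
        "version": sc.version,  # Spark version string
        "appName": sc.appName,  # application name
        "master": sc.master,    # master URL, e.g. local[*]
    }
    spark.stop()
except Exception:  # pyspark or Java may be unavailable in some environments
    info = {}

print(info)
```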
The Python SparkSession object behaves in the same way as its Scala counterpart. We can run almost the same commands as in the previous section, within the constraints of language semantics:
bin/pyspark
Refer to the following screenshot:
>>> spark.version
u'2.0.0'
>>> sc.version
u'2.0.0'
>>> sc.appName
u'PySparkShell'
>>> sc.master
u'local[*]'
>>> sc.getMemoryStatus
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'getMemoryStatus'
>>> from pyspark.conf import SparkConf
>>> conf = SparkConf()
>>> conf.toDebugString()
u'spark.app.name=PySparkShell\nspark.master=local[*]\nspark.submit.deployMode=client'
>>>
>>> exit() (to exit the pyspark shell)
The PySpark SparkContext does not have the getExecutorMemoryStatus call yet, but we can get some of the same information with the .toDebugString call.
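Since .toDebugString returns newline-separated key=value pairs, extracting individual settings is straightforward. A small sketch, using the sample output captured in the shell session above:

```python
# Sketch: SparkConf.toDebugString() returns newline-separated key=value
# pairs, so it is easy to turn into a dict. The sample string below is
# the output captured in the pyspark session above.
debug_string = (
    "spark.app.name=PySparkShell\n"
    "spark.master=local[*]\n"
    "spark.submit.deployMode=client"
)

conf = dict(line.split("=", 1) for line in debug_string.splitlines())
print(conf["spark.master"])
```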
Finally, let's fire up IPython and interact with the SparkContext object. As mentioned in Chapter 3, Building and Running a Spark Application, refer to the Jupyter site (http://jupyter.readthedocs.org/en/latest/install.html) for installing the Jupyter and IPython system.
First, change to the fdps-v3 directory, where you downloaded the code and data for this book:
cd ~/fdps-v3
The command to start IPython is as follows:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/Downloads/spark-2.0.0/bin/pyspark
The IPython notebook server will launch in your web browser, as shown in the following screenshot, and you will see a list of notebooks:
Click on the 000-PreFlightCheck.ipynb notebook:
Run the first cell using Shift + Enter. You will see the results, including the Python version, Spark version, and so on, as shown in the preceding screenshot. The notebook has more cells, which we will see in the next few chapters.
Now that you are able to create a...
In this chapter, we covered how to connect to our Spark cluster using the SparkSession and SparkContext objects. We saw how the APIs are uniform across languages such as Scala and Python, and we learned a bit about the interactive shell and IPython. With this knowledge, in the next chapter we will look at the different data sources we can use to load data into Spark.