In this chapter, the following recipes will be covered:
- Downloading an Ubuntu Desktop image
- Installing and configuring Ubuntu with VMWare Fusion on macOS
- Installing and configuring Ubuntu with Oracle VirtualBox on Windows
- Installing and configuring Ubuntu Desktop for Google Cloud Platform
- Installing and configuring Spark and prerequisites on Ubuntu Desktop
- Integrating Jupyter notebooks with Spark
- Starting and configuring a Spark cluster
- Stopping a Spark cluster
Deep learning is the focused study of machine learning algorithms that deploy neural networks as their main method of learning. Deep learning has exploded onto the scene within just the last couple of years. Microsoft, Google, Facebook, Amazon, Apple, Tesla, and many other companies are all utilizing deep learning models in their apps, websites, and products. At the same time, Spark, an in-memory compute engine running on top of big data sources, has made it easy to process large volumes of information at record speed and with ease. In fact, Spark has now become the leading big data development tool for data engineers, machine learning engineers, and data scientists.
Since deep learning models perform better with more data, the synergy between Spark and deep learning allowed for a perfect marriage. Almost as important as the code used to execute deep learning algorithms is the work environment that enables optimal development. Many talented minds are eager to develop neural networks to help answer important questions in their research. Unfortunately, one of the greatest barriers to the development of deep learning models is access to the necessary technical resources required to learn on big data. The purpose of this chapter is to create an ideal virtual development environment for deep learning on Spark.
Spark can be set up for all types of operating systems, whether they reside on-premise or in the cloud. For our purposes, Spark will be installed on a Linux-based virtual machine with Ubuntu as the operating system. There are several advantages to using Ubuntu as the go-to virtual machine, not least of which is cost. Since they are based on open source software, Ubuntu operating systems are free to use and do not require licensing. Cost is always a consideration and one of the main goals of this publication is to minimize the financial footprint required to get started with deep learning on top of a Spark framework.
There are some minimum system recommendations for running the Ubuntu Desktop image:
- Minimum of 2 GHz dual-core processor
- Minimum of 2 GB system memory
- Minimum of 25 GB of free hard drive space
Follow the steps in the recipe to download an Ubuntu Desktop image:
- In order to create a virtual machine of Ubuntu Desktop, it is necessary to first download the file from the official website: https://www.ubuntu.com/download/desktop.
As of this writing, Ubuntu Desktop 16.04.3 is the most recent available version for download.
Once the download is complete, access the following file in .iso format:
ubuntu-16.04.3-desktop-amd64.iso
Virtual environments provide an optimal development workspace by isolating the relationship to the physical or host machine. Developers may be using all types of machines for their host environments such as a MacBook running macOS, a Microsoft Surface running Windows or even a virtual machine on the cloud with Microsoft Azure or AWS; however, to ensure consistency within the output of the code executed, a virtual environment within Ubuntu Desktop will be deployed that can be used and shared among a wide variety of host platforms.
There are several options for desktop virtualization software, depending on whether the host environment is Windows or macOS. There are two common software applications for virtualization when using macOS:
- VMWare Fusion
- Parallels
To learn more about Ubuntu Desktop, you can visit https://www.ubuntu.com/desktop.
This section will focus on building a virtual machine using an Ubuntu operating system with VMWare Fusion.
A previous installation of VMWare Fusion is required on your system. If you do not currently have this, you can download a trial version from the following website:
https://www.vmware.com/products/fusion/fusion-evaluation.html
Follow the steps in the recipe to configure Ubuntu with VMWare Fusion on macOS:
- Once VMWare Fusion is up and running, click on the + button on the upper-left-hand side to begin the configuration process and select New..., as seen in the following screenshot:

- Once the selection has been made, select the option to Install from Disk or Image, as seen in the following screenshot:

- Select the operating system's iso file that was downloaded from the Ubuntu Desktop website, as seen in the following screenshot:


- The configuration process is almost complete. A Virtual Machine Summary is displayed with the option to Customize Settings to increase the Memory and Hard Disk, as seen in the following screenshot:

- Anywhere from 20 to 40 GB of hard disk space is sufficient for the virtual machine; however, bumping up the memory to either 2 GB or even 4 GB will assist with the performance of the virtual machine when executing Spark code in later chapters. Update the memory by selecting Processors and Memory under the Settings of the virtual machine and increasing the Memory to the desired amount, as seen in the following screenshot:

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on VMWare Fusion. The memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.
All that is remaining is to fire up the virtual machine for the first time, which initiates the installation process of the system onto the virtual machine. Once all the setup is complete and the user has logged in, the Ubuntu virtual machine should be available for development, as seen in the following screenshot:

Aside from VMWare Fusion, there is also another product that offers similar functionality on a Mac. It is called Parallels Desktop for Mac. To learn more about VMWare and Parallels, and decide which program is a better fit for your development, visit the following websites:
- https://www.vmware.com/products/fusion.html to download and install VMWare Fusion for Mac
- https://parallels.com to download and install the Parallels Desktop for Mac
Unlike with macOS, there are several options to virtualize systems within Windows. This mainly has to do with the fact that virtualization on Windows is very common, as most developers are using Windows as their host environment and need virtual environments for testing purposes without affecting any of the dependencies that rely on Windows.
VirtualBox from Oracle is a common virtualization product and is free to use. Oracle VirtualBox provides a straightforward process to get an Ubuntu Desktop virtual machine up and running on top of a Windows environment.
Follow the steps in this recipe to configure Ubuntu with VirtualBox on Windows:
- Initiate an Oracle VM VirtualBox Manager. Next, create a new virtual machine by selecting the New icon and specify the Name, Type, and Version of the machine, as seen in the following screenshot:

- Select Expert Mode as several of the configuration steps will get consolidated, as seen in the following screenshot:

Ideal memory size should be set to at least 2048 MB, or preferably 4096 MB, depending on the resources available on the host machine.
- Additionally, set an optimal hard disk size for an Ubuntu virtual machine performing deep learning algorithms to at least 20 GB, if not more, as seen in the following screenshot:

- Point the virtual machine manager to the start-up disk location where the Ubuntu iso file was downloaded to and then Start the creation process, as seen in the following screenshot:

- After allotting some time for the installation, select the Start icon to complete the virtual machine setup and get it ready for development, as seen in the following screenshot:

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on Oracle VirtualBox. As was the case with VMWare Fusion, the memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.
Please note that some machines that run Microsoft Windows are not set up by default for virtualization, and users may receive an initial error indicating that VT-x is not enabled. This can be reversed and virtualization may be enabled in the BIOS during a reboot.
To learn more about Oracle VirtualBox and decide whether or not it is a good fit, visit the following website and select Windows hosts to begin the download process: https://www.virtualbox.org/wiki/Downloads.
Previously, we saw how Ubuntu Desktop could be set up locally using VMWare Fusion. In this section, we will learn how to do the same on Google Cloud Platform.
The only requirement is a Google account username. Begin by logging in to your Google Cloud Platform using your Google account. Google provides a free 12-month subscription with $300 credited to your account. The setup will ask for your bank details; however, Google will not charge you for anything without explicitly letting you know first. Go ahead and verify your bank account and you are good to go.
Follow the steps in the recipe to configure Ubuntu Desktop for Google Cloud Platform:
- Once logged in to your Google Cloud Platform, access a dashboard that looks like the one in the following screenshot:

Google Cloud Platform Dashboard
- First, click on the product services button in the top-left-hand corner of your screen. In the drop-down menu, under Compute, click on VM instances, as shown in the following screenshot:

- Create a new instance and name it. We are naming it ubuntuvm1 in our case. Google Cloud automatically creates a project while launching an instance and the instance will be launched under a project ID. The project may be renamed if required.
- After clicking on Create Instance, select the zone/area you are located in.
- Select Ubuntu 16.04 LTS under the boot disk, as this is the operating system that will be installed in the cloud. Please note that LTS stands for Long Term Support, meaning this release will receive long-term support from Ubuntu's developers.
- Next, under the boot disk options, select SSD persistent disk and increase the size to 50 GB for some added storage space for the instance, as shown in the following screenshot:

- Next, set Access scopes to Allow full access to all Cloud APIs.
- Under firewall, please check to allow HTTP traffic as well as allow HTTPS traffic, as shown in the following screenshot:

Selecting options Allow HTTP traffic and HTTPS Traffic
- Once the instance is configured as shown in this section, go ahead and create the instance by clicking on the Create button.
Note
After clicking on the Create button, you will notice that the instance gets created with a unique internal as well as external IP address. We will require this at a later stage. SSH refers to Secure Shell, which is basically an encrypted way of communicating in client-server architectures. Think of it as data going to and from your laptop, as well as to and from Google's cloud servers, through an encrypted tunnel.
- Click on the newly created instance. From the drop-down menu, click on open in browser window, as shown in the following screenshot:

- You will see that Google opens up a shell/terminal in a new window, as shown in the following screenshot:

- Once the shell is open, you should have a window that looks like the following screenshot:

- Type the following commands in the Google cloud shell:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install gnome-shell
$ sudo apt-get install ubuntu-gnome-desktop
$ sudo apt-get install autocutsel
$ sudo apt-get install gnome-core
$ sudo apt-get install gnome-panel
$ sudo apt-get install gnome-themes-standard
- When presented with a prompt to continue or not, type y and select ENTER, as shown in the following screenshot:

- Next, install tightvncserver and create the .Xresources file by typing the following commands:
$ sudo apt-get install tightvncserver
$ touch ~/.Xresources
- Next, launch the server by typing the following command:
$ tightvncserver
- This will prompt you to enter a password, which will later be used to log in to the Ubuntu Desktop virtual machine. This password is limited to eight characters and needs to be set and verified, as shown in the following screenshot:

- A startup script is automatically generated by the shell, as shown in the following screenshot. This startup script can be accessed and edited by copying and pasting its PATH in the following manner:

- In our case, the command to view and edit the script is:
:~$ vim /home/amrith2kmeanmachine/.vnc/xstartup
This PATH may be different in each case. Ensure you set the right PATH. The vim command opens up the script in a text editor.
Note
The local shell generated a startup script as well as a log file. The startup script needs to be opened and edited in a text editor, which will be discussed next.
- After typing the vim command, the screen with the startup script should look something like this screenshot:

- Type i to enter INSERT mode. Next, delete all the text in the startup script. It should then look like the following screenshot:

- Copy and paste the following code into the startup script:
#!/bin/sh
autocutsel -fork
xrdb $HOME/.Xresources
xsetroot -solid grey
export XKL_XMODMAP_DISABLE=1
export XDG_CURRENT_DESKTOP="GNOME-Flashback:Unity"
export XDG_MENU_PREFIX="gnome-flashback-"
unset DBUS_SESSION_BUS_ADDRESS
gnome-session --session=gnome-flashback-metacity --disable-acceleration-check --debug &
- The script should appear in the editor, as seen in the following screenshot:

- Press Esc to exit out of INSERT mode and type :wq to write and quit the file.
- Once the startup script has been configured, type the following command in the Google shell to kill the server and save the changes:
$ vncserver -kill :1
- This command should produce a process ID that looks like the one in the following screenshot:

- Start the server again by typing the following command:
$ vncserver -geometry 1024x640
The next series of steps will focus on establishing a secure shell tunnel into the Google Cloud instance from the local host. Before typing anything on the local shell/terminal, ensure that the Google Cloud SDK is installed. If it is not already installed, do so by following the instructions in the quick-start guide located at the following website:
https://cloud.google.com/sdk/docs/quickstart-mac-os-x
- Once the Google Cloud SDK is installed, open up the terminal on your machine and type the following command to connect to the Google Cloud compute instance:
$ gcloud compute ssh \
YOUR INSTANCE NAME HERE \
--project YOUR PROJECT NAME HERE \
--zone YOUR ZONE HERE \
--ssh-flag "-L 5901:localhost:5901"
- Ensure that the instance name, project ID, and zone are specified correctly in the preceding command. On pressing ENTER, the output on the local shell changes to what is shown in the following screenshot:

- Once you see the name of your instance followed by ":~$", it means that a connection has successfully been established between the local host/laptop and the Google Cloud instance. After successfully SSHing into the instance, we require software called VNC Viewer to view and interact with the Ubuntu Desktop that has now been successfully set up on the Google Cloud Compute Engine. The following few steps will discuss how this is achieved.
VNC Viewer can be downloaded and installed from the following website: https://www.realvnc.com/en/connect/download/viewer/
- Once installed, click to open VNC Viewer and in the search bar, type in localhost::5901, as shown in the following screenshot:

- Next, click on continue when prompted with the following screen:

- This will prompt you to enter your password for the virtual machine. Enter the password that you set earlier while launching the tightvncserver command for the first time, as shown in the following screenshot:

- You will finally be taken into the desktop of your Ubuntu virtual machine on Google Cloud Compute. Your Ubuntu Desktop screen should now look something like the following screenshot when viewed in VNC Viewer:

You have now successfully set up VNC Viewer for interacting with the Ubuntu virtual machine/desktop. Anytime the Google Cloud instance is not in use, it is recommended to suspend or shut down the instance so that additional costs are not incurred. The cloud approach is optimal for developers who may not have access to physical resources with high memory and storage.
While we discussed Google Cloud as a cloud option for Spark, it is possible to leverage Spark on the following cloud platforms as well:
- Microsoft Azure
- Amazon Web Services
Before Spark can get up and running, there are some necessary prerequisites that need to be installed on a newly minted Ubuntu Desktop. This section will focus on installing and configuring the following on Ubuntu Desktop:
- Java 8 or higher
- Anaconda
- Spark
The only requirement for this section is having administrative rights to install applications onto the Ubuntu Desktop.
This section walks through the steps in the recipe to install Python 3, Anaconda, and Spark on Ubuntu Desktop:
- Install Java on Ubuntu through the terminal application, which can be found by searching for the app and then locking it to the launcher on the left-hand side, as seen in the following screenshot:

- Perform an initial test for Java on the virtual machine by executing the following command at the terminal:
java -version
- Execute the following four commands at the terminal to install Java:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
- After accepting the necessary license agreements for Oracle, perform a secondary test of Java on the virtual machine by executing java -version once again in the terminal. A successful installation of Java will display the following outcome in the terminal:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
- Next, install the most recent version of Anaconda. Current versions of Ubuntu Desktop come preinstalled with Python. While it is convenient that Python comes preinstalled with Ubuntu, the installed version is Python 2.7, as seen in the following output:
$ python --version
Python 2.7.12
- The current version of Anaconda is v4.4 and the current version of Python 3 is v3.6. Download the Anaconda installer for Linux (the download link is given in the note later in this section) and, once downloaded, view the Anaconda installation file by accessing the Downloads folder using the following command:
$ cd Downloads/
~/Downloads$ ls
Anaconda3-4.4.0-Linux-x86_64.sh
- Once in the Downloads folder, initiate the installation for Anaconda by executing the following command:
~/Downloads$ bash Anaconda3-4.4.0-Linux-x86_64.sh
Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue
Note
Please note that the version of Anaconda, as well as any other software installed, may differ as newer updates are released to the public. The version of Anaconda that we are using in this chapter and in this book can be downloaded from https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86.sh
- Once the Anaconda installation is complete, restart the Terminal application to confirm that Python 3 is now the default Python environment through Anaconda by executing python --version in the terminal:
$ python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
- The Python 2 version is still available under Linux, but will require an explicit call when executing a script, as seen in the following command:
~$ python2 --version
Python 2.7.12
- Visit the following website to begin the Spark download and installation process:
https://spark.apache.org/downloads.html
- Select the download link. The following file will be downloaded to the Downloads folder in Ubuntu:
spark-2.2.0-bin-hadoop2.7.tgz
- View the file at the terminal level by executing the following commands:
$ cd Downloads/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
- Extract the tgz file by executing the following command:
~/Downloads$ tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7  spark-2.2.0-bin-hadoop2.7.tgz
- Move the spark-2.2.0-bin-hadoop2.7 folder to the home directory and confirm the move by executing the following commands:
~/Downloads$ mv spark-2.2.0-bin-hadoop2.7 ~/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ cd
~$ ls
anaconda3  Downloads         Pictures  Templates
Desktop    examples.desktop  Public    Videos
Documents  Music             spark-2.2.0-bin-hadoop2.7
- Now, the spark-2.2.0-bin-hadoop2.7 folder has been moved to the Home folder, which can be viewed when selecting the Files icon on the left-hand side toolbar, as seen in the following screenshot:

- Spark is now installed. Initiate Spark from the terminal by executing the following script at the terminal level:
~$ cd ~/spark-2.2.0-bin-hadoop2.7/
~/spark-2.2.0-bin-hadoop2.7$ ./bin/pyspark
- Perform a final test to ensure Spark is up and running at the terminal by executing the following command to ensure that the SparkContext is driving the cluster in the local environment (a further sanity check is sketched after the output below):
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
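To go one step beyond displaying the SparkContext, you can ask the cluster to perform a small computation. The following is a minimal sketch to run inside the same pyspark shell, where sc is already defined; the numbers are illustrative only:

# Run inside the pyspark shell, where the SparkContext is available as sc.
# Distribute a small range of numbers across the local cores and aggregate them.
rdd = sc.parallelize(range(1, 101))
print(rdd.count())   # expected: 100
print(rdd.sum())     # expected: 5050

If both values print without errors, the local cluster is accepting and executing work.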
This section explains the reasoning behind the installation process for Python, Anaconda, and Spark.
Note
In order for Spark to run on a local machine or in a cluster, a minimum version of Java 6 is required for installation.
- Ubuntu recommends the sudo apt install method for Java as it ensures that packages downloaded are up to date.
- Please note that if Java is not currently installed, the output in the terminal will show the following message:
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
- While Python 2 is fine, it is considered legacy Python. Python 2 is facing an end of life date in 2020; therefore, it is recommended that all new Python development be performed with Python 3, as will be the case in this publication. Up until recently, Spark was only available with Python 2. That is no longer the case. Spark works with both Python 2 and 3. A convenient way to install Python 3, as well as many dependencies and libraries, is through Anaconda. Anaconda is a free and open source distribution of Python, as well as R. Anaconda manages the installation and maintenance of many of the most common packages used in Python for data science-related tasks.
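To confirm which interpreter is actually being picked up once Anaconda is installed, a quick check from within Python itself can help. This is a minimal sketch; the exact version string and path will differ by machine:

# Start python at the terminal and run the following.
import sys
print(sys.version)      # expected to mention Python 3.6.x and Anaconda
print(sys.executable)   # expected to point inside /home/<username>/anaconda3/bin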
During the installation process for Anaconda, it is important to confirm the following conditions:
- Anaconda is installed in the /home/username/Anaconda3 location
- The Anaconda installer prepends the Anaconda3 install location to the PATH in /home/username/.bashrc
- After Anaconda has been installed, download Spark. Unlike Python, Spark does not come preinstalled on Ubuntu and therefore, will need to be downloaded and installed.
For the purposes of development with deep learning, the following preferences will be selected for Spark:
- Spark release: 2.2.0 (Jul 11 2017)
- Package type: Prebuilt for Apache Hadoop 2.7 and later
- Download type: Direct download
- Once Spark has been successfully installed, the output from executing Spark at the command line should look something similar to that shown in the following screenshot:

- Two important features to note when initializing Spark are that it is under the Python 3.6.1 | Anaconda 4.4.0 (64-bit) framework and that the Spark logo shows version 2.2.0.
- Congratulations! Spark is successfully installed on the local Ubuntu virtual machine. But not everything is complete. Spark development is best when Spark code can be executed within a Jupyter notebook, especially for deep learning. Thankfully, Jupyter was installed with the Anaconda distribution earlier in this section.
You may be asking why we did not just use pip install pyspark to use Spark in Python. Previous versions of Spark required going through the installation process that we did in this section. Future versions of Spark, starting with 2.2.0, will allow installation directly through the pip approach. We used the full installation method in this section to ensure that you will be able to get Spark installed and fully integrated, in case you are using an earlier version of Spark.
To learn more about Jupyter notebooks and their integration with Python, visit the following website: https://jupyter.org
To learn more about Anaconda and download a version for Linux, visit the following website: https://www.anaconda.com/download/
When learning Python for the first time, it is useful to use Jupyter notebooks as an interactive development environment (IDE). This is one of the main reasons why Anaconda is so powerful. It fully integrates all of the dependencies between Python and Jupyter notebooks. The same can be done with PySpark and Jupyter notebooks. While Spark is written in Scala, PySpark allows for the translation of code to occur within Python instead.
Most of the work in this section will just require accessing the .bashrc script from the terminal.
PySpark is not configured to work within Jupyter notebooks by default, but a slight tweak of the .bashrc script can remedy this issue. We will walk through these steps in this section:
- Access the .bashrc script by executing the following command:
$ nano .bashrc
- Scrolling all the way to the end of the script should reveal the last command modified, which should be the PATH set by Anaconda during the installation in the previous section. The PATH should appear as seen in the following:
# added by Anaconda3 4.4.0 installer
export PATH="/home/asherif844/anaconda3/bin:$PATH"
- Underneath the PATH added by the Anaconda installer, we can include a custom function that helps connect the Spark installation with the Jupyter notebook installation from Anaconda3. For the purposes of this chapter and the remaining chapters, we will name that function sparknotebook. The configuration should appear as follows for sparknotebook():
function sparknotebook()
{
export SPARK_HOME=/home/asherif844/spark-2.2.0-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
$SPARK_HOME/bin/pyspark
}
- The updated .bashrc script should look like the following once saved:

- Save and exit from the .bashrc file. It is recommended to apply the changes to the .bashrc file by executing the following command and restarting the terminal application:
$ source .bashrc
Our goal in this section is to integrate Spark directly into a Jupyter notebook so that we are not doing our development at the terminal and instead utilizing the benefits of developing within a notebook. This section explains how the Spark integration within a Jupyter notebook takes place.
- We will create a command function, sparknotebook, that we can call from the terminal to open up a Spark session through Jupyter notebooks from the Anaconda installation. This requires two settings to be set in the .bashrc file:
  - PySpark Python to be set to python3
  - The PySpark driver for Python to be set to Jupyter
- The sparknotebook function can now be accessed directly from the terminal by executing the following command:
$ sparknotebook
- The function should then initiate a brand new Jupyter notebook session through the default web browser. A new Python script within Jupyter notebooks with a .ipynb extension can be created by clicking on the New button on the right-hand side and by selecting Python 3 under Notebook:, as seen in the following screenshot:

- Once again, just as was done at the terminal level for Spark, a simple script of sc will be executed within the notebook to confirm that Spark is up and running through Jupyter:

- Ideally, the Version, Master, and AppName should be identical to the earlier output when sc was executed at the terminal. If this is the case, then PySpark has been successfully installed and configured to work with Jupyter notebooks.
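If you prefer explicit checks over reading the sc banner, the individual attributes can also be printed directly in a notebook cell. This is a minimal sketch; it assumes the notebook was launched through the sparknotebook function, so that sc is already defined:

# Run in a Jupyter notebook cell started through sparknotebook,
# where the SparkContext is already available as sc.
print(sc.version)   # expected to match the terminal output, for example 2.2.0
print(sc.master)    # expected: local[*]
print(sc.appName)   # expected: PySparkShell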
It is important to note that if we were to call a Jupyter notebook through the terminal without specifying sparknotebook, our Spark session will never be initiated and we will receive an error when executing the SparkContext script.
We can access a traditional Jupyter notebook by executing the following at the terminal:
jupyter-notebook
Once we start the notebook, we can try and execute the same script for sc.master as we did previously, but this time we will receive the following error:

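A defensive way to detect this situation from inside any notebook is to check whether sc exists before using it. This is a hedged sketch rather than part of the recipe; the printed messages are illustrative only:

# Check whether a SparkContext was created when the notebook was launched.
try:
    sc  # raises NameError if no SparkContext was created at launch
    print("SparkContext is available:", sc.master)
except NameError:
    print("No SparkContext found; launch the notebook with the sparknotebook function instead.")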
There are many managed offerings online from companies offering Spark through a notebook interface, where the installation and configuration of Spark with a notebook have already been managed for you. These are the following:
- Hortonworks (https://hortonworks.com/)
- Cloudera (https://www.cloudera.com/)
- MapR (https://mapr.com/)
- DataBricks (https://databricks.com/)
For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.
This section walks through the steps to initialize and configure a Spark cluster.
- Import SparkSession using the following script:
from pyspark.sql import SparkSession
- Configure SparkSession with a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
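As a quick sanity check, and not a required part of the recipe, a tiny DataFrame can be created with the new session to confirm that the cluster accepts work; the column names and values below are purely illustrative:

# Assumes the spark session created above is available in the same notebook.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()                          # displays the two rows
print(spark.sparkContext.master)   # expected: local[*]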
This section explains how the SparkSession works as an entry point to develop within Spark.
- Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark. Those steps are no longer needed, as importing SparkSession will handle initializing a cluster. Additionally, it is important to note that SparkSession is part of the sql module from pyspark.
- We can assign properties to our SparkSession:
  - master: assigns the Spark master URL to run on our local machine with the maximum available number of cores
  - appName: assigns a name for the application
  - config: assigns 6gb to spark.executor.memory
  - getOrCreate: ensures that a SparkSession is created if one is not available and retrieves an existing one if it is available
For development purposes, while we are building an application on smaller datasets, we can just use master("local"). If we were to deploy on a production environment, we would want to specify master("local[*]") to ensure we are using the maximum cores available and get optimal performance.
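For illustration, the only difference between the two configurations is the master string passed to the builder; everything else in the recipe stays the same. This is a hedged sketch and the application name is just a placeholder:

from pyspark.sql import SparkSession

# Choose one of the two master settings below. getOrCreate() reuses an existing
# session, so stop any running session before switching configurations.

# While prototyping on smaller datasets, a single local core is enough:
spark = SparkSession.builder.master("local").appName("GenericAppName").getOrCreate()

# To use every core available on the machine instead:
# spark = SparkSession.builder.master("local[*]").appName("GenericAppName").getOrCreate()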
To learn more about SparkSession.builder
, visit the following website:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html
Once we are done developing on our cluster, it is ideal to shut it down and preserve resources.
This section walks through the steps to stop the SparkSession.
- Execute the following script:
spark.stop()
- Confirm that the session has closed by executing the following script:
sc.master
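If you prefer a check that does not rely on reading the error output, the following hedged sketch wraps a small job in a try/except; the exact exception type and message depend on the Spark version, so treat them as illustrative:

# After spark.stop(), submitting any new work should fail.
try:
    spark.sparkContext.parallelize([1, 2, 3]).count()
    print("Cluster is still running")
except Exception as e:
    print("Cluster has been stopped:", e)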
This section explains how to confirm that a Spark cluster has been shut down.
- If the cluster has been shut down, you will receive the error message seen in the following screenshot when executing another Spark command in the notebook:
