In this chapter, the following recipes will be covered:
- Downloading an Ubuntu Desktop image
- Installing and configuring Ubuntu with VMWare Fusion on macOS
- Installing and configuring Ubuntu with Oracle VirtualBox on Windows
- Installing and configuring Ubuntu Desktop for Google Cloud Platform
- Installing and configuring Spark and prerequisites on Ubuntu Desktop
- Integrating Jupyter notebooks with Spark
- Starting and configuring a Spark cluster
- Stopping a Spark cluster
Deep learning is the focused study of machine learning algorithms that deploy neural networks as their main method of learning. Deep learning has exploded onto the scene within just the last couple of years. Microsoft, Google, Facebook, Amazon, Apple, Tesla, and many other companies are all utilizing deep learning models in their apps, websites, and products. At the same time, Spark, an in-memory compute engine running on top of big data sources, has made it easy to process large volumes of information at record speed and with ease. In fact, Spark has now become the leading big data development tool for data engineers, machine learning engineers, and data scientists.
Since deep learning models perform better with more data, the synergy between Spark and deep learning allowed for a perfect marriage. Almost as important as the code used to execute deep learning algorithms is the work environment that enables optimal development. Many talented minds are eager to develop neural networks to help answer important questions in their research. Unfortunately, one of the greatest barriers to the development of deep learning models is access to the necessary technical resources required to learn on big data. The purpose of this chapter is to create an ideal virtual development environment for deep learning on Spark.
Spark can be set up for all types of operating systems, whether they reside on-premise or in the cloud. For our purposes, Spark will be installed on a Linux-based virtual machine with Ubuntu as the operating system. There are several advantages to using Ubuntu as the go-to virtual machine, not least of which is cost. Since they are based on open source software, Ubuntu operating systems are free to use and do not require licensing. Cost is always a consideration and one of the main goals of this publication is to minimize the financial footprint required to get started with deep learning on top of a Spark framework.
There are some minimum system recommendations for running the Ubuntu Desktop image:
- Minimum of 2 GHz dual-core processor
- Minimum of 2 GB system memory
- Minimum of 25 GB of free hard drive space
Follow the steps in the recipe to download an Ubuntu Desktop image:
- In order to create a virtual machine of Ubuntu Desktop, it is necessary to first download the file from the official website: https://www.ubuntu.com/download/desktop.
As of this writing, Ubuntu Desktop 16.04.3 is the most recent available version for download.
Once the download is complete, access the following file in .iso format:
ubuntu-16.04.3-desktop-amd64.iso
Virtual environments provide an optimal development workspace by isolating the relationship to the physical or host machine. Developers may be using all types of machines for their host environments such as a MacBook running macOS, a Microsoft Surface running Windows or even a virtual machine on the cloud with Microsoft Azure or AWS; however, to ensure consistency within the output of the code executed, a virtual environment within Ubuntu Desktop will be deployed that can be used and shared among a wide variety of host platforms.
There are several options for desktop virtualization software, depending on whether the host environment is Windows or macOS. There are two common software applications for virtualization when using macOS:
- VMWare Fusion
- Parallels
To learn more about Ubuntu Desktop, you can visit https://www.ubuntu.com/desktop.
This section will focus on building a virtual machine using an Ubuntu operating system with VMWare Fusion.
A previous installation of VMWare Fusion is required on your system. If you do not currently have this, you can download a trial version from the following website:
https://www.vmware.com/products/fusion/fusion-evaluation.html
Follow the steps in the recipe to configure Ubuntu with VMWare Fusion on macOS:
- Once VMWare Fusion is up and running, click on the + button on the upper-left-hand side to begin the configuration process and select New..., as seen in the following screenshot:

- Once the selection has been made, select the option to Install from Disk or Image, as seen in the following screenshot:

- Select the operating system's iso file that was downloaded from the Ubuntu Desktop website, as seen in the following screenshot:


- The configuration process is almost complete. A Virtual Machine Summary is displayed with the option to Customize Settings to increase the Memory and Hard Disk, as seen in the following screenshot:

- Anywhere from 20 to 40 GB of hard disk space is sufficient for the virtual machine; however, bumping up the memory to either 2 GB or even 4 GB will assist with the performance of the virtual machine when executing Spark code in later chapters. Update the memory by selecting Processors and Memory under the Settings of the virtual machine and increasing the Memory to the desired amount, as seen in the following screenshot:

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on VMWare Fusion. The memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.
All that is remaining is to fire up the virtual machine for the first time, which initiates the installation process of the system onto the virtual machine. Once all the setup is complete and the user has logged in, the Ubuntu virtual machine should be available for development, as seen in the following screenshot:

Aside from VMWare Fusion, there is also another product that offers similar functionality on a Mac. It is called Parallels Desktop for Mac. To learn more about VMWare and Parallels, and decide which program is a better fit for your development, visit the following websites:
- https://www.vmware.com/products/fusion.html to download and install VMWare Fusion for Mac
- https://parallels.com to download and install the Parallels Desktop for Mac
Unlike with macOS, there are several options to virtualize systems within Windows. This mainly has to do with the fact that virtualization on Windows is very common, as most developers are using Windows as their host environment and need virtual environments for testing purposes without affecting any of the dependencies that rely on Windows.
VirtualBox from Oracle is a common virtualization product and is free to use. Oracle VirtualBox provides a straightforward process to get an Ubuntu Desktop virtual machine up and running on top of a Windows environment.
Follow the steps in this recipe to configure Ubuntu with VirtualBox on Windows:
- Initiate an Oracle VM VirtualBox Manager. Next, create a new virtual machine by selecting the New icon and specify the Name, Type, and Version of the machine, as seen in the following screenshot:

- Select Expert Mode as several of the configuration steps will get consolidated, as seen in the following screenshot:

Ideal memory size should be set to at least 2048 MB, or preferably 4096 MB, depending on the resources available on the host machine.
- Additionally, set an optimal hard disk size for an Ubuntu virtual machine performing deep learning algorithms to at least 20 GB, if not more, as seen in the following screenshot:

- Point the virtual machine manager to the start-up disk location where the Ubuntu iso file was downloaded to and then Start the creation process, as seen in the following screenshot:

- After allotting some time for the installation, select the Start icon to complete the virtual machine setup and get it ready for development, as seen in the following screenshot:

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on Oracle VirtualBox. As was the case with VMWare Fusion, the memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.
Please note that some machines that run Microsoft Windows are not set up by default for virtualization, and users may receive an initial error indicating that VT-x is not enabled. This can be reversed and virtualization may be enabled in the BIOS during a reboot.
To learn more about Oracle VirtualBox and decide whether or not it is a good fit, visit the following website and select Windows hosts to begin the download process: https://www.virtualbox.org/wiki/Downloads.
Previously, we saw how Ubuntu Desktop could be set up locally using VMWare Fusion. In this section, we will learn how to do the same on Google Cloud Platform.
The only requirement is a Google account username. Begin by logging in to your Google Cloud Platform using your Google account. Google provides a free 12-month subscription with $300 credited to your account. The setup will ask for your bank details; however, Google will not charge you for anything without explicitly letting you know first. Go ahead and verify your bank account and you are good to go.
Follow the steps in the recipe to configure Ubuntu Desktop for Google Cloud Platform:
- Once logged in to your Google Cloud Platform, access a dashboard that looks like the one in the following screenshot:

Google Cloud Platform Dashboard
- First, click on the product services button in the top-left-hand corner of your screen. In the drop-down menu, under Compute, click on VM instances, as shown in the following screenshot:

- Create a new instance and name it. We are naming it ubuntuvm1 in our case. Google Cloud automatically creates a project while launching an instance and the instance will be launched under a project ID. The project may be renamed if required.
- After clicking on Create Instance, select the zone/area you are located in.
- Select Ubuntu 16.04 LTS under the boot disk, as this is the operating system that will be installed in the cloud. Please note that LTS stands for Long Term Support, meaning this release will receive long-term support from Ubuntu's developers.
- Next, under the boot disk options, select SSD persistent disk and increase the size to 50 GB for some added storage space for the instance, as shown in the following screenshot:

- Next, set Access scopes to Allow full access to all Cloud APIs.
- Under firewall, please check to allow HTTP traffic as well as allow HTTPS traffic, as shown in the following screenshot:

Selecting options Allow HTTP traffic and HTTPS Traffic
- Once the instance is configured as shown in this section, go ahead and create the instance by clicking on the Create button.
Note
After clicking on the Create button, you will notice that the instance gets created with a unique internal as well as external IP address. We will require this at a later stage. SSH refers to Secure Shell, which is basically an encrypted way of communicating in client-server architectures. Think of it as data going to and from your laptop, as well as to and from Google's cloud servers, through an encrypted tunnel.
- Click on the newly created instance. From the drop-down menu, click on open in browser window, as shown in the following screenshot:

- You will see that Google opens up a shell/terminal in a new window, as shown in the following screenshot:

- Once the shell is open, you should have a window that looks like the following screenshot:

- Type the following commands in the Google cloud shell:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install gnome-shell
$ sudo apt-get install ubuntu-gnome-desktop
$ sudo apt-get install autocutsel
$ sudo apt-get install gnome-core
$ sudo apt-get install gnome-panel
$ sudo apt-get install gnome-themes-standard
- When presented with a prompt to continue or not, type y and select ENTER, as shown in the following screenshot:

- Next, install tightvncserver and create the .Xresources file by typing the following commands:
$ sudo apt-get install tightvncserver
$ touch ~/.Xresources
- Next, launch the server by typing the following command:
$ tightvncserver
- This will prompt you to enter a password, which will later be used to log in to the Ubuntu Desktop virtual machine. This password is limited to eight characters and needs to be set and verified, as shown in the following screenshot:

- A startup script is automatically generated by the shell, as shown in the following screenshot. This startup script can be accessed and edited by copying and pasting its PATH in the following manner:

- In our case, the command to view and edit the script is:
:~$ vim /home/amrith2kmeanmachine/.vnc/xstartup
This PATH may be different in each case. Ensure you set the right PATH. The vim command opens up the script in a text editor.
Note
The local shell generated a startup script as well as a log file. The startup script needs to be opened and edited in a text editor, which will be discussed next.
- After typing the vim command, the screen with the startup script should look something like this screenshot:

- Type i to enter INSERT mode. Next, delete all the text in the startup script. It should then look like the following screenshot:

- Copy and paste the following code into the startup script:
#!/bin/sh
autocutsel -fork
xrdb $HOME/.Xresources
xsetroot -solid grey
export XKL_XMODMAP_DISABLE=1
export XDG_CURRENT_DESKTOP="GNOME-Flashback:Unity"
export XDG_MENU_PREFIX="gnome-flashback-"
unset DBUS_SESSION_BUS_ADDRESS
gnome-session --session=gnome-flashback-metacity --disable-acceleration-check --debug &
- The script should appear in the editor, as seen in the following screenshot:

- Press Esc to exit out of INSERT mode and type :wq to write and quit the file.
- Once the startup script has been configured, type the following command in the Google shell to kill the server and save the changes:
$ vncserver -kill :1
- This command should produce a process ID that looks like the one in the following screenshot:

- Start the server again by typing the following command:
$ vncserver -geometry 1024x640
The next series of steps will focus on establishing a secure shell tunnel into the Google Cloud instance from the local host. Before typing anything on the local shell/terminal, ensure that the Google Cloud SDK is installed. If it is not already installed, do so by following the instructions in the quick-start guide located at the following website:
https://cloud.google.com/sdk/docs/quickstart-mac-os-x
- Once the Google Cloud SDK is installed, open up the terminal on your machine and type the following command to connect to the Google Cloud compute instance:
$ gcloud compute ssh \
YOUR INSTANCE NAME HERE \
--project YOUR PROJECT NAME HERE \
--zone YOUR ZONE HERE \
--ssh-flag "-L 5901:localhost:5901"
- Ensure that the instance name, project ID, and zone are specified correctly in the preceding command. On pressing ENTER, the output on the local shell changes to what is shown in the following screenshot:

- Once you see the name of your instance followed by ":~$", it means that a connection has successfully been established between the local host/laptop and the Google Cloud instance. After successfully SSHing into the instance, we require software called VNC Viewer to view and interact with the Ubuntu Desktop that has now been successfully set up on the Google Cloud Compute Engine. The following few steps will discuss how this is achieved.
VNC Viewer can be downloaded and installed from the following website: https://www.realvnc.com/en/connect/download/viewer/
- Once installed, click to open VNC Viewer and in the search bar, type in localhost::5901, as shown in the following screenshot:

- Next, click on continue when prompted with the following screen:

- This will prompt you to enter your password for the virtual machine. Enter the password that you set earlier while launching the tightvncserver command for the first time, as shown in the following screenshot:

- You will finally be taken into the desktop of your Ubuntu virtual machine on Google Cloud Compute. Your Ubuntu Desktop screen should now look something like the following screenshot when viewed in VNC Viewer:

You have now successfully set up VNC Viewer for interacting with the Ubuntu virtual machine/desktop. Anytime the Google Cloud instance is not in use, it is recommended to suspend or shut down the instance so that additional costs are not incurred. The cloud approach is optimal for developers who may not have access to physical resources with high memory and storage.
While we discussed Google Cloud as a cloud option for Spark, it is possible to leverage Spark on the following cloud platforms as well:
- Microsoft Azure
- Amazon Web Services
Before Spark can get up and running, there are some necessary prerequisites that need to be installed on a newly minted Ubuntu Desktop. This section will focus on installing and configuring the following on Ubuntu Desktop:
- Java 8 or higher
- Anaconda
- Spark
The only requirement for this section is having administrative rights to install applications onto the Ubuntu Desktop.
This section walks through the steps in the recipe to install Python 3, Anaconda, and Spark on Ubuntu Desktop:
- Install Java on Ubuntu through the terminal application, which can be found by searching for the app and then locking it to the launcher on the left-hand side, as seen in the following screenshot:

- Perform an initial test for Java on the virtual machine by executing the following command at the terminal:
java -version
- Execute the following four commands at the terminal to install Java:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
- After accepting the necessary license agreements for Oracle, perform a secondary test of Java on the virtual machine by executing java -version once again in the terminal. A successful installation of Java will display the following outcome in the terminal:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
- Next, install the most recent version of Anaconda. Current versions of Ubuntu Desktop come preinstalled with Python. While it is convenient that Python comes preinstalled with Ubuntu, the installed version is Python 2.7, as seen in the following output:
$ python --version
Python 2.7.12
- The current version of Anaconda is v4.4 and the current version of Python 3 is v3.6. Download the Anaconda installer for Linux (the download link is given in the note later in this section) and, once downloaded, view the Anaconda installation file by accessing the Downloads folder using the following command:
$ cd Downloads/
~/Downloads$ ls
Anaconda3-4.4.0-Linux-x86_64.sh
- Once in the Downloads folder, initiate the installation for Anaconda by executing the following command:
~/Downloads$ bash Anaconda3-4.4.0-Linux-x86_64.sh
Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue
Note
Please note that the version of Anaconda, as well as any other software installed, may differ as newer updates are released to the public. The version of Anaconda that we are using in this chapter and in this book can be downloaded from https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86.sh
- Once the Anaconda installation is complete, restart the Terminal application to confirm that Python 3 is now the default Python environment through Anaconda by executing python --version in the terminal:
$ python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
- The Python 2 version is still available under Linux, but will require an explicit call when executing a script, as seen in the following command:
~$ python2 --version
Python 2.7.12
- Visit the following website to begin the Spark download and installation process:
https://spark.apache.org/downloads.html
- Select the download link. The following file will be downloaded to the Downloads folder in Ubuntu:
spark-2.2.0-bin-hadoop2.7.tgz
- View the file at the terminal level by executing the following commands:
$ cd Downloads/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
- Extract the tgz file by executing the following command:
~/Downloads$ tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7  spark-2.2.0-bin-hadoop2.7.tgz
- Move the spark-2.2.0-bin-hadoop2.7 folder to the home directory and confirm the move by executing the following commands:
~/Downloads$ mv spark-2.2.0-bin-hadoop2.7 ~/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ cd
~$ ls
anaconda3  Downloads         Pictures  Templates
Desktop    examples.desktop  Public    Videos
Documents  Music             spark-2.2.0-bin-hadoop2.7
- Now, the spark-2.2.0-bin-hadoop2.7 folder has been moved to the Home folder, which can be viewed when selecting the Files icon on the left-hand side toolbar, as seen in the following screenshot:

- Spark is now installed. Initiate Spark from the terminal by executing the following script at the terminal level:
~$ cd ~/spark-2.2.0-bin-hadoop2.7/
~/spark-2.2.0-bin-hadoop2.7$ ./bin/pyspark
- Perform a final test to ensure Spark is up and running at the terminal by executing the following command to ensure that the SparkContext is driving the cluster in the local environment (a further sanity check is sketched after the output below):
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
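To go one step beyond displaying the SparkContext, you can ask the cluster to perform a small computation. The following is a minimal sketch to run inside the same pyspark shell, where sc is already defined; the numbers are illustrative only:

# Run inside the pyspark shell, where the SparkContext is available as sc.
# Distribute a small range of numbers across the local cores and aggregate them.
rdd = sc.parallelize(range(1, 101))
print(rdd.count())   # expected: 100
print(rdd.sum())     # expected: 5050

If both values print without errors, the local cluster is accepting and executing work.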
This section explains the reasoning behind the installation process for Python, Anaconda, and Spark.
Note
In order for Spark to run on a local machine or in a cluster, a minimum version of Java 6 is required for installation.
- Ubuntu recommends the sudo apt install method for Java as it ensures that packages downloaded are up to date.
- Please note that if Java is not currently installed, the output in the terminal will show the following message:
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
- While Python 2 is fine, it is considered legacy Python. Python 2 is facing an end of life date in 2020; therefore, it is recommended that all new Python development be performed with Python 3, as will be the case in this publication. Up until recently, Spark was only available with Python 2. That is no longer the case. Spark works with both Python 2 and 3. A convenient way to install Python 3, as well as many dependencies and libraries, is through Anaconda. Anaconda is a free and open source distribution of Python, as well as R. Anaconda manages the installation and maintenance of many of the most common packages used in Python for data science-related tasks.
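To confirm which interpreter is actually being picked up once Anaconda is installed, a quick check from within Python itself can help. This is a minimal sketch; the exact version string and path will differ by machine:

# Start python at the terminal and run the following.
import sys
print(sys.version)      # expected to mention Python 3.6.x and Anaconda
print(sys.executable)   # expected to point inside /home/<username>/anaconda3/bin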
During the installation process for Anaconda, it is important to confirm the following conditions:
- Anaconda is installed in the /home/username/Anaconda3 location
- The Anaconda installer prepends the Anaconda3 install location to the PATH in /home/username/.bashrc
- After Anaconda has been installed, download Spark. Unlike Python, Spark does not come preinstalled on Ubuntu and therefore, will need to be downloaded and installed.
For the purposes of development with deep learning, the following preferences will be selected for Spark:
- Spark release: 2.2.0 (Jul 11 2017)
- Package type: Prebuilt for Apache Hadoop 2.7 and later
- Download type: Direct download
- Once Spark has been successfully installed, the output from executing Spark at the command line should look something similar to that shown in the following screenshot:

- Two important features to note when initializing Spark are that it is under the Python 3.6.1 | Anaconda 4.4.0 (64-bit) framework and that the Spark logo shows version 2.2.0.
- Congratulations! Spark is successfully installed on the local Ubuntu virtual machine. But not everything is complete. Spark development is best when Spark code can be executed within a Jupyter notebook, especially for deep learning. Thankfully, Jupyter was installed with the Anaconda distribution earlier in this section.
You may be asking why we did not just use pip install pyspark to use Spark in Python. Previous versions of Spark required going through the installation process that we did in this section. Future versions of Spark, starting with 2.2.0, will allow installation directly through the pip approach. We used the full installation method in this section to ensure that you will be able to get Spark installed and fully integrated, in case you are using an earlier version of Spark.
To learn more about Jupyter notebooks and their integration with Python, visit the following website: https://jupyter.org
To learn more about Anaconda and download a version for Linux, visit the following website: https://www.anaconda.com/download/
When learning Python for the first time, it is useful to use Jupyter notebooks as an interactive development environment (IDE). This is one of the main reasons why Anaconda is so powerful. It fully integrates all of the dependencies between Python and Jupyter notebooks. The same can be done with PySpark and Jupyter notebooks. While Spark is written in Scala, PySpark allows for the translation of code to occur within Python instead.
Most of the work in this section will just require accessing the .bashrc script from the terminal.
PySpark is not configured to work within Jupyter notebooks by default, but a slight tweak of the .bashrc script can remedy this issue. We will walk through these steps in this section:
- Access the .bashrc script by executing the following command:
$ nano .bashrc
- Scrolling all the way to the end of the script should reveal the last command modified, which should be the PATH set by Anaconda during the installation in the previous section. The PATH should appear as seen in the following:
# added by Anaconda3 4.4.0 installer
export PATH="/home/asherif844/anaconda3/bin:$PATH"
- Underneath the PATH added by the Anaconda installer, we can include a custom function that helps connect the Spark installation with the Jupyter notebook installation from Anaconda3. For the purposes of this chapter and the remaining chapters, we will name that function sparknotebook. The configuration should appear as follows for sparknotebook():
function sparknotebook()
{
export SPARK_HOME=/home/asherif844/spark-2.2.0-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
$SPARK_HOME/bin/pyspark
}
- The updated .bashrc script should look like the following once saved:

- Save and exit from the .bashrc file. It is recommended to apply the changes to the .bashrc file by executing the following command and restarting the terminal application:
$ source .bashrc
Our goal in this section is to integrate Spark directly into a Jupyter notebook so that we are not doing our development at the terminal and instead utilizing the benefits of developing within a notebook. This section explains how the Spark integration within a Jupyter notebook takes place.
- We will create a command function, sparknotebook, that we can call from the terminal to open up a Spark session through Jupyter notebooks from the Anaconda installation. This requires two settings to be set in the .bashrc file:
  - PySpark Python to be set to python3
  - The PySpark driver for Python to be set to Jupyter
- The sparknotebook function can now be accessed directly from the terminal by executing the following command:
$ sparknotebook
- The function should then initiate a brand new Jupyter notebook session through the default web browser. A new Python script within Jupyter notebooks with a .ipynb extension can be created by clicking on the New button on the right-hand side and by selecting Python 3 under Notebook:, as seen in the following screenshot:

- Once again, just as was done at the terminal level for Spark, a simple script of sc will be executed within the notebook to confirm that Spark is up and running through Jupyter:

- Ideally, the Version, Master, and AppName should be identical to the earlier output when sc was executed at the terminal. If this is the case, then PySpark has been successfully installed and configured to work with Jupyter notebooks.
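If you prefer explicit checks over reading the sc banner, the individual attributes can also be printed directly in a notebook cell. This is a minimal sketch; it assumes the notebook was launched through the sparknotebook function, so that sc is already defined:

# Run in a Jupyter notebook cell started through sparknotebook,
# where the SparkContext is already available as sc.
print(sc.version)   # expected to match the terminal output, for example 2.2.0
print(sc.master)    # expected: local[*]
print(sc.appName)   # expected: PySparkShell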
It is important to note that if we were to call a Jupyter notebook through the terminal without specifying sparknotebook, our Spark session will never be initiated and we will receive an error when executing the SparkContext script.
We can access a traditional Jupyter notebook by executing the following at the terminal:
jupyter-notebook
Once we start the notebook, we can try and execute the same script for sc.master as we did previously, but this time we will receive the following error:

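A defensive way to detect this situation from inside any notebook is to check whether sc exists before using it. This is a hedged sketch rather than part of the recipe; the printed messages are illustrative only:

# Check whether a SparkContext was created when the notebook was launched.
try:
    sc  # raises NameError if no SparkContext was created at launch
    print("SparkContext is available:", sc.master)
except NameError:
    print("No SparkContext found; launch the notebook with the sparknotebook function instead.")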
There are many managed offerings online from companies offering Spark through a notebook interface, where the installation and configuration of Spark with a notebook have already been managed for you. These are the following:
- Hortonworks (https://hortonworks.com/)
- Cloudera (https://www.cloudera.com/)
- MapR (https://mapr.com/)
- DataBricks (https://databricks.com/)
For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.
This section walks through the steps to initialize and configure a Spark cluster.
- Import SparkSession using the following script:
from pyspark.sql import SparkSession
- Configure SparkSession with a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()
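As a quick sanity check, and not a required part of the recipe, a tiny DataFrame can be created with the new session to confirm that the cluster accepts work; the column names and values below are purely illustrative:

# Assumes the spark session created above is available in the same notebook.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()                          # displays the two rows
print(spark.sparkContext.master)   # expected: local[*]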
This section explains how the SparkSession works as an entry point to develop within Spark.
- Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark. Those steps are no longer needed, as importing SparkSession will handle initializing a cluster. Additionally, it is important to note that SparkSession is part of the sql module from pyspark.
- We can assign properties to our SparkSession:
  - master: assigns the Spark master URL to run on our local machine with the maximum available number of cores
  - appName: assigns a name for the application
  - config: assigns 6gb to spark.executor.memory
  - getOrCreate: ensures that a SparkSession is created if one is not available and retrieves an existing one if it is available
For development purposes, while we are building an application on smaller datasets, we can just use master("local"). If we were to deploy on a production environment, we would want to specify master("local[*]") to ensure we are using the maximum cores available and get optimal performance.
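For illustration, the only difference between the two configurations is the master string passed to the builder; everything else in the recipe stays the same. This is a hedged sketch and the application name is just a placeholder:

from pyspark.sql import SparkSession

# Choose one of the two master settings below. getOrCreate() reuses an existing
# session, so stop any running session before switching configurations.

# While prototyping on smaller datasets, a single local core is enough:
spark = SparkSession.builder.master("local").appName("GenericAppName").getOrCreate()

# To use every core available on the machine instead:
# spark = SparkSession.builder.master("local[*]").appName("GenericAppName").getOrCreate()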
To learn more about SparkSession.builder
, visit the following website:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html
Once we are done developing on our cluster, it is ideal to shut it down and preserve resources.
This section walks through the steps to stop the SparkSession.
- Execute the following script:
spark.stop()
- Confirm that the session has closed by executing the following script:
sc.master
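If you prefer a check that does not rely on reading the error output, the following hedged sketch wraps a small job in a try/except; the exact exception type and message depend on the Spark version, so treat them as illustrative:

# After spark.stop(), submitting any new work should fail.
try:
    spark.sparkContext.parallelize([1, 2, 3]).count()
    print("Cluster is still running")
except Exception as e:
    print("Cluster has been stopped:", e)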
This section explains how to confirm that a Spark cluster has been shut down.
- If the cluster has been shut down, you will receive the error message seen in the following screenshot when executing another Spark command in the notebook:
