
Apache Spark Deep Learning Cookbook

By Ahmed Sherif, Amrith Ravindra
About this book
Organizations these days need to integrate popular big data tools such as Apache Spark with highly efficient deep learning libraries if they’re looking to gain faster and more powerful insights from their data. With this book, you’ll discover over 80 recipes to help you train fast, enterprise-grade, deep learning models on Apache Spark. Each recipe addresses a specific problem, and offers a proven, best-practice solution to difficulties encountered while implementing various deep learning algorithms in a distributed environment. The book follows a systematic approach, featuring a balance of theory and tips with best practice solutions to assist you with training different types of neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). You’ll also have access to code written in TensorFlow and Keras that you can run on Spark to solve a variety of deep learning problems in computer vision and natural language processing (NLP), or tweak to tackle other problems encountered in deep learning. By the end of this book, you'll have the skills you need to train and deploy state-of-the-art deep learning models on Apache Spark.
Publication date: July 2018
Publisher: Packt
Pages: 474
ISBN: 9781788474221

 

Chapter 1. Setting Up Spark for Deep Learning Development

In this chapter, the following recipes will be covered:

  • Downloading an Ubuntu Desktop image
  • Installing and configuring Ubuntu with VMWare Fusion on macOS
  • Installing and configuring Ubuntu with Oracle VirtualBox on Windows
  • Installing and configuring Ubuntu Desktop for Google Cloud Platform
  • Installing and configuring Spark and prerequisites on Ubuntu Desktop
  • Integrating Jupyter notebooks with Spark
  • Starting and configuring a Spark cluster
  • Stopping a Spark cluster
 

Introduction


Deep learning is the focused study of machine learning algorithms that deploy neural networks as their main method of learning. Deep learning has exploded onto the scene within the last couple of years. Microsoft, Google, Facebook, Amazon, Apple, Tesla, and many other companies are all utilizing deep learning models in their apps, websites, and products. At the same time, Spark, an in-memory compute engine running on top of big data sources, has made it easy to process huge volumes of information at record speed. In fact, Spark has now become the leading big data development tool for data engineers, machine learning engineers, and data scientists.

Since deep learning models perform better with more data, the synergy between Spark and deep learning allowed for a perfect marriage. Almost as important as the code used to execute deep learning algorithms is the work environment that enables optimal development. Many talented minds are eager to develop neural networks to help answer important questions in their research. Unfortunately, one of the greatest barriers to the development of deep learning models is access to the necessary technical resources required to learn on big data. The purpose of this chapter is to create an ideal virtual development environment for deep learning on Spark.

 

Downloading an Ubuntu Desktop image


Spark can be set up for all types of operating systems, whether they reside on-premise or in the cloud. For our purposes, Spark will be installed on a Linux-based virtual machine with Ubuntu as the operating system. There are several advantages to using Ubuntu as the go-to virtual machine, not least of which is cost. Since they are based on open source software, Ubuntu operating systems are free to use and do not require licensing. Cost is always a consideration and one of the main goals of this publication is to minimize the financial footprint required to get started with deep learning on top of a Spark framework.

Getting ready

There are some minimum system requirements for installing and running the Ubuntu image:

  • Minimum of 2 GHz dual-core processor
  • Minimum of 2 GB system memory
  • Minimum of 25 GB of free hard drive space

How to do it...

Follow the steps in the recipe to download an Ubuntu Desktop image:

  1. In order to create a virtual machine of Ubuntu Desktop, it is necessary to first download the file from the official website: https://www.ubuntu.com/download/desktop.
  2. As of this writing, Ubuntu Desktop 16.04.3 is the most recent available version for download.

  3. Access the following file in .iso format once the download is complete:

    ubuntu-16.04.3-desktop-amd64.iso

How it works...

Virtual environments provide an optimal development workspace by isolating the relationship to the physical or host machine. Developers may be using all types of machines for their host environments, such as a MacBook running macOS, a Microsoft Surface running Windows, or even a virtual machine in the cloud with Microsoft Azure or AWS; however, to ensure consistency in the output of the code executed, a virtual environment within Ubuntu Desktop will be deployed that can be used and shared across a wide variety of host platforms.

There's more...

There are several options for desktop virtualization software, depending on whether the host environment is Windows or macOS. There are two common software applications for virtualization when using macOS:

  • VMWare Fusion
  • Parallels

See also

To learn more about Ubuntu Desktop, you can visit https://www.ubuntu.com/desktop.

 

Installing and configuring Ubuntu with VMWare Fusion on macOS


This section will focus on building a virtual machine using an Ubuntu operating system with VMWare Fusion.

Getting ready

A previous installation of VMWare Fusion is required on your system. If you do not currently have this, you can download a trial version from the following website:

https://www.vmware.com/products/fusion/fusion-evaluation.html

How to do it...

Follow the steps in the recipe to configure Ubuntu with VMWare Fusion on macOS:

  1. Once VMWare Fusion is up and running, click on the + button on the upper-left-hand side to begin the configuration process and select New..., as seen in the following screenshot:

  2. Once the selection has been made, select the option to Install from Disk or Image, as seen in the following screenshot:

  3. Select the operating system's iso file that was downloaded from the Ubuntu Desktop website, as seen in the following screenshot:

  4. The next step will ask whether you want to choose Linux Easy Install. It is recommended to do so, as well as to incorporate a Display Name/Password combination for the Ubuntu environment, as seen in the following screenshot:

  5. The configuration process is almost complete. A Virtual Machine Summary is displayed with the option to Customize Settings to increase the Memory and Hard Disk, as seen in the following screenshot:

  6. Anywhere from 20 to 40 GB of hard disk space is sufficient for the virtual machine; however, bumping up the memory to 2 GB or even 4 GB will assist with the performance of the virtual machine when executing Spark code in later chapters. Update the memory by selecting Processors and Memory under the Settings of the virtual machine and increasing the Memory to the desired amount, as seen in the following screenshot:

How it works...

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on VMWare Fusion. The memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.

There's more...

All that is remaining is to fire up the virtual machine for the first time, which initiates the installation process of the system onto the virtual machine. Once all the setup is complete and the user has logged in, the Ubuntu virtual machine should be available for development, as seen in the following screenshot:

See also

Aside from VMWare Fusion, there is also another product that offers similar functionality on a Mac. It is called Parallels Desktop for Mac. To learn more about VMWare and Parallels, and decide which program is a better fit for your development, visit the following websites:

 

Installing and configuring Ubuntu with Oracle VirtualBox on Windows


Unlike with macOS, there are several options to virtualize systems within Windows. This mainly has to do with the fact that virtualization on Windows is very common, as most developers are using Windows as their host environment and need virtual environments for testing purposes without affecting any of the dependencies that rely on Windows.

Getting ready

VirtualBox from Oracle is a common virtualization product and is free to use. Oracle VirtualBox provides a straightforward process to get an Ubuntu Desktop virtual machine up and running on top of a Windows environment.

How to do it...

Follow the steps in this recipe to configure Ubuntu with VirtualBox on Windows:

  1. Initiate an Oracle VM VirtualBox Manager. Next, create a new virtual machine by selecting the New icon and specify the Name, Type, and Version of the machine, as seen in the following screenshot:
  2. Select Expert Mode, as several of the configuration steps will be consolidated, as seen in the following screenshot:

Ideal memory size should be set to at least 2048 MB, or preferably 4096 MB, depending on the resources available on the host machine.

  3. Additionally, set an optimal hard disk size for an Ubuntu virtual machine performing deep learning algorithms to at least 20 GB, if not more, as seen in the following screenshot:
  4. Point the virtual machine manager to the start-up disk location where the Ubuntu iso file was downloaded, and then Start the creation process, as seen in the following screenshot:
  5. After allotting some time for the installation, select the Start icon to complete the virtual machine setup and get it ready for development, as seen in the following screenshot:

How it works...

The setup allows for manual configuration of the settings necessary to get Ubuntu Desktop up and running successfully on Oracle VirtualBox. As was the case with VMWare Fusion, the memory and hard drive storage can be increased or decreased based on the needs and availability of the host machine.

There's more...

Please note that some machines that run Microsoft Windows are not set up by default for virtualization, and users may receive an initial error indicating that VT-x is not enabled. This can be fixed by enabling virtualization in the BIOS during a reboot.

See also

To learn more about Oracle VirtualBox and decide whether or not it is a good fit, visit the following website and select Windows hosts to begin the download process: https://www.virtualbox.org/wiki/Downloads.

 

Installing and configuring Ubuntu Desktop for Google Cloud Platform


Previously, we saw how Ubuntu Desktop could be set up locally using VMWare Fusion. In this section, we will learn how to do the same on Google Cloud Platform.

Getting ready

The only requirement is a Google account username. Begin by logging in to your Google Cloud Platform using your Google account. Google provides a free 12-month subscription with $300 credited to your account. The setup will ask for your bank details; however, Google will not charge you for anything without explicitly letting you know first. Go ahead and verify your bank account and you are good to go.

How to do it...

Follow the steps in the recipe to configure Ubuntu Desktop for Google Cloud Platform:

  1. Once logged in to your Google Cloud Platform, access a dashboard that looks like the one in the following screenshot:

Google Cloud Platform Dashboard

  2. First, click on the product services button in the top-left-hand corner of your screen. In the drop-down menu, under Compute, click on VM instances, as shown in the following screenshot:
  3. Create a new instance and name it. We are naming it ubuntuvm1 in our case. Google Cloud automatically creates a project while launching an instance, and the instance will be launched under a project ID. The project may be renamed if required.

  4. After clicking on Create Instance, select the zone/area you are located in.
  5. Select Ubuntu 16.04 LTS under the boot disk, as this is the operating system that will be installed in the cloud. Please note that LTS stands for Long Term Support, meaning the release will have long-term support from Ubuntu’s developers.
  6. Next, under the boot disk options, select SSD persistent disk and increase the size to 50 GB for some added storage space for the instance, as shown in the following screenshot:

  7. Next, set Access scopes to Allow full access to all Cloud APIs.
  8. Under Firewall, check Allow HTTP traffic as well as Allow HTTPS traffic, as shown in the following screenshot:

Selecting the options Allow HTTP traffic and Allow HTTPS traffic

  9. Once the instance is configured as shown in this section, go ahead and create the instance by clicking on the Create button.

Note

After clicking on the Create button, you will notice that the instance gets created with a unique internal as well as external IP address. We will require this at a later stage. SSH stands for Secure Shell; an SSH tunnel is basically an encrypted way of communicating in client-server architectures. Think of it as data going to and from your laptop, as well as to and from Google's cloud servers, through an encrypted tunnel.

  10. Click on the newly created instance. From the drop-down menu, click on Open in browser window, as shown in the following screenshot:
  11. You will see that Google opens up a shell/terminal in a new window, as shown in the following screenshot:
  12. Once the shell is open, you should have a window that looks like the following screenshot:
  13. Type the following commands in the Google Cloud shell:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install gnome-shell
$ sudo apt-get install ubuntu-gnome-desktop
$ sudo apt-get install autocutsel
$ sudo apt-get install gnome-core
$ sudo apt-get install gnome-panel
$ sudo apt-get install gnome-themes-standard
  14. When presented with a prompt to continue or not, type y and press ENTER, as shown in the following screenshot:
  15. Once done with the preceding steps, type the following commands to set up the vncserver and allow connections to the local shell:
$ sudo apt-get install tightvncserver
$ touch ~/.Xresources
  16. Next, launch the server by typing the following command:
$ tightvncserver
  17. This will prompt you to enter a password, which will later be used to log in to the Ubuntu Desktop virtual machine. This password is limited to eight characters and needs to be set and verified, as shown in the following screenshot:
  18. A startup script is automatically generated by the shell, as shown in the following screenshot. This startup script can be accessed and edited by copying and pasting its PATH in the following manner:

  19. In our case, the command to view and edit the script is:
:~$ vim /home/amrith2kmeanmachine/.vnc/xstartup

This PATH may be different in each case. Ensure you set the right PATH. The vim command opens up the script in a text editor.

Note

The local shell generated a startup script as well as a log file. The startup script needs to be opened and edited in a text editor, which will be discussed next.

  20. After typing the vim command, the screen with the startup script should look something like the following screenshot:

  21. Type i to enter INSERT mode. Next, delete all the text in the startup script. It should then look like the following screenshot:
  22. Copy and paste the following code into the startup script:
#!/bin/sh
autocutsel -fork
xrdb $HOME/.Xresources
xsetroot -solid grey
export XKL_XMODMAP_DISABLE=1
export XDG_CURRENT_DESKTOP="GNOME-Flashback:Unity"
export XDG_MENU_PREFIX="gnome-flashback-"
unset DBUS_SESSION_BUS_ADDRESS
gnome-session --session=gnome-flashback-metacity --disable-acceleration-check --debug &
  23. The script should appear in the editor, as seen in the following screenshot:
  24. Press Esc to exit INSERT mode and type :wq to write and quit the file.
  25. Once the startup script has been configured, type the following command in the Google shell to kill the server and save the changes:
$ vncserver -kill :1
  26. This command should produce a process ID that looks like the one in the following screenshot:
  27. Start the server again by typing the following command:
$ vncserver -geometry 1024x640

The next series of steps will focus on setting up a secure shell tunnel into the Google Cloud instance from the local host. Before typing anything on the local shell/terminal, ensure that the Google Cloud SDK is installed. If not already installed, do so by following the instructions in the quick-start guide located at the following website:

https://cloud.google.com/sdk/docs/quickstart-mac-os-x

  28. Once the Google Cloud SDK is installed, open up the terminal on your machine and type the following commands to connect to the Google Cloud compute instance:
$ gcloud compute ssh \
YOUR INSTANCE NAME HERE \
--project YOUR PROJECT NAME HERE \
--zone YOUR ZONE HERE \
--ssh-flag "-L 5901:localhost:5901"
  29. Ensure that the instance name, project ID, and zone are specified correctly in the preceding commands. On pressing ENTER, the output on the local shell changes to what is shown in the following screenshot:
  30. Once you see the name of your instance followed by ":~$", it means that a connection has successfully been established between the local host/laptop and the Google Cloud instance. After successfully SSHing into the instance, we require software called VNC Viewer to view and interact with the Ubuntu Desktop that has now been set up on the Google Cloud Compute Engine. The following few steps will discuss how this is achieved.
  31. VNC Viewer may be downloaded using the following link:

https://www.realvnc.com/en/connect/download/viewer/

  32. Once installed, click to open VNC Viewer and, in the search bar, type localhost::5901, as shown in the following screenshot:
  33. Next, click on Continue when prompted with the following screen:
  34. This will prompt you to enter your password for the virtual machine. Enter the password that you set earlier while launching the tightvncserver command for the first time, as shown in the following screenshot:

  35. You will finally be taken to the desktop of your Ubuntu virtual machine on Google Cloud Compute Engine. Your Ubuntu Desktop screen should now look something like the following screenshot when viewed in VNC Viewer:

How it works...

You have now successfully set up VNC Viewer for interactions with the Ubuntu virtual machine/desktop. Anytime the Google Cloud instance is not in use, it is recommended to suspend or shut down the instance so that additional costs are not being incurred. The cloud approach is optimal for developers who may not have access to physical resources with high memory and storage.

There's more...

While we discussed Google Cloud as a cloud option for Spark,  it is possible to leverage Spark on the following cloud platforms as well:

  • Microsoft Azure
  • Amazon Web Services

See also

In order to learn more about Google Cloud Platform and sign up for a free subscription, visit the following website:

https://cloud.google.com/

 

Installing and configuring Spark and prerequisites on Ubuntu Desktop


Before Spark can get up and running, there are some necessary prerequisites that need to be installed on a newly minted Ubuntu Desktop. This section will focus on installing and configuring the following on Ubuntu Desktop:

  • Java 8 or higher
  • Anaconda
  • Spark

Getting ready

The only requirement for this section is having administrative rights to install applications onto the Ubuntu Desktop.

How to do it...

This section walks through the steps in the recipe to install Python 3, Anaconda, and Spark on Ubuntu Desktop:

  1. Install Java on Ubuntu through the terminal application, which can be found by searching for the app and then locking it to the launcher on the left-hand side, as seen in the following screenshot:
  2. Perform an initial test for Java on the virtual machine by executing the following command at the terminal:
java -version
  3. Execute the following four commands at the terminal to install Java:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
  4. After accepting the necessary license agreements for Oracle, perform a secondary test of Java on the virtual machine by executing java -version once again in the terminal. A successful installation for Java will display the following outcome in the terminal:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
  5. Next, install the most recent version of Anaconda. Current versions of Ubuntu Desktop come preinstalled with Python. While it is convenient that Python comes preinstalled with Ubuntu, the installed version is Python 2.7, as seen in the following output:
$ python --version
Python 2.7.12
  6. The current version of Anaconda is v4.4 and the current version of Python 3 is v3.6. Once downloaded, view the Anaconda installation file by accessing the Downloads folder using the following command:
$ cd Downloads/
~/Downloads$ ls
Anaconda3-4.4.0-Linux-x86_64.sh
  7. Once in the Downloads folder, initiate the installation for Anaconda by executing the following command:
~/Downloads$ bash Anaconda3-4.4.0-Linux-x86_64.sh 
Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue

Note

Please note that the version of Anaconda, as well as any other software installed, may differ as newer updates are released to the public. The version of Anaconda that we are using in this chapter and in this book can be downloaded from https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86.sh

  8. Once the Anaconda installation is complete, restart the Terminal application to confirm that Python 3 is now the default Python environment through Anaconda by executing python --version in the terminal:
$ python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
  9. The Python 2 version is still available under Linux, but will require an explicit call when executing a script, as seen in the following command:
~$ python2 --version
Python 2.7.12
  10. Visit the following website to begin the Spark download and installation process:

https://spark.apache.org/downloads.html

  11. Select the download link. The following file will be downloaded to the Downloads folder in Ubuntu:

spark-2.2.0-bin-hadoop2.7.tgz

  12. View the file at the terminal level by executing the following commands:
$ cd Downloads/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
  13. Extract the tgz file by executing the following command:
~/Downloads$ tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
  14. Another look at the Downloads directory using ls shows both the tgz file and the extracted folder:
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7 spark-2.2.0-bin-hadoop2.7.tgz
  15. Move the extracted folder from the Downloads folder to the Home folder by executing the following command:
~/Downloads$ mv spark-2.2.0-bin-hadoop2.7 ~/
~/Downloads$ ls
spark-2.2.0-bin-hadoop2.7.tgz
~/Downloads$ cd
~$ ls
anaconda3 Downloads Pictures Templates
Desktop examples.desktop Public Videos
Documents Music spark-2.2.0-bin-hadoop2.7
  16. Now, the spark-2.2.0-bin-hadoop2.7 folder has been moved to the Home folder, which can be viewed when selecting the Files icon on the left-hand side toolbar, as seen in the following screenshot:
  17. Spark is now installed. Initiate Spark by executing the following commands at the terminal:
~$ cd ~/spark-2.2.0-bin-hadoop2.7/
~/spark-2.2.0-bin-hadoop2.7$ ./bin/pyspark
  18. Perform a final test to confirm that Spark is up and running by executing the following command at the terminal, verifying that the SparkContext is driving the cluster in the local environment (an optional extra check follows the output):
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
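As an optional extra check (not part of the original recipe), a small computation can be run inside the same pyspark shell to confirm that jobs actually execute on the local cluster; the example below simply sums the squares of the first 100 integers:

>>> rdd = sc.parallelize(range(100))
>>> rdd.map(lambda x: x * x).sum()
328350

If the job runs and returns the value shown, the SparkContext is working end to end.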

How it works...

This section explains the reasoning behind the installation process for Python, Anaconda, and Spark.

  1. Since Spark runs on the Java virtual machine (JVM), the Java Software Development Kit (SDK) is a prerequisite installation for Spark to run on an Ubuntu virtual machine.

Note

In order for Spark to run on a local machine or in a cluster, a minimum version of Java 8 is required for installation.

  2. Ubuntu recommends the sudo apt install method for Java as it ensures that packages downloaded are up to date.
  3. Please note that if Java is not currently installed, the output in the terminal will show the following message:
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
  4. While Python 2 is fine, it is considered legacy Python. Python 2 is facing an end-of-life date in 2020; therefore, it is recommended that all new Python development be performed with Python 3, as will be the case in this publication. Until recently, Spark was only available with Python 2. That is no longer the case: Spark works with both Python 2 and 3. A convenient way to install Python 3, as well as many dependencies and libraries, is through Anaconda. Anaconda is a free and open source distribution of Python and R. Anaconda manages the installation and maintenance of many of the most common packages used in Python for data science-related tasks.
  5. During the installation process for Anaconda, it is important to confirm the following conditions:

    • Anaconda is installed in the /home/username/Anaconda3 location
    • The Anaconda installer prepends the Anaconda3 install location to a PATH in /home/username/.bashrc
  6. After Anaconda has been installed, download Spark. Unlike Python, Spark does not come preinstalled on Ubuntu and therefore needs to be downloaded and installed.
  7. For the purposes of development with deep learning, the following preferences will be selected for Spark:

    • Spark release 2.2.0 (Jul 11 2017)
    • Package type: Prebuilt for Apache Hadoop 2.7 and later
    • Download type: Direct download
  8. Once Spark has been successfully installed, the output from executing Spark at the command line should look something similar to that shown in the following screenshot:

  9. Two important features to note when initializing Spark are that it runs under the Python 3.6.1 | Anaconda 4.4.0 (64-bit) framework and that the Spark logo shows version 2.2.0 (a quick programmatic check of both follows this list).
  10. Congratulations! Spark is successfully installed on the local Ubuntu virtual machine. But not everything is complete. Spark development is best when Spark code can be executed within a Jupyter notebook, especially for deep learning. Thankfully, Jupyter was installed with the Anaconda distribution performed earlier in this section.
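If you prefer to confirm the Python and Spark versions programmatically rather than by reading the banner, a short check such as the following can be run inside the pyspark shell (an optional sketch, not a step from the recipe):

>>> import sys
>>> print(sys.version)   # expect Python 3.6.x from the Anaconda 4.4.0 distribution
>>> print(sc.version)    # expect 2.2.0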

There's more...

You may be asking why we did not just use pip install pyspark to use Spark in Python. Previous versions of Spark required going through the installation process that we did in this section. Versions of Spark starting with 2.2.0 allow installation directly through the pip approach. We used the full installation method in this section to ensure that you will be able to get Spark installed and fully integrated, in case you are using an earlier version of Spark.
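For reference, here is a minimal sketch of what using a pip-installed PySpark could look like on Spark 2.2.0 or later. The package name pyspark is correct, but treat the app name and the snippet itself as illustrative only, not as a replacement for the full installation above:

# assumes PySpark has been installed with: pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PipInstallCheck") \
    .getOrCreate()
print(spark.version)   # should print the installed Spark version
spark.stop()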

See also

To learn more about Jupyter notebooks and their integration with Python, visit the following website:

http://jupyter.org

To learn more about Anaconda and download a version for Linux, visit the following website: 

https://www.anaconda.com/download/.

 

Integrating Jupyter notebooks with Spark


When learning Python for the first time, it is useful to use Jupyter notebooks as an interactive development environment (IDE). This is one of the main reasons why Anaconda is so powerful. It fully integrates all of the dependencies between Python and Jupyter notebooks. The same can be done with PySpark and Jupyter notebooks. While Spark is written in Scala, PySpark allows code to be written in Python instead.

Getting ready

Most of the work in this section will just require accessing the .bashrc script from the terminal.

How to do it...

PySpark is not configured to work within Jupyter notebooks by default, but a slight tweak of the .bashrc script can remedy this issue. We will walk through these steps in this section:

  1. Access the .bashrc script by executing the following command:
$ nano .bashrc
  2. Scrolling all the way to the end of the script should reveal the last command modified, which should be the PATH set by Anaconda during the installation in the previous section. The PATH should appear as seen in the following:
# added by Anaconda3 4.4.0 installer
export PATH="/home/asherif844/anaconda3/bin:$PATH"
  3. Underneath the PATH added by the Anaconda installer, include a custom function that connects the Spark installation with the Jupyter notebook installation from Anaconda3. For the purposes of this chapter and the remaining chapters, we will name that function sparknotebook. The configuration should appear as follows for sparknotebook():
function sparknotebook()
{
export SPARK_HOME=/home/asherif844/spark-2.2.0-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
$SPARK_HOME/bin/pyspark
}
  4. The updated .bashrc script should look like the following once saved:
  5. Save and exit from the .bashrc file. Apply the update by executing the following command and restarting the terminal application:
$ source .bashrc

How it works...

Our goal in this section is to integrate Spark directly into a Jupyter notebook so that we are not doing our development at the terminal and instead utilizing the benefits of developing within a notebook. This section explains how the Spark integration within a Jupyter notebook takes place.

  1. We will create a command function, sparknotebook, that we can call from the terminal to open up a Spark session through Jupyter notebooks from the Anaconda installation. This requires two settings in the .bashrc file:
    1. The PySpark Python version is set to Python 3
    2. The PySpark driver for Python is set to Jupyter
  2. The sparknotebook function can now be accessed directly from the terminal by executing the following command:
$ sparknotebook
  3. The function should then initiate a brand new Jupyter notebook session through the default web browser. A new Python script within Jupyter notebooks, with a .ipynb extension, can be created by clicking on the New button on the right-hand side and selecting Python 3 under Notebook, as seen in the following screenshot:

  4. Once again, just as was done at the terminal level for Spark, a simple script of sc will be executed within the notebook to confirm that Spark is up and running through Jupyter:
  5. Ideally, the Version, Master, and AppName should be identical to the earlier output when sc was executed at the terminal. If this is the case, then PySpark has been successfully installed and configured to work with Jupyter notebooks (a quick check is sketched after this list).
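A minimal way to make that comparison explicit, assuming the notebook was launched with the sparknotebook function, is to print the same attributes in a notebook cell; this is an optional check rather than part of the recipe:

print(sc.version)    # should match the version shown at the terminal, for example 2.2.0
print(sc.master)     # should be local[*]
print(sc.appName)    # should be PySparkShell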

There's more...

It is important to note that if we were to call a Jupyter notebook through the terminal without specifying sparknotebook, our Spark session will never be initiated and we will receive an error when executing the SparkContext script.

We can access a traditional Jupyter notebook by executing the following at the terminal:

jupyter-notebook

Once we start the notebook, we can try and execute the same script for sc.master as we did previously, but this time we will receive the following error:
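If you do want to work from a plain jupyter-notebook session, one commonly used alternative (not covered in this recipe) is the findspark package, which adds an existing Spark installation to the Python path at runtime. The following is a minimal sketch, assuming findspark has been installed with pip install findspark and that Spark lives in the home folder as set up earlier:

import findspark
findspark.init('/home/asherif844/spark-2.2.0-bin-hadoop2.7')  # point this at your own Spark folder

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("NotebookSession") \
    .getOrCreate()
print(spark.sparkContext.master)   # confirm the session is running locally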

See also

There are many managed offerings online from companies that provide Spark through a notebook interface, where the installation and configuration of Spark with a notebook have already been managed for you. These include the following:

 

Starting and configuring a Spark cluster


For most chapters, one of the first things that we will do is to initialize and configure our Spark cluster.

Getting ready

Import the following before initializing the cluster:

  • from pyspark.sql import SparkSession

How to do it...

This section walks through the steps to initialize and configure a Spark cluster.

  1. Import SparkSession using the following script:
from pyspark.sql import SparkSession
  2. Configure SparkSession with a variable named spark using the following script:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GenericAppName") \
    .config("spark.executor.memory", "6gb") \
    .getOrCreate()

How it works...

This section explains how the SparkSession works as an entry point to develop within Spark.

  1. Starting with Spark 2.0, it is no longer necessary to create a SparkConf and SparkContext to begin development in Spark. Those steps are no longer needed, as importing SparkSession handles initializing a cluster. Additionally, it is important to note that SparkSession is part of the sql module from pyspark.
  2. We can assign properties to our SparkSession (a quick way to confirm them follows this list):
    1. master: assigns the Spark master URL to run on our local machine with the maximum available number of cores
    2. appName: assigns a name to the application
    3. config: assigns 6gb to spark.executor.memory
    4. getOrCreate: ensures that a SparkSession is created if one is not available and retrieves an existing one if it is available
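To confirm that these properties were actually applied, the newly created session can be queried directly; the following is an optional sketch rather than part of the recipe:

print(spark.version)                              # Spark version, for example 2.2.0
print(spark.sparkContext.master)                  # local[*]
print(spark.sparkContext.appName)                 # GenericAppName
print(spark.conf.get("spark.executor.memory"))    # 6gb
spark.range(5).show()                             # quick check that the session can run a job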

There's more...

For development purposes, while we are building an application on smaller datasets, we can just use master("local"). If we were to deploy to a production environment, we would want to specify master("local[*]") to ensure we are using the maximum number of cores available and get optimal performance.

See also

To learn more about SparkSession.builder, visit the following website:

https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/SparkSession.Builder.html

 

Stopping a Spark cluster


Once we are done developing on our cluster, it is ideal to shut it down and preserve resources.

How to do it...

This section walks through the steps to stop the SparkSession.

  1. Execute the following script:

spark.stop()

  2. Confirm that the session has closed by executing the following script:

sc.master

How it works...

This section explains how to confirm that a Spark cluster has been shut down.

  1. If the cluster has been shut down, you will receive the error message seen in the following screenshot when executing another Spark command in the notebook:
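The screenshot aside, the behaviour can be sketched in code as follows; the exact exception type and message vary by Spark version, but any attempt to run a job after stopping the session should fail:

spark.stop()

try:
    spark.sparkContext.parallelize([1, 2, 3]).collect()
except Exception as err:
    print("The Spark cluster has been stopped:", err)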

There's more...

Shutting down Spark clusters may not be as critical when working in a local environment; however, it will prove costly when Spark is deployed in a cloud environment where you are charged for compute power.

 

About the Authors
  • Ahmed Sherif

    Ahmed Sherif is a data scientist who has worked with data in various roles since 2005. He started off with BI solutions and transitioned to data science in 2013. In 2016, he obtained a master's in Predictive Analytics from Northwestern University, where he studied the science and application of machine learning and predictive modeling using both Python and R. Lately, he has been developing machine learning and deep learning solutions on the cloud using Azure. In 2016, he published his first book, Practical Business Intelligence. He currently works as a Technology Solutions Professional in Data and AI for Microsoft.

    Browse publications by this author
  • Amrith Ravindra

    Amrith Ravindra is a machine learning enthusiast who holds degrees in electrical and industrial engineering. While pursuing his masters, he dove deeper into the world of machine learning and developed a love for data science. Graduate-level courses in engineering gave him the mathematical background to launch himself into a career in machine learning. He met Ahmed Sherif at a local data science meetup in Tampa. They decided to put their brains together to write a book on their favorite machine learning algorithms. He hopes this book will help him achieve his ultimate goal of becoming a data scientist and actively contributing to machine learning.

    Browse publications by this author