Learning Apache Spark 2

Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781785885136
Pages 356 pages
Edition 1st Edition
Setting up Jupyter Notebook with Spark


In this section, we will look at how to set up a Jupyter notebook with Spark. If you have not yet worked in a notebook environment, it is worth understanding its benefits over a traditional environment. Note that Jupyter Notebook is only one of the many options available to users.

What is a Jupyter Notebook?

A Jupyter Notebook is an interactive computational environment that combines code execution, rich media and text, and data visualization through numerous visualization libraries. The notebook itself is just a small web application that you can use to create documents and add explanatory text before sharing them with your peers or colleagues. Jupyter notebooks are used at Google, Microsoft, IBM, NASA, and Bloomberg, among many other leading companies.

Setting up a Jupyter Notebook

Following are the steps to set up a Jupyter Notebook:

  • Prerequisites - You need Python 2.7 or Python 3.3 or later to install Jupyter Notebook.
  • Install Anaconda - Anaconda is recommended because it installs Python, Jupyter Notebook, and other commonly used packages for scientific computing and data science. You can download Anaconda from the following link: https://www.continuum.io/downloads.

Figure 11.6: Installing Anaconda-1

You can click the link to get access to the installer and download it on your Linux system:

Figure 11.7: Installing Anaconda-2

Once you have downloaded Anaconda, you can go ahead and install it.

Figure 11.8: Installing Anaconda-3

The installer will ask you questions about the install location and walk you through the license agreement, before asking you to confirm the installation and whether it should add the path to your bashrc file. You can then start the notebook using the following command:

jupyter notebook

Bear in mind that, by default, a notebook server runs locally at 127.0.0.1:8888. If local access is all you need, no further configuration is required. If you would like to open it to the public, however, you will need to secure your notebook server.
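To confirm that the server is actually listening on the default address mentioned above, a quick connectivity check can help. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is hypothetical and not part of Jupyter.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:8888 is the default notebook address described in the text.
    print(is_port_open("127.0.0.1", 8888))
```

Running this while the notebook server is up should report that the port is open; the same check against the machine's public address is a simple way to see whether the server is reachable from outside.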

Securing the notebook server

The notebook server can be protected with a single password by configuring the NotebookApp.password setting in the following file: jupyter_notebook_config.py.

This file should be located in the .jupyter directory under your home directory: ~/.jupyter. If you have just installed Anaconda, you might not have this directory. You can create it by executing the following command:

jupyter notebook --generate-config

Running this command creates the ~/.jupyter directory, if it does not already exist, along with a default configuration file:

Figure 11.9: Securing Jupyter for public access

Preparing a hashed password

You can use Jupyter to create a hashed password or prepare it manually.

Using Jupyter (only with version 5.0 and later)

You can issue the following command to create a hashed password:

jupyter notebook password

This will save the hashed password in your ~/.jupyter directory in a file called jupyter_notebook_config.json.
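If you want to check what was written, the JSON file can be read back with a few lines of Python. This is a minimal sketch that assumes the hash is stored under a NotebookApp key with a password entry; the helper name stored_password_hash is hypothetical.

```python
import json
from pathlib import Path


def stored_password_hash(config_path):
    """Return the hashed password found in a Jupyter JSON config file, or None."""
    path = Path(config_path).expanduser()
    if not path.is_file():
        return None
    with path.open() as fh:
        config = json.load(fh)
    # Assumed layout: {"NotebookApp": {"password": "sha1:..."}}
    return config.get("NotebookApp", {}).get("password")


if __name__ == "__main__":
    print(stored_password_hash("~/.jupyter/jupyter_notebook_config.json"))
```

The function returns None both when the file is missing and when no password has been set, so it can be run safely before and after issuing the command.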

Manually creating a hashed password

You can use Python to manually create the hashed password:

Figure 11.10: Manually creating a hashed password
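The figure is not reproduced here. In notebook versions of this era, the usual route was the passwd() helper in the notebook.auth module, which prompts for a password and prints the hash. To illustrate what such a value contains, the sketch below builds a string in the same algorithm:salt:digest layout as the sample hash used in this section, using only the standard library. The function names are hypothetical, and the sketch should be treated as an illustration of the format rather than a replacement for Jupyter's own helper.

```python
import hashlib
import secrets


def make_hashed_password(passphrase, algorithm="sha1", salt_len=12):
    """Build an 'algorithm:salt:digest' string from a passphrase."""
    # token_hex(n) returns 2*n hex characters, so this yields salt_len characters.
    salt = secrets.token_hex(salt_len // 2)
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_password(hashed, passphrase):
    """Return True if the passphrase matches the stored hash string."""
    algorithm, salt, digest = hashed.split(":", 2)
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return secrets.compare_digest(candidate, digest)


if __name__ == "__main__":
    hashed = make_hashed_password("my-secret-passphrase")
    print(hashed)
    print(check_password(hashed, "my-secret-passphrase"))
```

Because the salt is random, every run produces a different string for the same passphrase; verification works by re-hashing with the salt stored in the middle field.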

You can use either of these hashed passwords in your jupyter_notebook_config.py by setting it as the value of the c.NotebookApp.password parameter:

c.NotebookApp.password = u'sha1:cd7ef63fc00a:2816fd7ed6a47ac9aeaa2477c1587fd18ab1ecdc'

Figure 11.11: Using the generated hashed password

By default, the notebook server runs on port 8888; the same configuration file also contains an option to change the port (c.NotebookApp.port).

Since we want to allow public access to the notebook, the server has to accept connections from all IP addresses on any of its configured network interfaces. This can be done by making the following changes:

Figure 11.12: Configuring Notebook server to listen on all interfaces
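The figure is not reproduced here, so the fragment below is only a sketch of what such changes typically look like in jupyter_notebook_config.py for the notebook server of this era. The setting names are standard NotebookApp options; the specific values are assumptions, and the password is the sample hash from this section. This is a configuration fragment that Jupyter loads, not a standalone script.

```python
# Fragment for ~/.jupyter/jupyter_notebook_config.py.
# get_config() is provided by Jupyter when it loads this file.
c = get_config()

# Listen on all network interfaces instead of only 127.0.0.1.
c.NotebookApp.ip = '*'

# Keep the default port, or change it to any free port.
c.NotebookApp.port = 8888

# Do not try to open a browser on the server machine (assumed to be headless).
c.NotebookApp.open_browser = False

# Hashed password prepared earlier in this section.
c.NotebookApp.password = u'sha1:cd7ef63fc00a:2816fd7ed6a47ac9aeaa2477c1587fd18ab1ecdc'
```

After editing the file, restart the notebook server so that the new settings take effect.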

You can now run Jupyter, and access it from any computer with access to the notebook server:

Figure 11.13: Jupyter interface

Setting up PySpark on Jupyter

The next step is to integrate PySpark with Jupyter Notebook. Follow these steps to set up PySpark:

  1. Update your bashrc file and set the following variables:

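            # added by Anaconda3 4.3.0 installer
            export PATH="/root/anaconda3/bin:$PATH"
            # Python interpreter for PySpark (the driver is overridden below)
            export PYSPARK_PYTHON=/usr/bin/python
            # Spark installation directory, with its bin directory on the PATH
            export SPARK_HOME=/spark/spark-2.0.2/
            export PATH=$PATH:/spark/spark-2.0.2/bin
            # Start the driver through Jupyter so that pyspark opens a notebook
            export PYSPARK_DRIVER_PYTHON=jupyter
            export PYSPARK_DRIVER_PYTHON_OPTS=notebook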
  2. Configure PySpark Kernel: Create a file /usr/local/share/jupyter/kernels/pyspark/kernel.json with the following parameters:

            {
              "display_name": "PySpark",
              "language": "python",
              "argv": [ "/root/anaconda3/bin/python", "-m", "ipykernel",
              "-f", "{connection_file}" ],
              "env": {
                "SPARK_HOME": "/spark/spark-2.0.2/",
                "PYSPARK_PYTHON": "/root/anaconda3/bin/python",
                "PYTHONPATH": "/spark/spark-2.0.2/python/:/spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip",
                "PYTHONSTARTUP": "/spark/spark-2.0.2/python/pyspark/shell.py",
                "PYSPARK_SUBMIT_ARGS": "--master spark://sparkmaster:7077 pyspark-shell"
              }
            }
  3. Open the notebook: Now, when you open the Notebook with the jupyter notebook command, you will find an additional kernel installed. You can create new notebooks with the new kernel:

    Figure 11.14: New Kernel
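Hand-editing kernel.json is error-prone, since a stray line break inside a string makes the file invalid JSON. As an alternative, the kernel definition from step 2 can be generated from Python. This is a minimal sketch that reuses the paths and master URL from the steps above; the function names build_kernel_spec and write_kernel_spec are hypothetical, and the paths will differ on your system.

```python
import json
from pathlib import Path


def build_kernel_spec(spark_home, python_bin, master,
                      py4j_zip="py4j-0.10.3-src.zip"):
    """Return the PySpark kernel definition as a dictionary."""
    home = spark_home.rstrip("/")
    return {
        "display_name": "PySpark",
        "language": "python",
        "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": home + "/",
            "PYSPARK_PYTHON": python_bin,
            "PYTHONPATH": "{0}/python/:{0}/python/lib/{1}".format(home, py4j_zip),
            "PYTHONSTARTUP": "{0}/python/pyspark/shell.py".format(home),
            "PYSPARK_SUBMIT_ARGS": "--master {0} pyspark-shell".format(master),
        },
    }


def write_kernel_spec(kernel_dir, spec):
    """Write spec to kernel_dir/kernel.json and return the file path."""
    kernel_dir = Path(kernel_dir)
    kernel_dir.mkdir(parents=True, exist_ok=True)
    target = kernel_dir / "kernel.json"
    target.write_text(json.dumps(spec, indent=2) + "\n")
    return target


if __name__ == "__main__":
    spec = build_kernel_spec("/spark/spark-2.0.2/",
                             "/root/anaconda3/bin/python",
                             "spark://sparkmaster:7077")
    # Print rather than write: the system-wide kernels directory needs root access.
    print(json.dumps(spec, indent=2))
```

To install the kernel, pass the target directory from step 2 to write_kernel_spec, for example /usr/local/share/jupyter/kernels/pyspark, running with sufficient permissions.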
