Creating a Supercomputer

Rick Golden

December 2015

In this article by Rick Golden, the author of the book Raspberry Pi Networking Cookbook - Second Edition, we will learn how to create a supercomputer from four Raspberry Pis using Apache Spark.

This article turns four Raspberry Pis into a supercomputer using Apache Spark.

Apache Spark™ is a fast and general engine for large-scale data processing. In this recipe, Apache Spark is installed on four Raspberry Pis that have been networked into a small computer cluster. The cluster is then used to demonstrate the speed of supercomputing by calculating the value of Pi using a Monte Carlo algorithm.

After reading this article, you will have a Raspberry Pi supercomputer.

Getting ready

The following ingredients are required to create a supercomputer:

  • Four basic networking setups for the Raspberry Pi
  • A high-speed network switch

This recipe does not require the desktop GUI; it can be run either from the text-based console or from within LXTerminal.

With the Secure Shell server running on each Raspberry Pi, this recipe can be completed remotely using a Secure Shell client. Typically, a computing cluster is managed remotely.

All the Raspberry Pis should be connected directly to the same network switch.
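
The recipe relies on mDNS (.local) hostnames to reach each Raspberry Pi. Once the hostnames have been set in the first step of the recipe, a quick way to confirm that every Raspberry Pi is reachable is to ping each one from spark-master (a minimal check, assuming .local name resolution is working, as the rest of the recipe assumes):

    pi@spark-master ~ $ for host in spark-slave-a spark-slave-b spark-slave-c; do ping -c 1 $host.local; done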

How to do it...

Perform the following steps to build a Raspberry Pi supercomputer:

  1. Log in to each Raspberry Pi and set its hostname. One Raspberry Pi will be the Spark master server, and the other three will be Spark slaves. Name the four Raspberry Pis spark-master, spark-slave-a, spark-slave-b, and spark-slave-c.
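
    The exact procedure depends on your version of Raspbian; raspi-config can set the hostname interactively, or it can be set by hand. A minimal sketch for spark-master, assuming the Raspberry Pi still has the default hostname (raspberrypi), is shown below; repeat it on each Raspberry Pi with its own name:

    pi@raspberrypi ~ $ echo "spark-master" | sudo tee /etc/hostname

    pi@raspberrypi ~ $ sudo sed -i 's/raspberrypi/spark-master/' /etc/hosts   # assumes the default hostname

    pi@raspberrypi ~ $ sudo reboot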
  2. Now, let's set up secure communication between the master and the slaves. Use the ssh-keygen command on spark-master to generate a pair of SSH keys. Press <enter> to accept the default file location (/home/pi/.ssh/id_rsa). Then, press <enter> twice to use an empty passphrase (the Spark automation requires an empty passphrase).
    pi@spark-master ~ $ ssh-keygen
    
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/pi/.ssh/id_rsa): 
    
    Enter passphrase (empty for no passphrase): 
    
    Enter same passphrase again: 
    
    Your identification has been saved in /home/pi/.ssh/id_rsa.
    Your public key has been saved in /home/pi/.ssh/id_rsa.pub.
    The key fingerprint is:
    29:0e:95:61:a6:e6:30:8f:23:66:cd:68:d3:c4:0c:8e pi@spark-master
    
    The key's randomart image is:
    +---[RSA 2048]----+
    | .    +          |
    |o +  + o         |
    |E.o+o o          |
    |  *B .   .       |
    |.*o++ . S        |
    |+... o .         |
    |      .          |
    |                 |
    |                 |
    +-----------------+
    
    pi@spark-master ~ $ 
  3. Use the ssh-copy-id command to copy the newly created public key from spark-master to each of the Spark slaves (spark-slave-a, spark-slave-b, and spark-slave-c), as follows:
    pi@spark-master ~ $ ssh-copy-id pi@spark-slave-a.local
    
    The authenticity of host 'spark-slave-a.local (192.168.2.6)' can't be established.
    ECDSA key fingerprint is e9:55:ff:6c:69:be:5d:8f:80:de:b2:d9:85:eb:1b:90.
    
    Are you sure you want to continue connecting (yes/no)? yes
    
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    
    pi@spark-slave-a.local's password: 
    
    Number of key(s) added: 1

    Repeat step 3 for each of the remaining slaves: spark-slave-b and spark-slave-c.
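
    A short shell loop is one way to handle both remaining slaves in a single pass (a sketch; each slave still prompts once for the pi user's password):

    pi@spark-master ~ $ for slave in spark-slave-b spark-slave-c; do ssh-copy-id pi@$slave.local; done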

  4. Note that a secure shell login (ssh) from spark-master to the slaves no longer requires a password for authentication:
    pi@spark-master ~ $ ssh spark-slave-a.local
    
    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    
    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    Last login: Mon Nov  2 21:29:28 2015 from 192.168.2.1
    
    pi@spark-slave-a ~ $ 
  5. Now, download the Apache Spark software distribution. Use a web browser to locate the correct Apache Spark software distribution package on the Apache Spark website's download page (http://spark.apache.org/downloads.html), which is shown in the following screenshot:
  6. On the download page, use the following drop-down options:
    1. Choose a Spark release: 1.5.1 (Oct 02 2015)
    2. Choose a package type: Pre-built for Hadoop 2.6 and later
    3. Choose a download type: Select Apache Mirror

      Once the correct choices have been made for 1, 2, and 3, click on the link (spark-1.5.1-bin-hadoop2.6.tgz) that appears at 4. Download Spark.

  7. Note that the next web page displays the actual download link for the correct Apache Spark software distribution package (http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz).
  8. Use the wget command on spark-master to download the Apache Spark software distribution package, as follows:
    pi@spark-master ~ $ wget http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2....
    
    --2015-11-05 17:41:01--  http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2....
    Resolving www.eu.apache.org (www.eu.apache.org)... 88.198.26.2, 2a01:4f8:130:2192::2
    Connecting to www.eu.apache.org (www.eu.apache.org)|88.198.26.2|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 280901736 (268M) [application/x-gzip]
    Saving to: 'spark-1.5.1-bin-hadoop2.6.tgz'
    
    spark-1.5.1-bin-hadoop2.6.tgz   100%[============>] 267.89M   726KB/s   in 4m 38s 
    
    2015-11-05 17:45:39 (987 KB/s) - 'spark-1.5.1-bin-hadoop2.6.tgz' saved [280901736/280901736]
    
    pi@spark-master ~ $ 
  9. Use the scp command to copy the Apache Spark software distribution package to each slave (spark-slave-a, spark-slave-b, and spark-slave-c), as follows:
    pi@spark-master ~ $ scp spark-1.5.1-bin-hadoop2.6.tgz spark-slave-a.local:.
    
    spark-1.5.1-bin-hadoop2.6.tgz                       100%  268MB   4.4MB/s   01:01    
    
    pi@spark-master ~ $ scp spark-1.5.1-bin-hadoop2.6.tgz spark-slave-b.local:.
    
    spark-1.5.1-bin-hadoop2.6.tgz                       100%  268MB   4.0MB/s   01:07    
    
    pi@spark-master ~ $ scp spark-1.5.1-bin-hadoop2.6.tgz spark-slave-c.local:.
    
    spark-1.5.1-bin-hadoop2.6.tgz                       100%  268MB   5.2MB/s   00:47    
    
    pi@spark-master ~ $ 
  10. Use the tar command to unpack the Apache Spark software distribution on each Raspberry Pi (spark-master, spark-slave-a, spark-slave-b, and spark-slave-c), as follows:
    pi@spark-master ~ $ tar xvfz spark-1.5.1-bin-hadoop2.6.tgz 
    
    spark-1.5.1-bin-hadoop2.6/
    spark-1.5.1-bin-hadoop2.6/NOTICE
    spark-1.5.1-bin-hadoop2.6/CHANGES.txt
    spark-1.5.1-bin-hadoop2.6/python/
    spark-1.5.1-bin-hadoop2.6/python/run-tests.py
    spark-1.5.1-bin-hadoop2.6/python/test_support/
    spark-1.5.1-bin-hadoop2.6/python/test_support/userlibrary.py
    spark-1.5.1-bin-hadoop2.6/python/test_support/userlib-0.1.zip
    spark-1.5.1-bin-hadoop2.6/python/test_support/sql/
    spark-1.5.1-bin-hadoop2.6/python/test_support/sql/people.json
    spark-1.5.1-bin-hadoop2.6/python/test_support/sql/orc_partitioned/
    spark-1.5.1-bin-hadoop2.6/python/test_support/sql/orc_partitioned/b=1/
    spark-1.5.1-bin-hadoop2.6/python/test_support/sql/orc_partitioned/b=1/c=1/
    
    ...

    Repeat step 10 on each of the remaining Raspberry Pis: spark-slave-a, spark-slave-b, and spark-slave-c.
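
    Because the software distribution was already copied to each slave in step 9, the unpacking can also be driven remotely from spark-master with a loop (a sketch that relies on the passwordless ssh set up in step 3):

    pi@spark-master ~ $ for slave in spark-slave-a spark-slave-b spark-slave-c; do ssh $slave.local tar xfz spark-1.5.1-bin-hadoop2.6.tgz; done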

  11. Use the mv command to move the Apache Spark installation directory (spark-1.5.1-bin-hadoop2.6) to a more convenient location on each Raspberry Pi (/opt/spark), as follows:
    pi@spark-master ~ $ sudo mv spark-1.5.1-bin-hadoop2.6 /opt/spark
    
    pi@spark-master ~ $ 
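
    The same move is needed on each slave. As with unpacking, it can be driven remotely from spark-master (a sketch that assumes the default passwordless sudo for the pi user on Raspbian):

    pi@spark-master ~ $ for slave in spark-slave-a spark-slave-b spark-slave-c; do ssh $slave.local sudo mv spark-1.5.1-bin-hadoop2.6 /opt/spark; done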
  12. Now, configure the Spark master. Use the cat command on spark-master to create a list of slaves, as follows:
    pi@spark-master ~/ $ cat <<EOD >/opt/spark/conf/slaves
    
    spark-slave-a.local
    spark-slave-b.local
    spark-slave-c.local
    
    EOD
    
    pi@spark-master ~/ $ 
  13. Use the echo command on spark-master to create an execution environment configuration file (/opt/spark/conf/spark-env.sh). The configuration file should have one line, which sets the IP address of the Spark master server (SPARK_MASTER_IP) to the IP address of the Raspberry Pi named spark-master (hostname -I), as follows:
    pi@spark-master ~ $ echo "SPARK_MASTER_IP=`hostname -I`" >/opt/spark/conf/spark-env.sh
  14. Use the scp command on spark-master to copy the Spark execution environment configuration file (spark-env.sh) to each Spark slave (spark-slave-a, spark-slave-b, and spark-slave-c), as follows:
    pi@spark-master ~ $ scp /opt/spark/conf/spark-env.sh spark-slave-a.local:/opt/spark/conf/spark-env.sh
    
    spark-env.sh                                  100%   30     0.0KB/s   00:00    
    
    pi@spark-master ~ $ scp /opt/spark/conf/spark-env.sh spark-slave-b.local:/opt/spark/conf/spark-env.sh
    
    spark-env.sh                                  100%   30     0.0KB/s   00:00    
    
    pi@spark-master ~ $ scp /opt/spark/conf/spark-env.sh spark-slave-c.local:/opt/spark/conf/spark-env.sh
    
    spark-env.sh                                  100%   30     0.0KB/s   00:00    
    
    pi@spark-master ~ $ 
  15. Use the echo command on spark-master to append an additional memory constraint (SPARK_DRIVER_MEMORY=512m) to the execution environment (spark-env.sh) of the Spark master server (spark-master) so that enough memory remains free on the master server to run Spark jobs, as follows:
    pi@spark-master ~ $ echo "SPARK_DRIVER_MEMORY=512m" >>/opt/spark/conf/spark-env.sh
    
    pi@spark-master ~ $ 
  16. Use the echo command on spark-master to append the local IP address (SPARK_LOCAL_IP) to the execution environment (spark-env.sh). This reduces the warnings in the output from the Spark jobs:
    pi@spark-master ~ $ echo "SPARK_LOCAL_IP=$(hostname -I)" >>/opt/spark/conf/spark-env.sh
    
    pi@spark-master ~ $ 
  17. Use the sed command to change the logging level of the Spark jobs from INFO, which produces a lot of informational output, to WARN, which produces a lot less output, as follows:
    pi@spark-master ~ $ sed 's/rootCategory=INFO/rootCategory=WARN/' /opt/spark/conf/log4j.properties.template >/opt/spark/conf/log4j.properties
    
    pi@spark-master ~ $ 
  18. At this point, the Spark cluster is ready to start.

    The next steps calculate Pi both with and without the Spark cluster so that the duration of the two calculation methods can be compared.

  19. Now, calculate Pi without using the Spark cluster. Use the cat command on spark-master to create a simple Python script to calculate Pi without using the Spark cluster, as follows:
    pi@spark-master ~ $ cat <<EOD >pi.py
    
    from operator import add
    from random   import random
    from time     import clock
    
    
    MSG  = "Python estimated Pi at %f in %f seconds"
    
    n = 1000000
    
    
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0
    
    
    def main():
        st = clock()
        tries = map( f, range( 1, n + 1 ) )
        count = reduce( add, tries )
        et = clock()
        print( MSG % ( 4.0 * count / n, et - st ) )    
    
    
    if __name__ == "__main__":
        main()
    
    
    EOD
    
    pi@spark-master ~ $ 
  20. Use the python command on spark-master to run the script to calculate Pi (pi.py) without a Spark cluster, as follows:
    pi@spark-master ~ $ python pi.py 
    
    Python estimated Pi at 3.141444 in 13.430613 seconds
    
    pi@spark-master ~ $
  21. Note that it took one Raspberry Pi (spark-master) more than 13 seconds (13.430613 seconds) to calculate Pi without using Spark.
  22. Now, calculate Pi using the Spark cluster. Use the cat command on spark-master to create a simple Python script that parallelizes the calculation of Pi for use on the Spark cluster, as follows:
    pi@spark-master ~ $ cat <<EOD >pi-spark.py
    
    from __future__ import print_function
    
    from operator   import add
    from random     import random
    from sys        import argv
    from time       import clock
    
    from pyspark    import SparkConf, SparkContext
    
    
    APP_NAME = "MonteCarloPi"
    MSG      = "Spark estimated Pi at %f in %f seconds using %i partitions"
    
    master     =      argv[ 1 ]   if len( argv ) > 1 else "local"
    partitions = int( argv[ 2 ] ) if len( argv ) > 2 else 2
    
    n = 1000000
    
    
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0
    
    
    def main(sc):
        st    = clock()
        tries = sc.parallelize( range( 1, n + 1 ), partitions ).map( f )
        count = tries.reduce( add )
        et    = clock()
        print( MSG % ( 4.0 * count / n, et - st, partitions ) )    
    
    
    if __name__ == "__main__":
        conf = SparkConf()
        conf.setMaster( master )
        conf.setAppName( APP_NAME )
        sc = SparkContext( conf = conf )
        main( sc )
        sc.stop()
    
    
    EOD
    
    pi@spark-master ~ $ 
  23. Use the start-all.sh shell script on spark-master to start the Apache Spark cluster. Starting the cluster may take 30 seconds:
    pi@spark-master ~ $ /opt/spark/sbin/start-all.sh
    
    starting org.apache.spark.deploy.master.Master, logging to /opt/spark/sbin/../logs/spark-pi-org.apache.spark.deploy.master.Master-1-spark-master.out
    spark-slave-c.local: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/sbin/../logs/spark-pi-org.apache.spark.deploy.worker.Worker-1-spark-slave-c.out
    spark-slave-b.local: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/sbin/../logs/spark-pi-org.apache.spark.deploy.worker.Worker-1-spark-slave-b.out
    spark-slave-a.local: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/sbin/../logs/spark-pi-org.apache.spark.deploy.worker.Worker-1-spark-slave-a.out
    
    pi@spark-master ~ $ 
  24. Use a web browser to view the status of the cluster by browsing to the cluster status page at http://spark-master.local:8080/, as shown in the following screenshot:
  25. Wait until the Spark master server and all three slaves have started. Three worker IDs will be displayed on the status page when the cluster is ready to compute. Refresh the page, if necessary.
  26. Submit the Python script (pi-spark.py) that is used to calculate Pi to the Spark cluster, as follows:
    pi@spark-master ~ $ export SPARK_MASTER_URL="spark://$(hostname -I | tr -d '[:space:]'):7077"
    
    pi@spark-master ~ $ export PATH=/opt/spark/bin:$PATH
    
    
    pi@spark-master ~ $ spark-submit pi-spark.py $SPARK_MASTER_URL 24
    
    15/11/04 21:39:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/11/04 21:39:51 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    [Stage 0:>                                                         (0 + 0) / 24]15/11/04 21:40:00 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
    15/11/04 21:40:05 WARN TaskSetManager: Stage 0 contains a task of very large size (122 KB). The maximum recommended task size is 100 KB.
    
    Spark estimated Pi at 3.143368 in 0.720023 seconds using 24 partitions
    
    pi@spark-master ~ $ 
  27. Note that it took the Spark cluster less than a second (0.720023 seconds) to calculate Pi. That's more than 18 times faster!
  28. The Raspberry Pi supercomputer is working!

How it works...

This recipe has the following six parts:

  • Setting up a secure communication between the master and slaves
  • Downloading the Apache Spark software distribution
  • Installing Apache Spark on each Raspberry Pi in the cluster
  • Configuring the Spark master
  • Calculating Pi without using the Spark cluster
  • Calculating Pi using the Spark cluster

The recipe begins by setting the hostnames of the four Raspberry Pi computers. One Raspberry Pi is selected as the Spark master (spark-master), and the other three Raspberry Pis are the Spark slaves (spark-slave-a, spark-slave-b, and spark-slave-c).

Setting up secure communication between master and slaves

After the hostnames have been set, the ssh-keygen and ssh-copy-id commands are used to establish a secure communication link between the Spark master (spark-master) and each of its slaves (spark-slave-a, spark-slave-b, and spark-slave-c).

The ssh-keygen command is used to create a secure key pair (/home/pi/.ssh/id_rsa and /home/pi/.ssh/id_rsa.pub). The ssh-copy-id command is used to copy the public key (id_rsa.pub) from spark-master to each of the slaves.

After the public key of spark-master has been copied to each slave, it is possible to log in from spark-master to each slave without using a password. Having a secure login from a master to a slave without a password is a requirement for the automation of the startup of the cluster.

Downloading the Apache Spark software distribution

The Apache Spark download page (http://spark.apache.org/downloads.html) presents a number of choices that are used to determine the correct software distribution.

This recipe uses the 1.5.1 (Oct 02 2015) release of Spark that has been pre-built for Hadoop 2.6 and later. Once the correct choices have been made, a link is presented (spark-1.5.1-bin-hadoop2.6.tgz), which leads to the actual download page.

The wget command is used to download the Spark software distribution to spark-master, using the link presented on the actual download page (http://www.eu.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz).

The software distribution has a size of about 280 MB (wget displays it as 268M, in binary megabytes). It will take a while to download.

Once the Spark software distribution (spark-1.5.1-bin-hadoop2.6.tgz) is downloaded to spark-master, it is then copied using the scp command to the three slaves (spark-slave-a, spark-slave-b, and spark-slave-c).

Installing Apache Spark on each Raspberry Pi in the cluster

The tar command is used to unpack the Apache Spark software distribution (spark-1.5.1-bin-hadoop2.6.tgz) on each Raspberry Pi in the cluster (spark-master, spark-slave-a, spark-slave-b, and spark-slave-c).

After the software distribution has been unpacked in the home directory of the pi user, it is moved with the mv command to a more central location (/opt/spark).

Configuring the Spark master

The cat command is used to create a list of slaves (/opt/spark/conf/slaves). This list is used during the cluster startup to automatically start the slaves when spark-master is started. All the lines after the cat command up to the end-of-data (EOD) mark are copied to the list of slaves.

The echo command is used to create the Spark runtime environment file (spark-env.sh under /opt/spark/conf/) with one environment variable (SPARK_MASTER_IP) that is set to the IP address of spark-master (hostname -I).

The Spark runtime environment configuration file, spark-env.sh, is then copied from the spark-master to each slave (spark-slave-a, spark-slave-b, and spark-slave-c).

After the configuration file (spark-env.sh) has been copied to the slaves, two additional configuration parameters specific to spark-master are added to the file.

The echo command is used to append (>>) the SPARK_DRIVER_MEMORY parameter to the bottom of the configuration file. This parameter limits the memory used by the Spark driver on spark-master to 512m (512 MB), which leaves room in the spark-master memory pool to run the Spark jobs.

The echo command is also used to append the SPARK_LOCAL_IP parameter to the bottom of the configuration file (spark-env.sh). This parameter is set to the IP address of the spark-master (hostname -I). Setting this parameter eliminates some of the warning messages that occur when running the Spark jobs.
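
After these two parameters have been appended, the copy of spark-env.sh on spark-master contains three settings similar to the following (the address shown is only an example; the real value comes from hostname -I on your spark-master). The copies on the slaves contain only the first line, because the file was copied to them before the extra parameters were appended:

    SPARK_MASTER_IP=192.168.2.5      # example address; the real value comes from hostname -I
    SPARK_DRIVER_MEMORY=512m
    SPARK_LOCAL_IP=192.168.2.5       # example address; appended on spark-master only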

The sed command is used to modify the logging parameters of spark-master. The log4j.properties file is changed so that INFO messages are no longer displayed. Only warning messages (WARN) and error messages are displayed. This greatly reduces the output of the Spark jobs.

At this point, the Spark cluster is fully configured and ready to start.

Calculating Pi without using the Spark cluster

Before the Spark cluster is started, a simple Python script (pi.py) is created using the cat command to calculate Pi without using the Spark cluster.

This script (pi.py) uses the Monte Carlo method to estimate the value of Pi by randomly generating 1 million data points and testing each data point for inclusion in a circle. The ratio of points inside the circle to the total number of points will be approximately equal to Pi/4.
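
The ratio works out to Pi/4 because the random points (x, y) are drawn uniformly from a 2 x 2 square (each coordinate lies between -1 and 1), while the test x ** 2 + y ** 2 < 1 counts only the points that land inside the unit circle inscribed in that square:

    (points inside the circle) / (total points) ≈ (area of the circle) / (area of the square) = Pi / 4

    Pi ≈ 4 x count / n

In the run shown in step 20, this means that roughly 785,361 of the 1,000,000 random points fell inside the circle, since 4 x 785361 / 1000000 = 3.141444.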

More information on calculating the value of Pi, including how to use the Monte Carlo method, can be found on Wikipedia (https://en.wikipedia.org/wiki/Pi).

The Python script that is used to estimate the value of Pi takes more than 13 seconds to run on a single standalone Raspberry Pi.

Calculating Pi using the Spark cluster

Another Python script (pi-spark.py) is created using the cat command.

This new script (pi-spark.py) uses the same Monte Carlo method to estimate the value of Pi using 1 million random data points. However, this script uses the SparkContext (sc) to create a resilient distributed dataset (RDD) that parallelizes the million data points (range( 1, n + 1 )) so that they can be distributed among the slaves for the actual calculation (f).

After the script is created, the Spark cluster is started (/opt/spark/sbin/start-all.sh). The startup script (start-all.sh) uses the contents of the /opt/spark/conf/slaves file to locate and start the Spark slaves (spark-slave-a, spark-slave-b, and spark-slave-c).

A web browser is used to validate that all the slaves have started properly. The spark-master serves a small status website (http://spark-master.local:8080/) that displays the state of the cluster. The status page is not refreshed automatically, so you will need to reload it until all the workers have started.

Each Spark slave is given a Worker ID when it connects to spark-master. You will need to wait until there are three workers before you can submit the Spark jobs, with one worker for each slave (spark-slave-a, spark-slave-b, and spark-slave-c).

Once all the slaves (workers) have started, the pi-spark.py Python script can be submitted to the cluster using the spark-submit command.

The spark-submit command passes two parameters, namely $SPARK_MASTER_URL and 24, to the pi-spark.py script.

The value of the SPARK_MASTER_URL is used to configure (SparkConf conf) the location of the Spark master (conf.setMaster( master )).

The second parameter of the pi-spark.py script (24) determines the number of compute partitions that are used to parallelize the calculations. Partitions divide the total number of calculations into compute groups (24 distinct groups).

The number of partitions should be a multiple of the number of available CPU cores. Here, we are using 2 partitions for each available core (24 = 2 x 12). There are twelve cores available: four cores in each of the three Raspberry Pi slaves.
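
A quick way to confirm the core count is to run the nproc command on one of the Raspberry Pis; it reports the number of processor cores available:

    pi@spark-slave-a ~ $ nproc   # prints 4 on each of the quad-core slaves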

The SPARK_MASTER_URL and PATH environment variables are updated to simplify the spark-submit command line.

The SPARK_MASTER_URL is built from the IP address of spark-master, which is obtained with the hostname -I command. The tr command is used to strip (-d) the trailing whitespace ([:space:]) from the output of the hostname -I command.

The location of the Spark command directory (/opt/spark/bin) is prepended to the front of the PATH environment variable so that the Spark commands can be used without requiring their complete path.

Submitting the pi-spark.py script to the cluster for calculation takes a few seconds. However, once the calculation is distributed among the workers (slaves), it takes less than a second (0.720023 seconds) to estimate the value of Pi. The Spark cluster is more than 18 times faster than a standalone Raspberry Pi.

The Raspberry Pi supercomputer is running!

There's more...

This recipe only begins to explore the possibility of creating a supercomputer from low-cost Raspberry Pi computers. For Spark (and Hadoop, on which Spark builds), there are numerous packages for statistical calculation and data visualization. More information on supercomputing using Spark (and Hadoop) can be found on the Apache Software Foundation website (http://www.apache.org).
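
When you have finished experimenting, the whole cluster can be shut down from spark-master with the companion script to start-all.sh, which ships in the same sbin directory:

    pi@spark-master ~ $ /opt/spark/sbin/stop-all.sh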

Summary

In this article, you learned how to turn four Raspberry Pis into a supercomputer.
