Fast Data Processing with Spark - Second Edition

By Krishna Sankar, Holden Karau

About this book

Spark is a framework used for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big datasets.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

Publication date:
March 2015
Publisher
Packt
Pages
184
ISBN
9781784392574

 

Chapter 1. Installing Spark and Setting up your Cluster

This chapter will detail some common methods for setting up Spark. Spark on a single machine is excellent for testing or exploring small datasets, but here you will also learn to use Spark's built-in deployment scripts to launch a dedicated cluster via SSH (Secure Shell). This chapter will also explain how to deploy Spark on Mesos, on Hadoop clusters with YARN, and with Chef. For cloud deployments of Spark, this chapter will look at EC2 (both traditional EC2 and Elastic MapReduce). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming.

Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from https://spark.apache.org/downloads.html (Version 1.2.0 as of this writing). Spark currently releases roughly every 90 days. For coders who want to work with the latest builds, try cloning the code directly from the repository at https://github.com/apache/spark. The building instructions are available at https://spark.apache.org/docs/latest/building-spark.html. Both source code and prebuilt binaries are available from the downloads page. To interact with the Hadoop Distributed File System (HDFS), you need a Spark build that was compiled against the same version of Hadoop as your cluster. For Version 1.1.0 of Spark, the prebuilt packages are built against the available Hadoop Versions 1.x, 2.3, and 2.4. If you are up for the challenge, it is recommended that you build from source, as it gives you the flexibility of choosing which HDFS version to support as well as applying patches. In this chapter, we will do both.

To compile the Spark source, you will need the appropriate version of Scala and the matching JDK. The Spark source tar includes the required Scala components. The following discussion is only for information; there is no need to install Scala separately.

The Spark developers have done a good job of managing the dependencies. Refer to the https://spark.apache.org/docs/latest/building-spark.html web page for the latest information on this. According to the website, "Building Spark using Maven requires Maven 3.0.4 or newer and Java 6+." Scala gets pulled down as a dependency by Maven (currently Scala 2.10.4); it does not need to be installed separately, as it is just a bundled dependency.

As a note, Spark 1.1.0 requires Scala 2.10.4, while Version 1.2.0 runs on both Scala 2.10 and Scala 2.11; this came up recently on the Spark users' mailing list.

Tip

This brings up another interesting point about the Spark community. The two essential mailing lists are user@spark.apache.org and dev@spark.apache.org. More details about the Spark community are available at https://spark.apache.org/community.html.

 

Directory organization and convention


One handy convention is to download and install software in the /opt directory and to keep a generic soft link to Spark that points to the current version. For example, /opt/spark can point to /opt/spark-1.1.0 with the following command:

sudo ln -f -s spark-1.1.0 spark

Later, if you upgrade, say to Spark 1.2, you can change the softlink.
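
For example, after unpacking the new release under /opt, repointing the link is a one-liner. Note the -n (--no-dereference) flag, which makes GNU ln replace the existing link itself rather than creating a new link inside the old target directory; the directory name /opt/spark-1.2.0 is just an assumed example:

cd /opt
sudo ln -s -f -n spark-1.2.0 spark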

However, remember to copy any configuration changes and old logs when you move to a new distribution. A more flexible approach is to change the configuration directory to /etc/opt/spark and the log files to /var/log/spark/; that way, these stay independent of distribution updates. More details are available at https://spark.apache.org/docs/latest/configuration.html#overriding-configuration-directory and https://spark.apache.org/docs/latest/configuration.html#configuring-logging.
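
As a minimal sketch of that relocation (assuming you have created /etc/opt/spark and /var/log/spark, given the Spark user write access to them, and added these exports to the Spark user's shell profile), the launch scripts will then pick up the new locations:

export SPARK_CONF_DIR=/etc/opt/spark   # overrides the default conf/ directory
export SPARK_LOG_DIR=/var/log/spark    # where the daemon scripts write their logs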

 

Installing prebuilt distribution


Let's download prebuilt Spark and install it. Later, we will also compile a version from source. The download is straightforward. The page to go to for this is http://spark.apache.org/downloads.html. Select the options as shown in the following screenshot:

We will do a wget from the command line. You can do a direct download as well:

cd /opt
sudo wget http://apache.arvixe.com/spark/spark-1.1.1/spark-1.1.1-bin-hadoop2.4.tgz

We are downloading the prebuilt version for Apache Hadoop 2.4 from one of the possible mirrors. We could have easily downloaded other prebuilt versions as well, as shown in the following screenshot:

To uncompress it, execute the following command:

tar xvf spark-1.1.1-bin-hadoop2.4.tgz

To test the installation, run the following command:

/opt/spark-1.1.1-bin-hadoop2.4/bin/run-example SparkPi 10

It will fire up the Spark stack and calculate the value of Pi. The result should be as shown in the following screenshot:

 

Building Spark from source


Let's compile Spark on a new AWS instance. That way, you can clearly understand what the requirements are to get a Spark stack compiled and installed. I am using the Amazon Linux AMI, which has Java and the rest of the base stack installed by default. As this is a book on Spark, we can safely assume that you have the base configuration covered, so we will cover only the incremental installs for the Spark stack here.

Note

The latest instructions for building from the source are available at https://spark.apache.org/docs/latest/building-with-maven.html.

Downloading the source

The first order of business is to download the latest source from https://spark.apache.org/downloads.html. Select Source Code as the package type (option 2) and either download directly or select a mirror. The download page is shown in the following screenshot:

We can either download from the web page or use wget. We will do the wget from one of the mirrors, as shown in the following code:

cd /opt
sudo wget http://apache.arvixe.com/spark/spark-1.1.1/spark-1.1.1.tgz
sudo tar -xzf spark-1.1.1.tgz

Tip

The latest development source is on GitHub, available at https://github.com/apache/spark. The latest version can be checked out with a Git clone of https://github.com/apache/spark.git. This should be done only when you want to see the developments for the next version or when you are contributing to the source.
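
The clone itself is a single command; run it from whichever directory you want the source tree to live in (for example, /opt, following the convention used earlier):

git clone https://github.com/apache/spark.git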

Compiling the source with Maven

Compilation by nature is uneventful, but a lot of information gets displayed on the screen:

cd /opt/spark-1.1.1
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

For the preceding snippet to work, we need Maven installed on the system. If Maven is not installed on your system, the commands to install a recent version of Maven are given here:

cd /opt
sudo wget http://download.nextag.com/apache/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
sudo tar -xzf apache-maven-3.2.5-bin.tar.gz
sudo ln -f -s apache-maven-3.2.5 maven
export M2_HOME=/opt/maven
export PATH=${M2_HOME}/bin:${PATH}

Tip

Detailed Maven installation instructions are available at http://maven.apache.org/download.cgi#Installation.

Sometimes you will have to debug Maven using the -X switch. When I ran Maven, the Amazon Linux AMI didn't have the Java compiler! I had to install javac for the Amazon Linux AMI using the following command:

sudo yum install java-1.7.0-openjdk-devel

The compilation time varies. On my Mac, it took approximately 11 minutes. On an Amazon Linux t2.medium instance, it took 18 minutes. In the end, you should see a build success message like the one shown in the following screenshot:

Compilation switches

As an example, the compilation switches -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 are explained at https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version. -D defines a system property and -P activates a Maven profile.

A typical compile configuration that I use (for YARN, Hadoop Version 2.6 with Hive support) is given here:

mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests

Tip

You can also compile the source code in IDEA and then upload the built Version to your cluster.

Testing the installation

A quick way to test the installation is by calculating Pi:

/opt/spark/bin/run-example SparkPi 10

The result should be a few debug messages and then the value of Pi as shown in the following screenshot:

 

Spark topology


This is a good time to talk about the basic mechanics and mechanisms of Spark. We will progressively dig deeper, but for now let's take a quick look at the top level.

Essentially, Spark provides a framework to process vast amounts of data, be it gigabytes, terabytes, or occasionally petabytes. The two main ingredients are computation and scale. The size and effectiveness of the problems we can solve depend on these two factors, that is, the ability to apply complex computations over large amounts of data in a timely fashion. If our monthly runs take 40 days, we have a problem.

The key, of course, is parallelism, massive parallelism to be exact. We can make our computational algorithm tasks go parallel, that is, instead of doing the steps one after another, we can perform many steps in parallel; or we can carry out data parallelism, that is, we run the same algorithms over a partitioned dataset in parallel. In my humble opinion, Spark is extremely effective at data parallelism in an elegant framework.

As you will see in the rest of this book, the two main components are the Resilient Distributed Dataset (RDD) and the cluster manager. RDDs, with transformations and actions, are the main programming abstraction and present parallelized collections. Behind the scenes, a cluster manager controls the distribution of and interaction with RDDs, distributes code, and manages fault-tolerant execution. Spark works with three types of cluster managers: standalone, Apache Mesos, and Hadoop YARN. The Spark page at http://spark.apache.org/docs/latest/cluster-overview.html has a lot more details on this; I have just given you a quick introduction here.
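
To make the RDD idea concrete, here is a minimal sketch of a Spark shell session (the shell itself is covered in the next chapter); the numbers are made up purely for illustration:

$ bin/spark-shell
scala> val data = sc.parallelize(1 to 100000)   // a parallelized collection (an RDD)
scala> val evens = data.filter(_ % 2 == 0)      // a transformation; evaluated lazily
scala> evens.count()                            // an action; this triggers the distributed computation
res0: Long = 50000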

Tip

If you have Hadoop 2.0 installed, it is recommended that you run Spark on YARN. If you have Hadoop 1.0 installed, the standalone version is recommended. If you want to try Mesos, you can choose to install Spark on Mesos. It is not recommended to install both YARN and Mesos on the same cluster.

The Spark driver program takes the program classes and hands them over to a cluster manager. The cluster manager, in turn, starts executors in multiple worker nodes, each having a set of tasks. When we ran the example program earlier, all these actions happened transparently on your machine! Later, when we install Spark on a cluster, the examples will run, again transparently, but across multiple machines in the cluster. That is the magic of Spark and distributed computing!

 

A single machine


A single machine is the simplest use case for Spark. It is also a great way to sanity check your build. In the spark/bin directory, there is a shell script called run-example, which can be used to launch a Spark job. The run-example script takes the name of a Spark class and some arguments. Earlier, we used the run-example script from the /bin directory to calculate the value of Pi. There is a collection of sample Spark jobs in examples/src/main/scala/org/apache/spark/examples/.

All of the sample programs take the parameter master (the cluster manager), which can be the URL of a distributed cluster or local[N], where N is the number of threads.

Going back to our run-example script, it invokes the more general bin/spark-submit script. For now, let's stick with the run-example script.
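
For reference, the equivalent direct spark-submit invocation looks roughly like the following. In the prebuilt distributions the examples JAR lives under lib/, and its exact name depends on the Spark and Hadoop versions you downloaded, so adjust the path to match your installation:

bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master "local[4]" \
  lib/spark-examples-*.jar 10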

To run GroupByTest locally, try running the following code:

bin/run-example GroupByTest

It should produce an output like the one given here:

14/11/15 06:28:40 INFO SparkContext: Job finished: count at GroupByTest.scala:51, took 0.494519333 s
2000
 

Running Spark on EC2


The ec2 directory contains the scripts to run a Spark cluster in EC2. These scripts can be used to run multiple Spark clusters and even run on spot instances. Spark can also be run on Elastic MapReduce, which is Amazon's solution for MapReduce cluster management and gives you more flexibility around scaling instances. The Spark page at http://spark.apache.org/docs/latest/ec2-scripts.html has the latest on running Spark on EC2.

Running Spark on EC2 with the scripts

To get started, you should make sure that you have EC2 enabled on your account by signing up at https://portal.aws.amazon.com/gp/aws/manageYourAccount. Then it is a good idea to generate a separate access key pair for your Spark cluster, which you can do at https://portal.aws.amazon.com/gp/aws/securityCredentials. You will also need to create an EC2 key pair so that the Spark script can SSH to the launched machines, which can be done at https://console.aws.amazon.com/ec2/home by selecting Key Pairs under Network & Security. Remember that key pairs are created per region, so you need to make sure you create your key pair in the same region as you intend to run your Spark instances. Make sure to give it a name that you can remember, as you will need it for the scripts (this chapter will use spark-keypair as its example key pair name). You can also choose to upload your public SSH key instead of generating a new key. These are sensitive, so make sure that you keep them private. You also need to set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables for the Amazon EC2 scripts:

chmod 400 spark-keypair.pem
export AWS_ACCESS_KEY_ID=AWSAccessKeyId
export AWS_SECRET_ACCESS_KEY=AWSSecretKey

You will find it useful to download the EC2 scripts provided by Amazon from http://aws.amazon.com/developertools/Amazon-EC2/351. Once you unzip the resulting zip file, you can add the bin to your PATH in a manner similar to what you did with the Spark bin:

wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
unzip ec2-api-tools.zip
cd ec2-api-tools-*
export EC2_HOME=`pwd`
export PATH=$PATH:`pwd`/bin

In order to test whether this works, try the following commands:

$ ec2-describe-regions

This should display the following output:

REGION	eu-central-1    ec2.eu-central-1.amazonaws.com
REGION	sa-east-1       ec2.sa-east-1.amazonaws.com
REGION	ap-northeast-1  ec2.ap-northeast-1.amazonaws.com
REGION	eu-west-1       ec2.eu-west-1.amazonaws.com
REGION	us-east-1       ec2.us-east-1.amazonaws.com
REGION	us-west-1       ec2.us-west-1.amazonaws.com
REGION	us-west-2       ec2.us-west-2.amazonaws.com
REGION	ap-southeast-2  ec2.ap-southeast-2.amazonaws.com
REGION	ap-southeast-1  ec2.ap-southeast-1.amazonaws.com

Finally, you can refer to the EC2 command line tools reference page http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html as it has all the gory details.

The Spark EC2 script automatically creates a separate security group and firewall rules for running the Spark cluster. By default, your Spark cluster will be universally accessible on port 8080, which is somewhat poor form. Sadly, the spark_ec2.py script does not currently provide an easy way to restrict access to just your host. If you have a static IP address, I strongly recommend limiting access in spark_ec2.py; simply replace all instances of 0.0.0.0/0 with [yourip]/32. This will not affect intra-cluster communication, as all machines within a security group can talk to each other by default.
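
For example, assuming GNU sed (the default on most Linux distributions) and using 203.0.113.4 as a stand-in for your own static IP address, the substitution can be made in one line (on Mac OS X, use sed -i '' instead of sed -i):

sed -i 's|0\.0\.0\.0/0|203.0.113.4/32|g' ec2/spark_ec2.py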

Next, try to launch a cluster on EC2:

./ec2/spark-ec2 -k spark-keypair -i pk-[....].pem -s 1 launch myfirstcluster

Tip

If you get an error message like The requested Availability Zone is currently constrained and...., you can specify a different zone by passing in the --zone flag.

The -i parameter (in the preceding command line) is provided for specifying the private key to log into the instance; -i pk-[....].pem represents the path to the private key.

If you get an error about not being able to SSH to the master, make sure that only you have the permission to read the private key; otherwise, SSH will refuse to use it.

You may also encounter this error due to a race condition, when the hosts report themselves as alive but the spark-ec2 script cannot yet SSH to them. A fix for this issue is pending in https://github.com/mesos/spark/pull/555. For now, a temporary workaround, until the fix is available in the version of Spark you are using, is to simply sleep an extra 100 seconds at the start of setup_cluster by using the -w parameter. The current script has 120 seconds of delay built in.
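
For example (a sketch; tune the wait to your needs), you can pass a longer wait when launching the cluster:

./ec2/spark-ec2 -k spark-keypair -i spark-keypair.pem -w 300 -s 1 launch myfirstcluster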

If you do get a transient error while launching a cluster, you can finish the launch process using the resume feature by running:

./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume

It will go through a bunch of scripts to set up Spark, Hadoop, and so forth. If everything goes well, you should see something like the following screenshot:

This will give you a bare-bones cluster with one master and one worker, with all of the defaults on the default machine instance size. Next, verify that it started up and that your firewall rules were applied by going to the master on port 8080. The URLs for the master UI (on port 8080) and ganglia (on port 5080) are printed at the end of the script output, as you can see in the preceding screenshot.

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Your AWS EC2 dashboard will show the instances as follows:

The ganglia dashboard shown in the following screenshot is a good place to monitor the instances:

Try running one of the example jobs on your new cluster to make sure everything is okay, as shown in the following screenshot:

Running jps on the master should show this:

$ jps
1904 NameNode
2856 Jps
2426 Master
2078 SecondaryNameNode

The script has started the Spark master, the Hadoop namenode, and the datanodes (on the slaves).

Let's run the two programs that we ran earlier on our local machine:

cd spark
bin/run-example GroupByTest
bin/run-example SparkPi 10

The ease with which one can spin up a few nodes in the Cloud, install the Spark stack, and run the program in a distributed manner is interesting.

The ec2/spark-ec2 destroy <cluster name> command will terminate the instances.

Now that you've run a simple job on your EC2 cluster, it's time to configure the EC2 cluster for your Spark jobs. There are a number of options you can configure with the spark-ec2 script.

The ec2/spark-ec2 --help command will display all the options available.

First, consider what instance types you may need. EC2 offers an ever-growing collection of instance types and you can choose a different instance type for the master and the workers. The instance type has the most obvious impact on the performance of your Spark cluster. If your work needs a lot of RAM, you should choose an instance with more RAM. You can specify the instance type with --instance-type= (name of instance type). By default, the same instance type will be used for both the master and the workers; this can be wasteful if your computations are particularly intensive and the master isn't being heavily utilized. You can specify a different master instance type with --master-instance-type= (name of instance).
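
Putting these options together, a launch command might look like the following sketch; m3.xlarge and m3.medium are just illustrative choices, so pick instance types that match your workload:

./ec2/spark-ec2 -k spark-keypair -i spark-keypair.pem -s 2 \
  --instance-type=m3.xlarge --master-instance-type=m3.medium \
  launch myfirstcluster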

EC2 also has GPU instance types, which can be useful for workers but would be completely wasted on the master. This text will cover working with Spark and GPUs later on; however, it is important to note that EC2 GPU performance may be lower than what you get while testing locally due to the higher I/O overhead imposed by the hypervisor.

Spark's EC2 scripts use Amazon Machine Images (AMI) provided by the Spark team. These are usually current and sufficient for most applications. You might need your own AMI under certain circumstances, such as custom patches for Spark (for example, using a different version of HDFS), as these will not be included in the machine image.

Deploying Spark on Elastic MapReduce

In addition to the basic EC2 machine offering, Amazon offers a hosted MapReduce solution called Elastic MapReduce (EMR). Amazon provides a bootstrap script that simplifies the process of getting started with Spark on EMR. You will need to install the EMR tools from Amazon:

mkdir emr
cd emr
wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
unzip *.zip

So that the EMR scripts can access your AWS account, you will want to create a credentials.json file:

 {
    "access-id": "<Your AWS access id here>",
    "private-key": "<Your AWS secret access key here>",
    "key-pair": "<The name of your ec2 key-pair here>",
    "key-pair-file": "<path to the .pem file for your ec2 key pair here>",
    "region": "<The region where you wish to launch your job flows (e.g us-east-1)>"
  }

Once you have the EMR tools installed, you can launch a Spark cluster by running:

elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
--bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh \
--bootstrap-name "install Mesos/Spark/Shark" \
--ami-version 2.0  \
--instance-type m1.large --instance-count 2

This will give you a running EMR cluster after about 5 to 10 minutes. You can list the status of the cluster by running elastic-mapreduce --list. Once it outputs j-[jobid], it is ready.

 

Deploying Spark with Chef (Opscode)


Chef is an open source automation platform that has become increasingly popular for deploying and managing both small and large clusters of machines. Chef can be used to control a traditional static fleet of machines, and it can also be used with EC2 and other cloud providers. Chef uses cookbooks as the basic building blocks of configuration; these can be either generic or site-specific. If you have not used Chef before, a good tutorial for getting started can be found at https://learnchef.opscode.com/. You can use a generic Spark cookbook as the basis for setting up your cluster.

To get Spark working, you need to create a role for both the master and the workers, as well as configure the workers to connect to the master. Start by getting the cookbook from https://github.com/holdenk/chef-cookbook-spark. The bare minimum is setting the master hostname (as master) so that the worker nodes can connect, and the username, so that Spark is installed in the correct place. You will also need to either accept Sun's Java license or switch to an alternative JDK. Most of the settings that are available in spark-env.sh are also exposed through the cookbook settings. You can see an explanation of the settings in the section on configuring multiple hosts over SSH. The settings can be set per role, or you can modify the global defaults.

Create a role for the master with knife role create spark_master_role -e [editor]. This will bring up a template role file that you can edit. For a simple master, set it to this:

{
  "name": "spark_master_role",
  "description": "",
  "json_class": "Chef::Role",
  "default_attributes": {
  },
  "override_attributes": {
    "username": "spark",
    "group": "spark",
    "home": "/home/spark/sparkhome",
    "master_ip": "10.0.2.15"
  },
  "chef_type": "role",
  "run_list": [
    "recipe[spark::server]",
    "recipe[chef-client]"
  ],
  "env_run_lists": {
  }
}

Then create a role for the client in the same manner except that instead of spark::server, you need to use the spark::client recipe. Deploy the roles to different hosts:

knife node run_list add master role[spark_master_role]
knife node run_list add worker role[spark_worker_role]

Then run chef-client on your nodes to update. Congrats, you now have a Spark cluster running!

 

Deploying Spark on Mesos


Mesos is a cluster management platform for running multiple distributed applications or frameworks on a cluster. Mesos can intelligently schedule and run Spark, Hadoop, and other frameworks concurrently on the same cluster. Spark can be run on Mesos either by scheduling individual jobs as separate Mesos tasks or by running all of Spark as a single Mesos task. Mesos can quickly scale up to handle large clusters, beyond the size that you would want to manage with plain old SSH scripts. Mesos, written in C++, was originally created at UC Berkeley as a research project; it is now a top-level Apache project and is actively used by Twitter.

The Spark web page has detailed instructions on installing and running Spark on Mesos.

To get started with Mesos, you can download the latest version from http://mesos.apache.org/downloads/ and unpack it. Mesos has a number of different configuration scripts you can use; for an Ubuntu installation use configure.ubuntu-lucid-64 and for other cases, the Mesos README file will point you at the configuration file you need to use. In addition to the requirements of Spark, you will need to ensure that you have the Python C header files installed (python-dev on Debian systems) or pass --disable-python to the configure script. Since Mesos needs to be installed on all the machines, you may find it easier to configure Mesos to install somewhere other than on the root, most easily alongside your Spark installation:

./configure --prefix=/home/sparkuser/mesos && make && make check && make install

Much like the configuration of Spark in standalone mode, with Mesos you need to make sure that the different Mesos nodes can find each other. Start by setting mesosprefix/var/mesos/deploy/masters to the hostname of the master and adding each worker hostname to mesosprefix/var/mesos/deploy/slaves. Then you will want to point the workers at the master (and possibly set some other values) in mesosprefix/var/mesos/conf/mesos.conf.

Once you have Mesos built, it's time to configure Spark to work with Mesos. This is as simple as copying conf/spark-env.sh.template to conf/spark-env.sh and updating MESOS_NATIVE_LIBRARY to point to the path where Mesos is installed. You can find more information about the different settings in spark-env.sh in the first table of the Spark Standalone mode section later in this chapter.
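
A minimal sketch of that edit, assuming Mesos was installed under /home/sparkuser/mesos as in the configure command shown earlier (the library file name can vary by platform, for example, libmesos.dylib on Mac OS X):

cp conf/spark-env.sh.template conf/spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/home/sparkuser/mesos/lib/libmesos.so" >> conf/spark-env.sh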

You will need to install both Mesos and Spark on all of the machines in your cluster. Once both Mesos and Spark are configured, you can copy the build to all of the machines using pscp, as shown in the following command:

pscp -v -r -h  -l sparkuser ./mesos /home/sparkuser/mesos

You can then start your Mesos clusters using mesosprefix/sbin/mesos-start-cluster.sh and schedule your Spark on Mesos by using mesos://[host]:5050 as the master.
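
For example, once the Mesos cluster is up, you can point the Spark shell at it as follows (mesosmaster.example.com is a placeholder for your Mesos master's hostname):

bin/spark-shell --master mesos://mesosmaster.example.com:5050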

 

Spark on YARN


YARN is Apache Hadoop's NextGen MapReduce. The Spark project provides an easy way to schedule jobs on YARN once you have a Spark assembly built; we enabled this earlier by compiling with the -Pyarn switch. The Spark web page http://spark.apache.org/docs/latest/running-on-yarn.html has the configuration details for YARN. It is important that the Spark job you create uses a standalone master URL. The example Spark applications all read the master URL from the command-line arguments, so specify --args standalone.

To run the same example as given in the SSH section, write the following commands:

sbt/sbt assembly #Build the assembly
SPARK_JAR=./core/target/spark-core-assembly-1.1.0.jar ./run spark.deploy.yarn.Client --jar examples/target/scala-2.9.2/spark-examples_2.9.2-0.7.0.jar --class spark.examples.GroupByTest --args standalone --num-workers 2 --worker-memory 1g --worker-cores 1
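
With Spark 1.x builds, jobs are more commonly submitted to YARN through spark-submit. A minimal sketch, assuming that the assembly was built with the -Pyarn switch as shown earlier, that HADOOP_CONF_DIR points at your cluster's Hadoop configuration, and that the examples JAR path matches your build, looks like this:

export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 2 --executor-memory 1g --executor-cores 1 \
  lib/spark-examples-*.jar 10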
 

Spark Standalone mode


If you have a set of machines without any existing cluster management software, you can deploy Spark over SSH with some handy scripts. This method is known as "standalone mode" in the Spark documentation at http://spark.apache.org/docs/latest/spark-standalone.html. An individual master and the workers can be started by sbin/start-master.sh and sbin/start-slaves.sh respectively. The master's web UI is on port 8080 by default. As you likely don't want to go to each of your machines and run these commands by hand, there are a number of helper scripts in sbin/ to help you run your servers.

A prerequisite for using any of the scripts is having password-less SSH access set up from the master to all of the worker machines. You probably want to create a new user for running Spark on the machines and lock it down. This book uses the username "sparkuser". On your master, you can run ssh-keygen to generate the SSH keys and make sure that you do not set a password. Once you have generated the key, add the public one (if you generated an RSA key, it would be stored in ~/.ssh/id_rsa.pub by default) to ~/.ssh/authorized_keys2 on each of the hosts.
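
A minimal sketch of those two steps, using worker1 as a placeholder hostname (repeat the second command for each worker host):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub | ssh sparkuser@worker1 'cat >> ~/.ssh/authorized_keys2'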

Tip

The Spark administration scripts require that your usernames match. If this isn't the case, you can configure an alternative username in your ~/.ssh/config.

Now that you have the SSH access to the machines set up, it is time to configure Spark. There is a simple template in conf/spark-env.sh.template, which you should copy to conf/spark-env.sh. You will need to set SCALA_HOME to the path where you extracted Scala. You may also find it useful to set some (or all) of the following environment variables (a sample spark-env.sh is sketched after the table):

MESOS_NATIVE_LIBRARY - Points to the path where Mesos lives (default: none)
SCALA_HOME - Points to where you extracted Scala (no default; must be set)
SPARK_MASTER_IP - The IP address for the master to listen on and for the workers to connect to (default: the result of running hostname)
SPARK_MASTER_PORT - The port the Spark master listens on (default: 7077)
SPARK_MASTER_WEBUI_PORT - The port of the master's web UI (default: 8080)
SPARK_WORKER_CORES - The number of cores to use (default: all of them)
SPARK_WORKER_MEMORY - How much memory to use (default: the maximum of system memory minus 1 GB, and 512 MB)
SPARK_WORKER_PORT - The port the worker runs on (default: random)
SPARK_WORKER_WEBUI_PORT - The port the worker's web UI runs on (default: 8081)
SPARK_WORKER_DIR - Where to store files from the worker (default: SPARK_HOME/work)
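
A minimal sketch of a conf/spark-env.sh using some of these variables (the paths and sizes here are placeholders; adjust them for your own hosts):

export SCALA_HOME=/opt/scala
export SPARK_MASTER_IP=10.0.2.15
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_DIR=/var/spark/work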

Once you have your configuration done, it's time to get your cluster up and running. You will want to copy the built version of Spark and the configuration to all of your machines. You may find it useful to install pssh, a set of parallel SSH tools that includes pscp. pscp makes it easy to scp files to a number of target hosts, although it will take a while, as shown here:

pscp -v -r -h conf/slaves -l sparkuser ../opt/spark ~/

If you end up changing the configuration, you need to distribute the configuration to all of the workers, as shown here:

pscp -v -r -h conf/slaves -l sparkuser conf/spark-env.sh /opt/spark/conf/spark-env.sh

Tip

If you use a shared NFS on your cluster, you should configure a separate worker directory for each host; otherwise, since Spark names log files and similar output with shared names by default, the workers will end up writing to the same place. If you want to have your worker directories on the shared NFS, consider appending `hostname`, for example, SPARK_WORKER_DIR=~/work-`hostname`.

You should also consider having your log files go to a scratch directory for performance.

Then you are ready to start the cluster, and you can use the sbin/start-all.sh, sbin/start-master.sh, and sbin/start-slaves.sh scripts. It is important to note that start-all.sh and start-master.sh both assume that they are being run on the node that is the master for the cluster. The start scripts all daemonize, so you don't have to worry about running them in a screen session:

ssh master sbin/start-all.sh

If you get a class not found error stating "java.lang.NoClassDefFoundError: scala/ScalaObject", check to make sure that you have Scala installed on that worker host and that the SCALA_HOME is set correctly.

Tip

The Spark scripts assume that your master has Spark installed in the same directory as your workers. If this is not the case, you should edit sbin/spark-config.sh and set it to the appropriate directories.

The commands provided by Spark to help you administer your cluster are given in the following table. More details are available in the Spark website at http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts.

sbin/slaves.sh <command> - Runs the provided command on all of the worker hosts. For example, sbin/slaves.sh uptime will show how long each of the worker hosts has been up.
sbin/start-all.sh - Starts the master and all of the worker hosts. Must be run on the master.
sbin/start-master.sh - Starts the master host. Must be run on the master.
sbin/start-slaves.sh - Starts the worker hosts.
sbin/start-slave.sh - Starts a specific worker.
sbin/stop-all.sh - Stops the master and the workers.
sbin/stop-master.sh - Stops the master.
sbin/stop-slaves.sh - Stops all the workers.

You now have a running Spark cluster, as shown in the following screenshot! There is a handy web UI running on the master on port 8080 that you should visit, and one on each of the workers on port 8081. The web UI contains such helpful information as the current workers, and current and past jobs.

Now that you have a cluster up and running, let's actually do something with it. As with the single host example, you can use the provided run-example script to run Spark commands. All of the examples listed in examples/src/main/scala/org/apache/spark/examples/ take a parameter, master, which points them to the master. Assuming that you are on the master host, you could run them like this:

bin/run-example GroupByTest spark://`hostname`:7077

Tip

If you run into an issue with java.lang.UnsupportedClassVersionError, you may need to update your JDK or recompile Spark if you grabbed the binary version. Version 1.1.0 was compiled with JDK 1.7 as the target. You can check the version of the JRE targeted by Spark with the following commands:

java -verbose -classpath ./core/target/scala-2.10/classes/ \
org.apache.spark.SparkFiles | head -n 20

Version 49 is JDK 1.5, Version 50 is JDK 1.6, and Version 51 is JDK 1.7.

If you can't connect to localhost, make sure that you've configured your master (spark.driver.port) to listen on all of the IP addresses (or, if you don't want that, replace localhost with the IP address it is configured to listen on). More port configurations are listed at http://spark.apache.org/docs/latest/configuration.html#networking.

If everything has worked correctly, you will see the following log messages output to stdout:

13/03/28 06:35:31 INFO spark.SparkContext: Job finished: count at GroupByTest.scala:35, took 2.482816756 s
2000


 

Summary


In this chapter, we have installed Spark on our machine for local development and set it up on our cluster, so we are ready to run the applications that we write. While installing and maintaining a cluster is a good option, Spark is also available as a service from Databricks. Databricks' upcoming Databricks Cloud for Spark, available at http://databricks.com/product, is a very convenient offering for anyone who does not want to deal with the setup and maintenance of a cluster. They have the concept of a big data pipeline, from ETL to analytics. This looks truly interesting to explore!

In the next chapter, you will learn to use the Spark shell.

About the Authors

  • Krishna Sankar

    Krishna Sankar is a Senior Specialist (AI Data Scientist) with Volvo Cars, focusing on autonomous vehicles. His earlier stints include Chief Data Scientist at http://cadenttech.tv/, Principal Architect/Data Scientist at Tata America Intl. Corp., Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata SJC and Strata London 2016, Spark Summit, Strata-Spark Camp, OSCON, PyCon, and PyData; he writes about Robots Rules of Order, Big Data Analytics - Best of the Worst, predicting NFL games, Spark, data science, machine learning, and social media analysis, and has been a guest lecturer at the Naval Postgraduate School. His occasional blogs can be found at https://doubleclix.wordpress.com/. His other passions are flying drones (he is working towards a drone pilot license (FAA UAS Pilot)) and Lego Robotics; you will find him at the St. Louis FLL World Competition as a Robots Design Judge.

  • Holden Karau

    Holden Karau is a software development engineer and is active in open source. She has worked on a variety of search, classification, and distributed systems problems at IBM, Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics degree in computer science. Other than software, she enjoys playing with fire, hula hooping, and welding.

