Writing and configuring a Spark program
Satisfied with our experiment in the shell, let's now write our first Spark program. Open your favorite text editor, create a new file named simpleGraph.scala, and put it in the folder $SPARKHOME/exercises/chap1. A template for a Spark program looks like the following code:
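A minimal sketch of such a template, assuming the object name SimpleGraphApp and the "Tiny Social" settings described later in this section:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

// Top-level object with the required main entry point
object SimpleGraphApp {
  def main(args: Array[String]): Unit = {
    // Configure and create the Spark context
    val conf = new SparkConf()
      .setAppName("Tiny Social")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Load the data, build the graph, and apply operations here
  }
}
```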
You can also see the entire code of our SimpleGraph.scala
file in the example files, which you can download from the Packt website.
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Let's go over this code to understand what is required to create and configure a Spark standalone program in Scala.
As a Scala program, our Spark application should be constructed within a top-level Scala object, which must have a main function with the signature def main(args: Array[String]): Unit. In other words, the main program accepts an array of strings as a parameter and returns nothing. In our example, the top-level object is SimpleGraphApp.
At the beginning of simpleGraph.scala
, we have put the following import statements:
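Based on the description that follows, these statements would be:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
```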
The first two lines import the SparkContext class as well as some implicit conversions defined in its companion object. You do not need to know the details of these implicit conversions; just make sure you import both SparkContext and SparkContext._
Note
When we worked in the Spark shell, SparkContext and SparkContext._ were imported automatically for us.
The third line imports SparkConf
, which is a wrapper class that contains the configuration settings of a Spark application, such as its application name, the memory size of each executor, and the address of the master or cluster manager.
Next, we have imported some RDD and GraphX-specific class constructors and operators with these lines:
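Based on that description, these import lines would be:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._
```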
The underscore after org.apache.spark.graphx
makes sure that all public APIs in GraphX get imported.
Within main, we first had to configure the Spark program. To do this, we created a SparkConf object and set the application settings through a chain of setter methods on it. SparkConf provides specific setters for some common properties, such as the application name or the master. Alternatively, a generic set method sets any other property as a key-value pair, and setAll sets several properties at once from a sequence of key-value pairs. The most common configuration parameters are listed in the following table with their default values and usage. The full list can be found at https://spark.apache.org/docs/latest/configuration.html:
In our example, we initialized the program as follows:
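A sketch of this initialization, matching the settings described next (the application name "Tiny Social", a local master, and 2 GB of driver memory):

```scala
// Chain the setter methods on a single SparkConf object
val conf = new SparkConf()
  .setAppName("Tiny Social")
  .setMaster("local")
  .set("spark.driver.memory", "2g")
val sc = new SparkContext(conf)
```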
Specifically, we set the name of our application to "Tiny Social" and the master to be the local machine on which we submit the application. In addition, the driver memory is set to 2 GB. Had we set the master to be a cluster instead of local, we would specify the memory per executor by setting spark.executor.memory instead of spark.driver.memory.
Note
In principle, the driver and executors have different roles and, in general, they run in different processes, except when we set the master to be local. The driver is the process that compiles our program into tasks, schedules these tasks to one or more executors, and maintains the physical location of every RDD. Each executor is responsible for executing the tasks, and for storing and caching RDDs in memory.
It is not mandatory to set the Spark application settings in the SparkConf
object inside your program. Alternatively, when submitting our application, we could set these parameters as command-line options of the spark-submit
tool. We will cover that part in detail in the following sections. In this case, we will just create our SparkContext
object as:
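In that case, the context creation reduces to an empty configuration; a sketch:

```scala
// All settings are supplied as command-line options of spark-submit
val sc = new SparkContext(new SparkConf())
```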
After configuring the program, the next task is to load the data that we want to process by calling utility methods such as sc.textFile
on the SparkContext
object sc
:
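A sketch of such a loading step; the file name people.csv is a hypothetical example, not taken from the original listing:

```scala
// Load each line of a hypothetical input file as a string element of an RDD
// (the path is relative to the project root folder, per the convention below)
val people: RDD[String] = sc.textFile("./data/people.csv")
```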
Finally, the rest of the program consists of the same operations on RDDs and graphs that we have used in the shell.
Note
To avoid confusion when passing a relative file path to I/O actions such as sc.textFile()
, the convention used in this book is that the current directory of the command line is always set to the project root folder. For instance, if our Tiny Social app's root folder is $SPARKHOME/exercises/chap1
, then Spark will look for the data to be loaded in $SPARKHOME/exercises/chap1/data
. This assumes that we have put the files in that data
folder.
Building the program with the Scala Build Tool
After writing our entire program, we are going to build it using the Scala Build Tool (SBT). If you do not have SBT installed on your computer yet, you need to install it first. Detailed instructions on how to install SBT are available at http://www.scala-sbt.org/0.13/tutorial/index.html for most operating systems. While there are different ways to install SBT, I recommend using a package manager instead of the manual installation. After the installation, execute the following command to append the SBT installation folder to the PATH
environment variable:
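A sketch of such a command, assuming a Unix-like shell; /path/to/sbt is a placeholder for the actual installation folder:

```shell
# Append the SBT bin folder to the PATH environment variable (placeholder path)
export PATH=$PATH:/path/to/sbt/bin
```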
Once we have SBT properly installed, we can use it to build our application with all its dependencies inside a single JAR package file, also called an uber JAR. In fact, when running a Spark application on several worker machines, an error will occur if some of these machines do not have the required dependency JARs.
By packaging an uber jar with SBT, the application code and its dependencies are all distributed to the workers. Concretely, we need to create a build definition file in which we set the project settings. Moreover, we must specify the dependencies and the resolvers that help SBT find the packages that are needed by our program. A resolver indicates the name and location of the repository that has the required JAR file. Let's create a file called build.sbt
in the project root folder $SPARKHOME/exercises/chap1
and insert the following lines:
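A sketch of such a build definition; the project name, version numbers, and resolver are illustrative assumptions (the Spark version in particular), chosen to be consistent with the simple-project_2.10-1.0.jar artifact mentioned at the end of this section:

```scala
// build.sbt -- each setting is a Scala expression, separated by blank lines

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// The Spark version 1.1.0 is an illustrative assumption
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-graphx" % "1.1.0"
)

// Example resolver; Maven Central is usually sufficient for Spark artifacts
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```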
By convention, the settings are defined by Scala expressions and they need to be delimited by blank lines. Earlier, we set the project name, its version number, the version of Scala, as well as the Spark library dependencies. To build the program, we then enter the command:
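Assuming the standard SBT packaging task, the command would be:

```shell
sbt package
```

Note that sbt package builds a JAR containing only our own application classes; producing a genuine uber JAR that also bundles the dependencies usually requires the sbt-assembly plugin and its sbt assembly task.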
This will create a JAR file inside $SPARKHOME/exercises/chap1/target/scala-2.10/simple-project_2.10-1.0.jar
.