Scala and Spark for Big Data Analytics


Product type Book
Published in Jul 2017
Publisher Packt
ISBN-13 9781785280849
Pages 796
Edition 1st Edition
Authors (2): Md. Rezaul Karim, Sridhar Alla

Table of Contents (19 chapters)

  • Preface
  • Introduction to Scala
  • Object-Oriented Scala
  • Functional Programming Concepts
  • Collection APIs
  • Tackle Big Data – Spark Comes to the Party
  • Start Working with Spark – REPL and RDDs
  • Special RDD Operations
  • Introduce a Little Structure - Spark SQL
  • Stream Me Up, Scotty - Spark Streaming
  • Everything is Connected - GraphX
  • Learning Machine Learning - Spark MLlib and Spark ML
  • My Name is Bayes, Naive Bayes
  • Time to Put Some Order - Cluster Your Data with Spark MLlib
  • Text Analytics Using Spark ML
  • Spark Tuning
  • Time to Go to ClusterLand - Deploying Spark on a Cluster
  • Testing and Debugging Spark
  • PySpark and SparkR

Time to Go to ClusterLand - Deploying Spark on a Cluster

"I see the moon like a clipped piece of silver. Like gilded bees, the stars cluster around her"

- Oscar Wilde

In the previous chapters, we saw how to develop practical applications using the different Spark APIs. In this chapter, we will see how Spark works in cluster mode and examine its underlying architecture. Finally, we will see how to deploy a full Spark application on a cluster. In a nutshell, the following topics will be covered throughout this chapter:

  • Spark architecture in a cluster
  • Spark ecosystem and cluster management
  • Deploying Spark on a cluster
  • Deploying Spark on a standalone cluster
  • Deploying Spark on a Mesos cluster
  • Deploying Spark on a YARN cluster
  • Cloud-based deployment
  • Deploying Spark on AWS

Spark architecture in a cluster

The Hadoop-based MapReduce framework has been widely used for the last few years; however, it suffers from issues with I/O, algorithmic complexity, low-latency streaming jobs, and its fully disk-based operation. Hadoop provides the Hadoop Distributed File System (HDFS) for storing big data cheaply and computing over it efficiently, but with the Hadoop-based MapReduce framework you can only perform computations in a high-latency batch model over static data. The main big data paradigm shift that Spark has brought us is the introduction of in-memory computing and a caching abstraction. This makes Spark ideal for large-scale data processing and enables the computing nodes to perform multiple operations by accessing the same input data.

Spark's Resilient Distributed Dataset (RDD) model can do everything that the MapReduce paradigm can, and even more. Nevertheless, Spark...
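As a minimal sketch of the in-memory caching abstraction described above (assuming a local Spark environment with `spark-core` on the classpath; the application name and object name are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores; "CachingSketch" is an arbitrary name
    val conf = new SparkConf().setAppName("CachingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)

    // cache() keeps the RDD in memory after the first action computes it,
    // so later actions reuse the same data without recomputing from scratch
    val evens = numbers.filter(_ % 2 == 0).cache()

    val count = evens.count() // first action: computes and caches the RDD
    val total = evens.sum()   // second action: served from the in-memory cache

    println(s"count = $count, sum = $total")
    sc.stop()
  }
}
```

Because transformations such as `filter` are lazy, nothing is computed until the first action runs; without `cache()`, each action would re-read and re-filter the input.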

Deploying the Spark application on a cluster

In this section, we will discuss how to deploy Spark jobs on a computing cluster. We will see how to deploy applications in three deploy modes: standalone, YARN, and Mesos. The following figure summarizes the terms needed to refer to cluster concepts in this chapter:

Figure 8: Terms that are needed to refer to cluster concepts (source: http://spark.apache.org/docs/latest/cluster-overview.html#glossary)

However, before diving deeper, we need to know how to submit a Spark job in general.

Submitting Spark jobs

Once a Spark application is bundled as either a JAR file (written in Scala or Java) or a Python file, it can be submitted using the spark-submit script located under the...
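As a sketch of a typical invocation (the JAR name, main class, and master URL are illustrative; the spark-submit script ships in the bin directory of a Spark distribution):

```shell
# Submit a Scala/Java application packaged as a JAR to a standalone cluster.
# --class names the application's entry point; --master selects the cluster manager.
$SPARK_HOME/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --total-executor-cores 4 \
  myapp-assembly-1.0.jar arg1 arg2
```

Changing only the `--master` URL (for example, `yarn` or `mesos://host:5050`) retargets the same application at a different cluster manager, which is what makes the deploy modes in this section largely interchangeable from the application's point of view.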

Summary

In this chapter, we discussed how Spark works in cluster mode and examined its underlying architecture. You also saw how to deploy a full Spark application on a cluster, and how to set up clusters for running Spark applications in different modes such as local, standalone, YARN, and Mesos. Finally, you saw how to configure a Spark cluster on AWS using the EC2 script. We believe that this chapter will help you gain a good understanding of Spark. Nevertheless, due to space limitations, we could not cover many APIs and their underlying functionality.

If you face any issues, please don't forget to report them to the Spark user mailing list at user@spark.apache.org. Before doing so, make sure that you have subscribed to it. In the next chapter, you will see how to test and debug Spark applications.
