Introduction to Apache Spark
Apache Spark is an open-source framework for processing large datasets stored in heterogeneous data stores in an efficient and fast way. Sophisticated analytical algorithms can be executed easily on these large datasets. Spark can run a distributed program up to 100 times faster than MapReduce for certain workloads, particularly those that fit in memory. As one of the fastest-growing projects in the open-source community, Spark also provides a large number of libraries to its users.
We shall cover the following topics in this chapter:
- A brief introduction to Spark
- Spark architecture and the different languages that can be used for coding Spark applications
- Spark components and how these components can be used together to solve a variety of use cases
- A comparison between Spark and Hadoop
What is Spark?
Apache Spark is a distributed computing framework that makes big-data processing easy, fast, and scalable. You must be wondering what makes Spark so popular in the industry, and how it really differs from the existing tools available for big-data processing. The reason is that it provides a unified stack for processing all different kinds of big data, be it batch, streaming, machine learning, or graph workloads.
Spark was developed at UC Berkeley's AMPLab in 2009, was open-sourced in 2010, and later came under the Apache umbrella in 2013. The framework is mainly written in Scala and Java.
Spark provides interfaces to many different distributed and non-distributed data stores, such as the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, Amazon S3, and Kudu. It also provides a wide variety of language APIs for performing analytics on the data stored in these data stores; these APIs include Scala, Java, Python, and R.
The basic abstraction in Spark is the Resilient Distributed Dataset (RDD), a read-only, partitioned collection of data. An RDD can be created from data stored in external data stores or by transforming an existing RDD, as the sketch below illustrates. We shall discuss this in more detail in Chapter 3, Spark RDD.
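The following is a minimal sketch of both ways of creating an RDD using the Scala API; the application name, master URL, and file paths are placeholder values chosen for illustration only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    // "local[*]" and the paths below are placeholder values for illustration
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD created from data stored in an external data store (HDFS here;
    // an S3 path such as "s3a://bucket/sample.txt" would work the same way)
    val lines = sc.textFile("hdfs:///data/sample.txt")

    // An RDD created from an existing RDD through a transformation
    val upper = lines.map(_.toUpperCase)

    println(upper.count())
    sc.stop()
  }
}
```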
Spark needs a resource manager to distribute and execute its tasks. By default, Spark ships with its own standalone scheduler, but it also integrates easily with Apache Mesos and Yet Another Resource Negotiator (YARN) for cluster resource management and task execution.
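As a small REPL-style sketch, the cluster manager is selected through the master URL passed to Spark; the host names and ports below are placeholder values:

```scala
import org.apache.spark.SparkConf

// The master URL passed to Spark decides which resource manager runs the job.
// The host names and ports below are placeholder values.
val standalone = new SparkConf().setAppName("app").setMaster("spark://master-host:7077")
val mesos      = new SparkConf().setAppName("app").setMaster("mesos://mesos-host:5050")
val yarn       = new SparkConf().setAppName("app").setMaster("yarn")
val local      = new SparkConf().setAppName("app").setMaster("local[*]") // single-machine testing
```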
One of Spark's main features is its ability to keep large amounts of data in memory for faster execution. It also has a component that builds a Directed Acyclic Graph (DAG) of operations from the user program. We shall discuss these in more detail in the coming chapters.
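As an illustrative sketch (assuming an existing SparkContext and a placeholder log path), calling cache() asks Spark to keep an RDD in memory across actions, and toDebugString prints the lineage of operations Spark has recorded:

```scala
// Assumes an existing SparkContext named sc (for example, in the spark-shell);
// the log path is a placeholder.
val logs   = sc.textFile("hdfs:///logs/app.log")
val errors = logs.filter(_.contains("ERROR")).cache() // ask Spark to keep this RDD in memory

println(errors.count())        // first action: computes the RDD and caches it
println(errors.count())        // subsequent actions are served from memory
println(errors.toDebugString)  // prints the lineage (DAG of operations) behind this RDD
```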
The following diagram shows some of the popular data stores Spark can connect to:

Spark architecture overview
Spark follows a master-slave architecture, which allows it to scale on demand. Spark's architecture has two main components:
- Driver Program: A driver program is where a user writes Spark code using the Scala, Java, Python, or R APIs. It is responsible for launching various parallel operations on the cluster.
- Executor: An executor is a Java Virtual Machine (JVM) process that runs on a worker node of the cluster. Executors provide the hardware resources for running the tasks launched by the driver program.
As soon as a Spark job is submitted, the driver program launches various operations on each executor. The driver and the executors together make up an application, as the sketch below illustrates.
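The following is a minimal sketch of a driver program; the application name and data are illustrative only. The code in main runs in the driver process, while the parallel operations are shipped as tasks to the executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SumApp {
  def main(args: Array[String]): Unit = {
    // Everything in main runs inside the driver process
    val conf = new SparkConf().setAppName("sum-app")
    val sc = new SparkContext(conf)

    // The map and reduce operations are shipped as tasks to the executors
    val numbers = sc.parallelize(1 to 1000000)
    val total = numbers.map(_.toLong).reduce(_ + _)

    println(s"Total = $total")  // the result is returned to the driver
    sc.stop()
  }
}
```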
The following diagram demonstrates the relationships between the Driver, Workers, and Executors. As the first step, the driver process parses the user code (the Spark program) and creates multiple executors on the worker nodes. The driver process not only forks the executors on the worker machines, but also sends tasks to these executors so that the entire application runs in parallel.
Once the computation is completed, the output is either sent back to the driver program or saved to the file system:

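As a small hedged sketch of these two outcomes (assuming an existing RDD named results; the output path is a placeholder), collect() returns the data to the driver, while saveAsTextFile() writes it to the file system:

```scala
// Assumes an existing RDD named results; the output path is a placeholder.
val fetched = results.collect()                   // brings the computed data back to the driver
fetched.take(10).foreach(println)

results.saveAsTextFile("hdfs:///output/results")  // or write the data out to the file system
```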