Apache Spark is an open source framework for processing large datasets stored in heterogeneous data stores in an efficient and fast way. Sophisticated analytical algorithms can be easily executed on these large datasets. Spark can execute a distributed program up to 100 times faster than MapReduce. As Spark is one of the fastest-growing projects in the open source community, it provides a large number of libraries to its users.
We shall cover the following topics in this chapter:
- A brief introduction to Spark
- Spark architecture and the different languages that can be used for coding Spark applications
- Spark components and how these components can be used together to solve a variety of use cases
- A comparison between Spark and Hadoop
Apache Spark is a distributed computing framework that makes big data processing quite easy, fast, and scalable. You must be wondering what makes Spark so popular in the industry, and how it really differs from the existing tools for big data processing. The reason is that it provides a unified stack for processing all the different kinds of big data, be it batch, streaming, machine learning, or graph data.
Spark was developed at UC Berkeley's AMPLab in 2009 and later came under the Apache umbrella in 2010. The framework is mainly written in Scala and Java.
Spark provides an interface with many different distributed and non-distributed data stores, such as Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, Amazon S3, and Kudu. It also provides a wide variety of language APIs to perform analytics on the data stored in these data stores. These APIs include Scala, Java, Python, and R.
The basic entity of Spark is the Resilient Distributed Dataset (RDD), which is a read-only, partitioned collection of data. An RDD can be created from data stored in different data stores or from an existing RDD. We shall discuss this in more detail in Chapter 3, Spark RDD.
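RDD operations in Spark are lazy: transformations only record what should happen, and nothing runs until an action is called. The following is a framework-free sketch of that idea in plain Python (a toy class invented for illustration, not the Spark API):

```python
# A toy, read-only "RDD"-like wrapper: transformations are recorded
# lazily and only evaluated when an action (collect) is called.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = list(data)   # immutable source data
        self._ops = ops or []     # recorded transformations

    def map(self, f):             # returns a NEW ToyRDD; the original is untouched
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):            # the action: run the whole recorded chain
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

numbers = ToyRDD([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(evens_doubled.collect())  # [4, 8]
```

Note that `numbers` is never modified; each transformation produces a new dataset description, which mirrors the read-only nature of RDDs.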
Spark needs a resource manager to distribute and execute its tasks. By default, Spark comes with its own standalone scheduler, but it integrates easily with Apache Mesos and Yet Another Resource Negotiator (YARN) for cluster resource management and task execution.
One of the main features of Spark is its ability to keep a large amount of data in memory for faster execution. It also has a component that generates a Directed Acyclic Graph (DAG) of operations based on the user program. We shall discuss these in more detail in the coming chapters. At a high level, a Spark application consists of two main components:
- Driver Program: A driver program is where a user writes Spark code using the Scala, Java, Python, or R APIs. It is responsible for launching various parallel operations on the cluster.
- Executor: An executor is a Java Virtual Machine (JVM) process that runs on a worker node of the cluster. Executors provide the hardware resources for running the tasks launched by the driver program.
As soon as a Spark job is submitted, the driver program launches various operations on each executor. The driver and executors together make up an application.
The following diagram demonstrates the relationships between the driver, workers, and executors. As the first step, the driver process parses the user code (the Spark program) and creates multiple executors on each worker node. The driver process not only forks the executors on the worker machines, but also sends tasks to these executors to run the entire application in parallel.
Driver, Workers, and Executors
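The driver/executor split can be loosely imitated on a single machine with Python's standard library: one "driver" splits the data into partitions, hands one task per partition to a pool of "executor" workers, and combines the partial results. This is a conceptual sketch only, not how Spark is implemented:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # each "executor" computes a partial result for one partition
    return sum(partition)

# the "driver" splits the data and distributes one task per partition
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(run_task, partitions))

# the driver combines the partial results into the final answer
total = sum(partial_sums)
print(partial_sums, total)  # [6, 15, 24] 45
```

In a real cluster, the partitions live on different machines and the tasks are shipped over the network, but the driver-coordinates/executors-compute pattern is the same.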
Spark has integration with a variety of programming languages, such as Scala, Java, Python, and R. Developers can write their Spark programs in any of these languages. This freedom of language is also one of the reasons why Spark is popular among developers. Compare this to Hadoop MapReduce, where developers had only one choice, Java, which made it difficult for developers from other programming languages to work on MapReduce.
Scala is the primary language for Spark. More than 70% of Spark's code is written in Scalable Language (Scala). Scala is a fairly new language. It was developed by Martin Odersky in 2001 and first launched publicly in 2004. Like Java, Scala generates bytecode that runs on the JVM. Scala brings advantages from both the object-oriented and functional worlds: it provides a concise, expressive style of programming without compromising on type safety. As Spark is primarily written in Scala, you will find almost all of the new libraries in the Scala API first.
Most of us are familiar with Java. Java is a powerful object-oriented programming language. The majority of big data frameworks are written in Java, which provides rich libraries to connect to and process data with these frameworks.
Python is a general-purpose programming language that supports both object-oriented and functional styles. It was developed by Guido van Rossum and was first released in 1991. For some time, Python was not popular among developers, but later, around 2006-07, libraries such as Numerical Python (NumPy) and Pandas appeared, became cornerstones, and made Python popular among all types of programmers. In Spark, when the driver launches executors on worker nodes, it also starts a Python interpreter for each executor. In the case of an RDD, the data is first shipped into the JVMs and then transferred to Python; this extra serialization step makes jobs slower when working with RDDs.
R is a statistical programming language. It provides a rich library for analyzing and manipulating data, which is why it is very popular among data analysts, statisticians, and data scientists. Spark's R integration gives data scientists the flexibility required to work on big data. Like Python, SparkR also creates an R process for each executor to work on data transferred from the JVM.
Structured Query Language (SQL) is one of the most popular and powerful languages for working with tables stored in a database. SQL also enables non-programmers to work with big data. Spark provides Spark SQL, which is a distributed SQL query engine. We will learn about it in more detail in Chapter 6, Spark SQL.
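Spark SQL queries read much like standard SQL. As a standalone illustration of the kind of query involved, the sketch below uses Python's built-in sqlite3 module purely for demonstration (Spark SQL itself runs the same style of query in a distributed fashion; the table and data here are made up):

```python
import sqlite3

# an in-memory table standing in for a distributed dataset
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# the same aggregation would be valid as a Spark SQL query
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

This is exactly the appeal for non-programmers: the query describes *what* result is wanted, and the engine (sqlite here, Spark SQL on a cluster) decides *how* to compute it.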
As discussed earlier in this chapter, the main philosophy behind Spark is to provide a unified engine for creating different types of big data applications. Spark provides a variety of libraries to work with batch analytics, streaming, machine learning, and graph analysis.
It is not as if these kinds of processing were never done before Spark, but for every new big data problem, there was a new tool on the market; for example, for batch analysis, we had MapReduce, Hive, and Pig; for streaming, we had Apache Storm; and for machine learning, we had Mahout. Although these tools solve the problems that they are designed for, each of them has its own learning curve. This is where Spark brings an advantage: it provides a unified stack for solving all of these problems, with components designed for processing all kinds of big data. It also provides many libraries to read and write different kinds of data, such as JSON, CSV, and Parquet.
Here is an example of a Spark stack:
Having a unified stack brings lots of advantages. Let's look at some of them:
- First is code sharing and reusability. Components developed by the data engineering team can easily be integrated by the data science team to avoid code redundancy.
- Second, there is always a new tool coming onto the market to solve a different big data use case, and most developers struggle to learn each new tool and gain the expertise to use it efficiently. With Spark, developers only have to learn its core concepts once, which then lets them work on many different big data use cases.
- Third, its unified stack gives developers the power to explore new ideas without installing new tools.
The following diagram provides a high-level overview of the different big data applications powered by Spark:
Spark use cases
Spark Core defines the basic abstractions and functionality of Spark:
- The basic components, such as RDDs and DataFrames
- The APIs available to perform operations on these basic abstractions
- Shared or distributed variables, such as broadcast variables and accumulators
We shall look at them in more detail in the upcoming chapters.
Spark Core also defines all the basic functionalities, such as task management, memory management, basic I/O functionalities, and more. It's a good idea to have a look at the Spark code on GitHub (https://github.com/apache/spark).
Spark SQL is Spark's component that enables developers to process data with Structured Query Language (SQL). With Spark SQL, developers can work with structured and semi-structured data such as Hive tables, MySQL tables, Parquet files, Avro files, JSON files, CSV files, and more. An alternative way to process structured data is Hive, which processes structured data stored on HDFS using Hive Query Language (HQL). Hive internally uses MapReduce for its processing, and we shall see how Spark can deliver better performance than MapReduce. In the initial versions of Spark, structured data used to be defined as a schema RDD (another type of RDD). When data comes along with a schema, SQL becomes the first choice for processing it.
Using Spark SQL, business logic can easily be written in SQL and HQL. This enables data warehouse engineers with a good knowledge of SQL to make use of Spark for their extract, transform, load (ETL) processing. Hive projects can easily be migrated to Spark using Spark SQL, without changing the Hive scripts.
Spark SQL is also a first choice for data analysis and data warehousing. It enables data analysts to write ad hoc queries for their exploratory analysis. Spark provides a Spark SQL shell, where you can run SQL-like queries that are then executed on Spark. Spark internally converts the code into a chain of RDD computations, whereas Hive converts an HQL job into a series of MapReduce jobs. Using Spark SQL, developers can also make use of caching (a Spark feature that keeps data in memory), which can significantly increase the performance of their queries.
Spark Streaming is a package that is used to process streams of data in real time. There are many different types of real-time data streams; for example, an e-commerce website recording page visits in real time, credit card transactions, or a taxi provider app sending information about trips and the locations of drivers and passengers. In a nutshell, all of these applications are hosted on multiple web servers that generate event logs in real time.
Spark Streaming makes use of RDDs and defines some further APIs to process a data stream in real time. As Spark Streaming builds on RDDs and their APIs, it is easy for developers to learn it and apply it to their use cases without learning a whole new technology stack.
Spark 2.x introduced Structured Streaming, which makes use of DataFrames rather than RDDs to process the data stream. Using DataFrames as its computational abstraction brings all the benefits of the DataFrame API to stream processing. We shall discuss the benefits of DataFrames over RDDs in the coming chapters.
Spark Streaming has excellent integration with some of the most popular messaging systems, such as Apache Flume and Apache Kafka. It can easily be plugged into these systems to handle massive amounts of streaming data.
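Spark Streaming treats a live stream as a sequence of small batches (micro-batches), each of which is processed like a tiny batch job. A framework-free sketch of that model in plain Python (the event source and batch size are made up for illustration):

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Group an (endless) event stream into fixed-size micro-batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# stand-in for page-visit events arriving from web servers
events = ["home", "cart", "home", "checkout", "home"]

counts = []
for batch in micro_batches(events, batch_size=2):
    # each micro-batch is processed like a small batch job
    counts.append(sum(1 for e in batch if e == "home"))

print(counts)  # [1, 1, 1]
```

In real Spark Streaming, the batches are defined by a time interval rather than a count, and each batch is an RDD (or DataFrame, in Structured Streaming) processed across the cluster.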
It is difficult to run a machine-learning algorithm when your data is distributed across multiple machines: a calculation may depend on a data point that is stored or processed on a different executor. Data can be shuffled across executors and workers, but shuffling comes at a heavy cost. Spark provides a way to avoid repeatedly paying that cost: caching. Spark's ability to keep a large amount of data in memory makes it easy to write machine-learning algorithms. Spark's machine-learning library, MLlib, provides the following:
- Inbuilt machine-learning algorithms such as Classification, Regression, Clustering, and more
- Features such as pipelining, vector creation, and more
The previous algorithms and features are optimized to minimize data shuffling and to scale across the cluster.
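The value of caching for iterative machine-learning workloads can be sketched without Spark: load an expensive dataset once, and the repeated passes an algorithm makes over it stop paying the load cost each time. The `load_dataset` function below is a hypothetical stand-in for an expensive read; the sketch just counts loads:

```python
load_count = 0

def load_dataset():
    """Stand-in for an expensive read from disk or a remote store."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]

# without caching: every iteration re-reads the data
for _ in range(3):
    data = load_dataset()
uncached_loads = load_count

# with caching: read once, then iterate in memory
# (this is the effect Spark's cache()/persist() gives an iterative algorithm)
load_count = 0
cached = load_dataset()
for _ in range(3):
    data = cached
cached_loads = load_count

print(uncached_loads, cached_loads)  # 3 1
```

Iterative algorithms such as gradient descent make dozens of passes over the same data, so turning N loads into one is exactly where Spark's in-memory model pays off.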
Spark also has a component to process graph data. A graph consists of vertices and edges, where edges define the relationships between vertices. Some examples of graph data are customers' product ratings, social networks, Wikipedia pages and their links, flights between airports, and more.
Spark provides GraphX to process such data. GraphX makes use of RDDs for its computation and allows users to create vertices and edges with properties attached. Using GraphX, you can define and manipulate a graph or extract insights from it.
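A graph of vertices and edges with properties can be represented very simply. The sketch below (plain Python, not the GraphX API; the names and edges are made up) computes each vertex's degree in a tiny social network:

```python
# vertices carry a property (a name); edges are (src, dst) pairs
vertices = {1: "Alice", 2: "Bob", 3: "Carol"}
edges = [(1, 2), (1, 3), (2, 3)]

# degree = number of edges touching a vertex, a basic graph insight
degree = {v: 0 for v in vertices}
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

print({vertices[v]: d for v, d in degree.items()})
# {'Alice': 2, 'Bob': 2, 'Carol': 2}
```

GraphX exposes the same ideas (vertex and edge collections with properties, plus operators such as degree computation) but stores both collections as distributed RDDs.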
Spark provides a local mode for job execution, where both the driver and executors run within a single JVM on the client machine. This enables developers to quickly get started with Spark without creating a cluster. We will mostly use this mode of job execution throughout this book for our code examples, and explain the possible challenges of cluster mode wherever applicable. Spark also works with a variety of schedulers. Let's have a quick overview of them here.
Spark comes with its own scheduler, called the standalone scheduler. If you are running your Spark programs on a cluster that does not have a Hadoop installation, then there is a good chance that you are using Spark's default standalone scheduler.
YARN is the default scheduler of Hadoop. It is optimized for batch jobs such as MapReduce, Hive, and Pig. Most organizations already have Hadoop installed on their clusters; therefore, Spark provides the ability to run on YARN for job scheduling.
Spark also integrates well with Apache Mesos, which is built using the same principles as the Linux kernel. Unlike YARN, Apache Mesos is a general-purpose cluster manager that is not bound to the Hadoop ecosystem. Another difference between YARN and Mesos is that YARN is optimized for long-running batch workloads, whereas Mesos' ability to provide fine-grained, dynamic allocation of resources makes it better suited to streaming jobs.
Kubernetes is a general-purpose orchestration framework for running containerized applications. It provides features such as multi-tenancy (running different versions of Spark on a physical cluster) and namespace sharing. At the time of writing this book, the Kubernetes scheduler is still in the experimental stage. For more details on running a Spark application on Kubernetes, please refer to Spark's documentation.
People generally get confused between Hadoop and Spark and how they are related. The intention of this section is to discuss the differences between Hadoop and Spark, and also how they can be used together.
Hadoop is mainly a combination of the following components:
- HDFS
- MapReduce
- YARN
- Hive and Pig
HDFS is the storage layer where the underlying data is stored. HDFS provides features such as data replication, fault tolerance, high availability, and more. Hadoop is schema-on-read; that is, you don't have to specify a schema while writing data to Hadoop; rather, you can apply different schemas while reading the data. HDFS also provides different input formats, such as NLineInputFormat, and more. If you want to know more about these formats, I would recommend reading Hadoop: The Definitive Guide by Tom White.
Hadoop's MapReduce is a programming model used to process the data available on HDFS. It consists of four main phases: Map, Sort, Shuffle, and Reduce. One of the main differences between Hadoop and Spark is that Hadoop's MapReduce model is tightly coupled with the file formats of the data. On the other hand, Spark provides an abstraction to process the data using RDDs. An RDD is like a general-purpose container of distributed data. That's why Spark can integrate with a variety of data stores.
Another main difference between Hadoop and Spark is that Spark makes good use of memory: it can cache data in memory to avoid disk I/O. On the other hand, Hadoop's MapReduce jobs generally involve multiple disk I/O operations. Typically, a Hadoop job consists of multiple Map and Reduce jobs. This is known as MapReduce chaining. A MapReduce chain may look something like this: Map -> Reduce -> Map -> Map -> Reduce.
All of the reduce jobs write their output to HDFS for reliability; therefore, each subsequent map task has to read it back from HDFS. This involves multiple disk I/O operations and makes the overall processing slower. There have been several initiatives, such as Tez, within Hadoop to optimize MapReduce processing. As discussed earlier, Spark creates a DAG of operations and automatically optimizes disk reads.
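The cost difference can be sketched by counting simulated disk round-trips in the chained job above (a toy model with made-up helper functions; real systems are far more involved):

```python
disk_io = 0

def write_to_disk(data):
    global disk_io
    disk_io += 1          # each reduce output hits "HDFS"...
    return data

def read_from_disk(data):
    global disk_io
    disk_io += 1          # ...and the next map must read it back
    return data

# Map -> Reduce -> Map -> Map -> Reduce, MapReduce-style:
# every Reduce/Map stage boundary goes through "HDFS"
data = [1, 2, 3, 4]
data = [x * 2 for x in data]          # Map
data = write_to_disk([sum(data)])     # Reduce -> write to HDFS
data = read_from_disk(data)           # next Map reads it back
data = [x + 1 for x in data]          # Map
data = [x * 10 for x in data]         # Map (back-to-back maps need no I/O)
result = sum(data)                    # final Reduce

print(result, disk_io)  # 210 2
```

A DAG-aware engine like Spark can see the whole chain up front, keep the intermediate `[20]` in memory, and drive both simulated round-trips to zero; with longer chains, the savings multiply.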
Apart from the previous differences, Spark complements Hadoop by providing another way of processing data. As discussed earlier in this chapter, it integrates well with Hadoop components such as Hive, YARN, and HDFS. The following diagram shows what a typical Spark and Hadoop ecosystem looks like. Spark makes use of YARN for scheduling and running its tasks throughout the cluster:
Spark and Hadoop
In this chapter, we introduced Apache Spark and its architecture. We discussed the concepts of the driver program and executors, which are the core components of Spark.
We then briefly discussed the different programming APIs for Spark and its major components, including Spark Core, Spark SQL, Spark Streaming, and Spark GraphX.
Finally, we discussed some major differences between Spark and Hadoop and how they complement each other. In the next chapter, we will install Spark on an AWS EC2 instance and go through the different clients used to interact with Spark.