Search icon
Subscription
0
Cart icon
Close icon
You have no products in your basket yet
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Apache Spark Quick Start Guide
Apache Spark Quick Start Guide

Apache Spark Quick Start Guide: Quickly learn the art of writing efficient big data applications with Apache Spark

By Shrey Mehrotra , Akash Grade
$15.99 per month
Book Jan 2019 154 pages 1st Edition
eBook
$22.99 $15.99
Print
$32.99
Subscription
$15.99 Monthly
eBook
$22.99 $15.99
Print
$32.99
Subscription
$15.99 Monthly

What do you get with a Packt Subscription?

Free for first 7 days. $15.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details


Publication date : Jan 31, 2019
Length 154 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789349108
Vendor :
Apache
Category :
Concepts :
Table of content icon View table of contents Preview book icon Preview Book

Apache Spark Quick Start Guide

Introduction to Apache Spark

Apache Spark is an open source framework for processing large datasets stored in heterogeneous data stores in an efficient and fast way. Sophisticated analytical algorithms can be easily executed on these large datasets. Spark can execute a distributed program 100 times faster than MapReduce. As Spark is one of the fast-growing projects in the open source community, it provides a large number of libraries to its users.

We shall cover the following topics in this chapter:

  • A brief introduction to Spark
  • Spark architecture and the different languages that can be used for coding Spark applications
  • Spark components and how these components can be used together to solve a variety of use cases
  • A comparison between Spark and Hadoop

What is Spark?

Apache Spark is a distributed computing framework which makes big-data processing quite easy, fast, and scalable. You must be wondering what makes Spark so popular in the industry, and how is it really different than the existing tools available for big-data processing? The reason is that it provides a unified stack for processing all different kinds of big data, be it batch, streaming, machine learning, or graph data.

Spark was developed at UC Berkeley’s AMPLab in 2009 and later came under the Apache Umbrella in 2010. The framework is mainly written in Scala and Java.

Spark provides an interface with many different distributed and non-distributed data stores, such as Hadoop Distributed File System (HDFS), Cassandra, Openstack Swift, Amazon S3, and Kudu. It also provides a wide variety of language APIs to perform analytics on the data stored in these data stores. These APIs include Scala, Java, Python, and R.

The basic entity of Spark is Resilient Distributed Dataset (RDD), which is a read-only partitioned collection of data. RDD can be created using data stored on different data stores or using existing RDD. We shall discuss this in more detail in Chapter 3, Spark RDD.

Spark needs a resource manager to distribute and execute its tasks. By default, Spark comes up with its own standalone scheduler, but it integrates easily with Apache Mesos and Yet Another Resource Negotiator (YARN) for cluster resource management and task execution.

One of the main features of Spark is to keep a large amount of data in memory for faster execution. It also has a component that generates a Directed Acyclic Graph (DAG) of operations based on the user program. We shall discuss these in more details in coming chapters.

The following diagram shows some of the popular data stores Spark can connect to:

Data stores
Spark is a computing engine, and should not be considered as a storage system as well. Spark is also not designed for cluster management. For this purpose, frameworks such as Mesos and YARN are used.

Spark architecture overview

Spark follows a master-slave architecture, as it allows it to scale on demand. Spark's architecture has two main components:

  • Driver Program: A driver program is where a user writes Spark code using either Scala, Java, Python, or R APIs. It is responsible for launching various parallel operations of the cluster.
  • Executor: Executor is the Java Virtual Machine (JVM) that runs on a worker node of the cluster. Executor provides hardware resources for running the tasks launched by the driver program.

As soon as a Spark job is submitted, the driver program launches various operation on each executor. Driver and executors together make an application.

The following diagram demonstrates the relationships between Driver, Workers, and Executors. As the first step, a driver process parses the user code (Spark Program) and creates multiple executors on each worker node. The driver process not only forks the executors on work machines, but also sends tasks to these executors to run the entire application in parallel.

Once the computation is completed, the output is either sent to the driver program or saved on to the file system:

Driver, Workers, and Executors

Spark language APIs

Spark has integration with a variety of programming languages such as Scala, Java, Python, and R. Developers can write their Spark program in either of these languages. This freedom of language is also one of the reasons why Spark is popular among developers. If you compare this to Hadoop MapReduce, in MapReduce, the developers had only one choice: Java, which made it difficult for developers from another programming languages to work on MapReduce.

Scala

Scala is the primary language for Spark. More than 70% of Spark's code is written in Scalable Language (Scala). Scala is a fairly new language. It was developed by Martin Odersky in 2001, and it was first launched publicly in 2004. Like Java, Scala also generates a bytecode that runs on JVM. Scala brings advantages from both object-oriented and functional-oriented worlds. It provides dynamic programming without compromising on type safety. As Spark is primarily written in Scala, you can find almost all of the new libraries in Scala API.

Java

Most of us are familiar with Java. Java is a powerful object-oriented programming language. The majority of big data frameworks are written in Java, which provides rich libraries to connect and process data with these frameworks.

Python

Python is a functional programming language. It was developed by Guido van Rossum and was first released in 1991. For some time, Python was not popular among developers, but later, around 2006-07, it introduced some libraries such as Numerical Python (NumPy) and Pandas, which became cornerstones and made Python popular among all types of programmers. In Spark, when the driver launches executors on worker nodes, it also starts a Python interpreter for each executor. In the case of RDD, the data is first shipped into the JVMs, and is then transferred to Python, which makes the job slow when working with RDDs.

R

R is a statistical programming language. It provides a rich library for analyzing and manipulating the data, which is why it is very popular among data analysts, statisticians, and data scientists. Spark R integration is a way to provide data scientists the flexibility required to work on big data. Like Python, SparkR also creates an R process for each executor to work on data transferred from the JVM.

SQL

Structured Query Language (SQL) is one of the most popular and powerful languages for working with tables stored in the database. SQL also enables non-programmers to work with big data. Spark provides Spark SQL, which is a distributed SQL query engine. We will learn about it in more detail in Chapter 6, Spark SQL.

Spark components

As discussed earlier in this chapter, the main philosophy behind Spark is to provide a unified engine for creating different types of big data applications. Spark provides a variety of libraries to work with batch analytics, streaming, machine learning, and graph analysis.

It is not as if these kinds of processing were never done before Spark, but for every new big data problem, there was a new tool in the market; for example, for batch analysis, we had MapReduce, Hive, and Pig. For Streaming, we had Apache Storm, for machine learning, we had Mahout. Although these tools solve the problems that they are designed for, each of them requires a learning curve. This is where Spark brings advantages. Spark provides a unified stack for solving all of these problems. It has components that are designed for processing all kinds of big data. It also provides many libraries to read or write different kinds of data such as JSON, CSV, and Parquet.

Here is an example of a Spark stack:

Spark stack

Having a unified stack brings lots of advantages. Let's look at some of the advantages:

  • First is code sharing and reusability. Components developed by the data engineering team can easily be integrated by the data science team to avoid code redundancy.
  • Secondly, there is always a new tool coming in the market to solve a different big data usecase. Most of the developers struggle to learn new tools and gain expertise in order to use them efficiently. With Spark, developers just have to learn the basic concepts which allows developers to work on different big data use cases.
  • Thirdly, its unified stack gives great power to the developers to explore new ideas without installing new tools.

The following diagram provides a high-level overview of different big-data applications powered by Spark:

Spark use cases

Spark Core

Spark Core is the main component of Spark. Spark Core defines the following:

  • The basic components, such as RDD and DataFrames
  • The APIs available to perform operations on these basic abstractions
  • Shared or distributed variables, such as broadcast variables and accumulators

We shall look at them in more detail in the upcoming chapters.

Spark Core also defines all the basic functionalities, such as task management, memory management, basic I/O functionalities, and more. It’s a good idea to have a look at the Spark code on GitHub (https://github.com/apache/spark).

Spark SQL

Spark SQL is where developers can work with structured and semi-structured data such as Hive tables, MySQL tables, Parquet files, AVRO files, JSON files, CSV files, and more. Another alternative to process structured data is using Hive. Hive processes structured data stored on HDFS using Hive Query Language (HQL). It internally uses MapReduce for its processing, and we shall see how Spark can deliver better performance than MapReduce. In the initial version of Spark, structured data used to be defined as schema RDD (another type of an RDD). When there is data along with the schema, SQL becomes the first choice of processing that data. Spark SQL is Spark's component that enables developers to process data with Structured Query Language (SQL).

Using Spark SQL, business logic can be easily written in SQL and HQL. This enables data warehouse engineers with a good knowledge of SQL to make use of Spark for their extract, transform, load (ETL) processing. Hive projects can easily be migrated on Spark using Spark SQL, without changing the Hive scripts.

Spark SQL is also the first choice for data analysis and data warehousing. Spark SQL enables the data analysts to write ad hoc queries for their exploratory analysis. Spark provides Spark SQL shell, where you can run the SQL-like queries and they get executed on Spark. Spark internally converts the code into a chain of RDD computations, while Hive converts the HQL job into a series of MapReduce jobs. Using Spark SQL, developers can also make use of caching (a Spark feature that enables data to be kept in memory), which can significantly increase the performance of their queries.

Spark Streaming

Spark Streaming is a package that is used to process a stream of data in real time. There can be many different types of a real-time stream of data; for example, an e-commerce website recording page visits in real time, credit card transactions, a taxi provider app sending information about trips and location information of drivers and passengers, and more. In a nutshell, all of these applications are hosted on multiple web servers that generate event logs in real time.

Spark Streaming makes use of RDD and defines some more APIs to process the data stream in real time. As Spark Streaming makes use of RDD and its APIs, it is easy for developers to learn and execute the use cases without learning a whole new technology stack.

Spark 2.x introduced structured streaming, which makes use of DataFrames rather than RDD to process the data stream. Using DataFrames as its computation abstraction brings all the benefits of the DataFrame API to stream processing. We shall discuss the benefits of DataFrames over RDD in coming chapters.

Spark Streaming has excellent integration with some of the most popular data messaging queues, such as Apache Flume and Kafka. It can be easily plugged into these queues to handle a massive amount of data streams.

Spark machine learning

It is difficult to run a machine-learning algorithm when your data is distributed across multiple machines. There might be a case when the calculation depends on another point that is stored or processed on a different executor. Data can be shuffling across executors or workers, but shuffle comes with a heavy cost. Spark provides a way to avoid shuffling data. Yes, it is caching. Spark's ability to keep a large amount of data in memory makes it easy to write machine-learning algorithms.

Spark MLlib and ML are the Spark’s packages to work with machine-learning algorithms. They provide the following:

  • Inbuilt machine-learning algorithms such as Classification, Regression, Clustering, and more
  • Features such as pipelining, vector creation, and more

The previous algorithms and features are optimized for data shuffle and to scale across the cluster.

Spark graph processing

Spark also has a component to process graph data. A graph consists of vertices and edges. Edges define the relationship between vertices. Some examples of graph data are customers's product ratings, social networks, Wikipedia pages and their links, airport flights, and more.

Spark provides GraphX to process such data. GraphX makes use of RDD for its computation and allows users to create vertices and edges with some properties. Using GraphX, you can define and manipulate a graph or get some insights from the graph.

GraphFrames is an external package that makes use of DataFrames instead of RDD, and defines vertex-edge relation using a DataFrame.

Cluster manager

Spark provides a local mode for the job execution, where both driver and executors run within a single JVM on the client machine. This enables developers to quickly get started with Spark without creating a cluster. We will mostly use this mode of job execution throughout this book for our code examples, and explain the possible challenges with a cluster mode whenever possible. Spark also works with a variety of schedules. Let’s have a quick overview of them here.

Standalone scheduler

Spark comes with its own scheduler, called a standalone scheduler. If you are running your Spark programs on a cluster that does not have a Hadoop installation, then there is a chance that you are using Spark’s default standalone scheduler.

YARN

YARN is the default scheduler of Hadoop. It is optimized for batch jobs such as MapReduce, Hive, and Pig. Most of the organizations already have Hadoop installed on their clusters; therefore, Spark provides the ability to configure it with YARN for the job scheduling.

Mesos

Spark also integrates well with Apache Mesos which is build using the same principles as the Linux kernel. Unlike YARN, Apache Mesos is general purpose cluster manager that does not bind to the Hadoop ecosystem. Another difference between YARN and Mesos is that YARN is optimized for the long-running batch workloads, whereas Mesos, ability to provide a fine-grained and dynamic allocation of resources makes it more optimized for streaming jobs.

Kubernetes

Kubernetes is as a general-purpose orchestration framework for running containerized applications. Kubernetes provides multiple features such as multi-tenancy (running different versions of Spark on a physical cluster) and sharing of the namespace. At the time of writing this book, the Kubernetes scheduler is still in the experimental stage. For more details on running a Spark application on Kubernetes, please refer to Spark's documentation.

Making the most of Hadoop and Spark

People generally get confused between Hadoop and Spark and how they are related. The intention of this section is to discuss the differences between Hadoop and Spark, and also how they can be used together.

Hadoop is mainly a combination of the following components:

  • Hive and Pig
  • MapReduce
  • YARN
  • HDFS

HDFS is the storage layer where underlying data can be stored. HDFS provides features such as the replication of the data, fault tolerance, high availability, and more. Hadoop is schema-on-read; for instance, you don’t have to specify the schema while writing the data to Hadoop, rather, you can use different schemas while reading the data. HDFS also provides different types of files formats, such as TextInputFormat, SequenceFile, NLInputFormat, and more. If you want to know more about these file formats, I would recommend reading Hadoop: The Definitive Guide by Tom White.

Hadoop’s MapReduce is a programming model used to process the data available on HDFS. It consists of four main phases: Map, Sort, Shuffle, and Reduce. One of the main differences between Hadoop and Spark is that Hadoop’s MapReduce model is tightly coupled with the file formats of the data. On the other hand, Spark provides an abstraction to process the data using RDD. RDD is like a general-purpose container of distributed data. That’s why Spark can integrate with a variety of data stores.

Another main difference between Hadoop and Spark is that Spark makes good use of memory. It can cache data in memory to avoid disk I/O. On the other hand, Hadoop’s MapReduce jobs generally involve multiple disks I/O. Typically, a Hadoop job consists of multiple Map and Reduce jobs. This is known as MapReduce chaining. A MapReduce chain may look something like this: Map -> Reduce -> Map -> Map -> Reduce.

All of the reduce jobs write their output to HDFS for reliability; therefore, each map task next to it will have to read it from HDFS. This involves multiple disk I/O operations and makes overall processing slower. There have been several initiatives such as Tez within Hadoop to optimize MapReduce processing. As discussed earlier, Spark creates a DAG of operations and automatically optimizes the disk reads.

Apart from the previous differences, Spark complements Hadoop by providing another way of processing the data. As discussed earlier in this chapter, it integrates well with Hadoop components such as Hive, YARN, and HDFS. The following diagram shows a typical Spark and Hadoop ecosystem looks like. Spark makes use of YARN for scheduling and running its task throughout the cluster:

Spark and Hadoop

Summary

In this chapter, we introduced Apache Spark and its architecture. We discussed the concept of driver program and executors, which are the core components of Spark.

We then briefly discussed the different programming APIs for Spark, and its major components including Spark Core, Spark SQL, Spark Streaming, and Spark GraphX.

Finally, we discussed some major differences between Spark and Hadoop and how they complement each other. In the next chapter, we will install Spark on an AWS EC2 instance and go through different clients to interact with Spark.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Learn about the core concepts and the latest developments in Apache Spark
  • Master writing efficient big data applications with Spark’s built-in modules for SQL, Streaming, Machine Learning and Graph analysis
  • Get introduced to a variety of optimizations based on the actual experience

Description

Apache Spark is a ?exible framework that allows processing of batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to get started with Apache Spark 2.0 and write big data applications for a variety of use cases. It will also introduce you to Apache Spark – one of the most popular Big Data processing frameworks. Although this book is intended to help you get started with Apache Spark, but it also focuses on explaining the core concepts. This practical guide provides a quick start to the Spark 2.0 architecture and its components. It teaches you how to set up Spark on your local machine. As we move ahead, you will be introduced to resilient distributed datasets (RDDs) and DataFrame APIs, and their corresponding transformations and actions. Then, we move on to the life cycle of a Spark application and learn about the techniques used to debug slow-running applications. You will also go through Spark’s built-in modules for SQL, streaming, machine learning, and graph analysis. Finally, the book will lay out the best practices and optimization techniques that are key for writing efficient Spark applications. By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and you will be able to write and optimize Spark applications.

What you will learn

Learn core concepts such as RDDs, DataFrames, transformations, and more Set up a Spark development environment Choose the right APIs for your applications Understand Spark’s architecture and the execution ?ow of a Spark application Explore built-in modules for SQL, streaming, ML, and graph analysis Optimize your Spark job for better performance

What do you get with a Packt Subscription?

Free for first 7 days. $15.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details


Publication date : Jan 31, 2019
Length 154 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789349108
Vendor :
Apache
Category :
Concepts :

Table of Contents

10 Chapters
Preface Chevron down icon Chevron up icon
1. Introduction to Apache Spark Chevron down icon Chevron up icon
2. Apache Spark Installation Chevron down icon Chevron up icon
3. Spark RDD Chevron down icon Chevron up icon
4. Spark DataFrame and Dataset Chevron down icon Chevron up icon
5. Spark Architecture and Application Execution Flow Chevron down icon Chevron up icon
6. Spark SQL Chevron down icon Chevron up icon
7. Spark Streaming, Machine Learning, and Graph Analysis Chevron down icon Chevron up icon
8. Spark Optimizations Chevron down icon Chevron up icon
9. Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.