Spark RDD

Resilient Distributed Datasets (RDDs) are the basic building blocks of a Spark application. An RDD represents a read-only collection of objects distributed across multiple machines. Spark can distribute a collection of records using an RDD and process them in parallel on different machines.

In this chapter, we shall learn about the following:

    • What is an RDD?
    • How do you create RDDs?
    • Different operations available to work on RDDs
    • Important types of RDD
    • Caching an RDD
    • Partitions of an RDD
    • Drawbacks of using RDDs

The code examples in this chapter are written in Python and Scala only. If you wish to go through the Java and R APIs, you can visit the Spark documentation page at https://spark.apache.org/.

What is an RDD?

RDD is at the heart of every Spark application. Let's understand the meaning of each word in more detail:

  • Resilient: If we look up the meaning of resilient in the dictionary, we see that it means being able to recover quickly from difficult conditions. A Spark RDD has the ability to recreate itself if something goes wrong (see the sketch after this list). You must be wondering: why does it need to recreate itself? Remember how HDFS and other data stores achieve fault tolerance? Yes, these systems maintain a replica of the data on multiple machines so they can recover in case of failure. But, as discussed in Chapter 1, Introduction to Apache Spark, Spark is not a data store; Spark is an execution engine. It reads the data from source systems, transforms it, and loads it into the target system. If something goes wrong while performing any of these steps, we will lose the data. To provide...
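
One way to see how Spark can recreate an RDD is to look at its lineage: the chain of transformations Spark records so that it can recompute lost partitions instead of keeping replicas of the data. The following is a minimal sketch of inspecting that lineage with toDebugString(); the RDD names are illustrative:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a small RDD through a couple of transformations.
numbers = spark.sparkContext.parallelize(range(1, 10))
evens = numbers.filter(lambda n: n % 2 == 0)

# toDebugString() shows the lineage Spark would replay to rebuild
# lost partitions, rather than relying on replicated copies of the data.
print(evens.toDebugString().decode())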

Programming using RDDs

An RDD can be created in four ways:

  • Parallelize a collection: This is one of the easiest ways to create an RDD. You can use an existing collection from your program, such as a List, an Array, or a Set, among others, and ask Spark to distribute that collection across the cluster so it can be processed in parallel. A collection can be distributed with the help of parallelize(), as shown here:
#Python
numberRDD = spark.sparkContext.parallelize(range(1,10))
numberRDD.collect()

Out[4]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

The following code creates a similar RDD in Scala (note that Scala's 1 to 10 is inclusive of 10, so it produces ten elements, whereas Python's range(1, 10) stops at 9):

//scala
val numberRDD = spark.sparkContext.parallelize(1 to 10)
numberRDD.collect()

res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  • From an external dataset: Though parallelizing a collection is the easiest way to create an RDD, it is not the recommended approach for large datasets (a sketch of reading from an external file follows this list). Large datasets...
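
As a hedged sketch of the external-dataset approach, the following reads a text file into an RDD of lines; the path is only a placeholder:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# textFile() returns an RDD with one element per line of the input.
# The path is a placeholder; local, HDFS, and S3 paths all work.
linesRDD = spark.sparkContext.textFile("/path/to/input.txt")
print(linesRDD.take(5))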

Transformations and actions

We have discussed some basic operations for creating and manipulating RDDs. Now it is time to group them into two main categories:

  • Transformations
  • Actions

Transformation

As the name suggests, transformations help us transform existing RDDs. They always produce a new RDD as their output, and that RDD is computed lazily. In the previous examples, we discussed several transformations, such as map(), filter(), and reduceByKey().
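
As a minimal sketch of this laziness (the variable names are illustrative), the map() and filter() calls below only describe the computation; nothing runs until the collect() action is called:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

numbers = spark.sparkContext.parallelize(range(1, 10))

# Transformations: each call returns a new RDD, but no job runs yet.
squares = numbers.map(lambda n: n * n)
evenSquares = squares.filter(lambda n: n % 2 == 0)

# Action: triggers the actual computation and brings the result back.
print(evenSquares.collect())   # [4, 16, 36, 64]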

Transformations are of two types:

  • Narrow transformations
  • Wide transformations

Narrow transformations

Narrow transformations...

Types of RDDs

RDDs can be divided into multiple categories. Some examples include the following:

  • Hadoop RDD
  • Shuffled RDD
  • Pair RDD
  • Mapped RDD
  • Union RDD
  • JSON RDD
  • Filtered RDD
  • Double RDD
  • Vertex RDD

We will not discuss all of them here, as that is outside the scope of this chapter. But we will discuss one of the important types of RDD: pair RDDs.

Pair RDDs

A pair RDD is a special type of RDD that processes data in the form of key-value pairs. Pair RDDs are very useful because they enable basic functionality such as joins and aggregations, and Spark provides a number of special, optimized operations on them. If we recall the examples where we calculated the number of INFO and ERROR messages in...
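
As a minimal sketch of that idea, assuming a small, illustrative list of log lines, we can build a pair RDD of (level, 1) pairs and aggregate the counts with reduceByKey():

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data; a real job would read the log file instead.
logs = spark.sparkContext.parallelize([
    "INFO job started",
    "ERROR disk failure",
    "INFO job finished",
])

# Map each line to a (level, 1) pair, then sum the counts per key.
levelCounts = logs.map(lambda line: (line.split()[0], 1)) \
                  .reduceByKey(lambda a, b: a + b)
print(levelCounts.collect())   # [('INFO', 2), ('ERROR', 1)] (order may vary)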

Caching and checkpointing

Caching and checkpointing are two important features of Spark. These operations can significantly improve the performance of your Spark jobs.

Caching

Caching data in memory is one of the main features of Spark. You can cache large datasets in memory or on disk, depending on your cluster hardware. You might choose to cache your data in two scenarios:

  • Use the same RDD multiple times
  • Avoid recomputation of an RDD that involves heavy computation, such as join() and groupByKey()

If you want to run multiple actions on an RDD, it is a good idea to cache it in memory so that recomputation of the RDD can be avoided. For example, the following code first takes out a few elements...
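
A minimal sketch of this pattern, with illustrative data: the RDD is cached once, materialized by the first action, and reused by the second instead of being recomputed:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

words = spark.sparkContext.parallelize(["spark", "rdd", "cache", "spark"])
wordCounts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# cache() only marks the RDD for in-memory storage; nothing happens yet.
wordCounts.cache()

print(wordCounts.count())     # first action: computes the RDD and caches it
print(wordCounts.collect())   # second action: served from the cache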

Understanding partitions

Data partitioning plays a really important role in distributed computing, as it defines the degree of parallelism for an application. Understanding and defining partitions in the right way can significantly improve the performance of Spark jobs. There are two ways to control the degree of parallelism for RDD operations:

  • repartition() and coalesce()
  • partitionBy()

repartition() versus coalesce()

The partitions of an existing RDD can be changed using repartition() or coalesce(). These operations redistribute the RDD based on the number of partitions provided. repartition() can be used to increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. On...
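
A minimal sketch of the difference, with illustrative partition counts:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

numbers = spark.sparkContext.parallelize(range(100), 4)
print(numbers.getNumPartitions())   # 4

# repartition() can increase or decrease the partition count,
# but it performs a full shuffle of the data.
more = numbers.repartition(8)
print(more.getNumPartitions())      # 8

# coalesce() is meant for decreasing the partition count; by default
# it merges existing partitions and avoids a full shuffle.
fewer = numbers.coalesce(2)
print(fewer.getNumPartitions())     # 2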

Drawbacks of using RDDs

RDDs are compile-time type-safe. This means that, in the case of Scala and Java, if an operation is performed on an RDD that is not applicable to its underlying data type, Spark will raise a compile-time error, which helps avoid failures in production.

There are some drawbacks of using RDDs though:

  • RDD code can sometimes be very opaque. Developers might struggle to find out what exactly the code is trying to compute.
  • RDDs cannot be optimized by Spark, as Spark cannot look inside the lambda functions and optimize the operations. For example, if a filter() is applied after a wide transformation such as reduceByKey() or groupByKey(), Spark will not move the filter ahead of the wide transformation, even when doing so would reduce the amount of data shuffled (see the sketch after this list).
  • RDDs are slower in non-JVM languages such as Python and R. In the case of these languages, a Python/R virtual machine is created alongside the JVM. There...
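
As a minimal sketch of that limitation (the data is illustrative): both pipelines below compute the same result, but Spark executes each exactly as written, so it is up to the developer to place the filter() before the shuffle:

#Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 5), ("a", 3)])

# Executed exactly as written: the reduceByKey() shuffle runs first,
# and the filter is applied to its output.
summedThenFiltered = pairs.reduceByKey(lambda a, b: a + b) \
                          .filter(lambda kv: kv[0] == "a")

# With RDDs, the developer must reorder the operations manually
# so that less data is shuffled.
filteredThenSummed = pairs.filter(lambda kv: kv[0] == "a") \
                          .reduceByKey(lambda a, b: a + b)

print(summedThenFiltered.collect())   # [('a', 4)]
print(filteredThenSummed.collect())   # [('a', 4)]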

Summary

In this chapter, we first learned about the basic idea of an RDD. We then looked at how to create RDDs using different approaches: from an existing RDD, from an external data store, by parallelizing a collection, and from DataFrames and Datasets. We also looked at the different types of transformations and actions available on RDDs, and then discussed the different types of RDDs, especially pair RDDs. We also discussed the benefits of caching and checkpointing in Spark applications, then learned about partitions in more detail and how we can use partitioning to optimize our Spark jobs.

Finally, we discussed some of the drawbacks of using RDDs. In the next chapter, we'll discuss the DataFrame and Dataset APIs and see how they can overcome these challenges.
