Hands-On Big Data Analytics with PySpark

Transformations and Actions

Transformations and actions are the main building blocks of an Apache Spark program. In this chapter, we will look at how Spark transformations defer computation, and then at which transformations should be avoided. We will use the reduce and reduceByKey methods to carry out calculations on a dataset, and then perform actions that trigger the actual computation of the directed acyclic graph (DAG). By the end of this chapter, we will also have learned how to reuse the same RDD for different actions.

In this chapter, we will cover the following topics:

  • Using Spark transformations to defer computations to a later time
  • Avoiding transformations
  • Using the reduce and reduceByKey methods to calculate the result
  • Performing actions that trigger actual computations of our Directed Acyclic Graph (DAG)
  • Reusing the same RDD for different actions
...

Using Spark transformations to defer computations to a later time

Let's first understand how Spark creates its DAG. We will execute the DAG by issuing an action, and we will defer the decision about starting the job until the last possible moment to see what this deferral gives us.

Let's have a look at the code we will be using in this section.

First, we need to initialize Spark. This initialization is the same for every test we carry out; we need to perform it before we start using Spark, as shown in the following example:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class DeferComputations extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

Then, we have the actual test. Here, the test is called should defer computation. It is simple, but it shows a very powerful abstraction in Spark. We start by creating an RDD of InputRecord, as shown in the following example...
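
The original listing is elided here; the following is a minimal sketch of such a deferred-computation test, with the shape of InputRecord (a single field) assumed purely for illustration:

// InputRecord's fields are an assumption; define the case class at the top level,
// outside the test class, so Spark can serialize its instances
case class InputRecord(value: String)

test("should defer computation") {
  //given an RDD of InputRecord - building it does not execute anything
  val input = spark.makeRDD(List(InputRecord("a"), InputRecord("b")))

  //when we apply a transformation, Spark only extends the DAG; no job starts yet
  val mapped = input.map(record => record.value.toUpperCase)

  //then the job runs only once an action (here, collect) is issued
  assert(mapped.collect().toList == List("A", "B"))
}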

Avoiding transformations

In this section, we will look at the transformations that should be avoided. Here, we will focus on one particular transformation.

We will start by understanding the groupBy API. We will then investigate how data is partitioned when using groupBy, look at what a skewed partition is, and discuss why we should avoid skewed partitions.

Here, we are creating a list of transactions. UserTransaction is another model class that includes a userId and an amount. The following code block shows a typical setup in which we create a list of five transactions:

test("should trigger computations using actions") {
//given
val input = spark.makeRDD(
List(
UserTransaction(userId = "A", amount = 1001),
UserTransaction(userId = "A", amount = 100),
UserTransaction(userId = "A", amount = 102),
UserTransaction...
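
The listing above is cut off; assuming input is the RDD[UserTransaction] being built there, a minimal sketch of why groupBy is the transformation to avoid could look like this (the printout is purely illustrative):

//when we group by userId, every value for a key is shuffled to a single partition
val grouped = input.groupBy(_.userId)

//then a hot key such as "A" above ends up with far more records than the others,
//producing a skewed partition that a single task has to process on its own
grouped.collect().foreach { case (userId, transactions) =>
  println(s"$userId -> ${transactions.size} transactions")
}

Because every value for a key must travel to one partition before any work is done, a key with many transactions creates a skewed partition. Whenever the per-key values can be combined, reduceByKey (covered in the next section) is usually preferable, as it combines values on the map side before the shuffle.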

Using the reduce and reduceByKey methods to calculate the results

In this section, we will use the reduce and reduceByKey functions to calculate our results and understand the behavior of reduce. We will then compare the reduce and reduceByKey functions to determine which of them should be used in a particular use case.

We will first focus on the reduce API. First, we need to create an input of UserTransaction. We have the user transaction A with amount 10, B with amount 1, and A with amount 101. Let's say that we want to find out the global maximum. We are not interested in the data for the specific key, but in the global data. We want to scan it, take the maximum, and return it, as shown in the following example:

test("should use reduce API") {
//given
val input = spark.makeRDD(List(
UserTransaction("A", 10),
UserTransaction("B...

Performing actions that trigger computations

Spark has many more actions that trigger the DAG, and we should be aware of all of them because they are very important. In this section, we'll understand what can be an action in Spark, walk through the available actions, and test whether those actions behave as expected.

The first action we will cover is collect. Besides that, we have already covered two actions, reduce and reduceByKey, in the previous section. Both methods are actions because they return a single result.

First, we will create the input of our transactions and then apply some transformations just for testing purposes. We will keep only the transactions whose user contains A, key them by user using keyBy(_.userId), and then take only the amount of each required transaction, as shown in the following example:

test("should trigger computations using actions") {
//given
val input...
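
The listing is elided above. A hedged reconstruction of the setup described in the preceding paragraph, with the input values and the exact chain assumed from that description, could look like this:

test("should trigger computations using actions") {
  //given
  val input = spark.makeRDD(List(
    UserTransaction("A", 1001),
    UserTransaction("B", 100)
  ))

  //when - transformations only extend the DAG; no job is started yet
  val rdd = input
    .filter(_.userId.contains("A"))
    .keyBy(_.userId)
    .map(_._2.amount)

  //then - every action below walks the whole DAG and triggers a separate job
  assert(rdd.collect().toList == List(1001))
  assert(rdd.count() == 1)
}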

Reusing the same RDD for different actions

In this section, we will reuse the same RDD for different actions. First, we will minimize execution time by reusing the RDD. We will then look at caching and run a performance test for our code.

The following example is the test from the preceding section, slightly modified: here we record the start time with currentTimeMillis() and compute the elapsed result at the end, so we are measuring how long all of the executed actions take:

//then every call to an action means that we go back up the RDD chain;
//if we are loading data from an external filesystem (e.g. HDFS), every action
//means that we need to load it from that filesystem again
val start = System.currentTimeMillis()
println(rdd.collect().toList)
println(rdd.count())
println(rdd.first())
rdd.foreach(println(_))
rdd.foreachPartition(t => t.foreach(println(_)))
println(rdd.max())
println(rdd.min())
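
As the comments above note, every one of those actions re-evaluates the whole RDD chain, and re-reads the source if it lives on an external filesystem. A minimal sketch of the optimization this section builds toward is caching the RDD so that later actions reuse the materialized partitions; the timing variables here are illustrative:

//when we cache, the first action materializes the RDD and keeps its partitions in memory
val cached = rdd.cache()

val startCached = System.currentTimeMillis()
println(cached.collect().toList)  // runs the chain once and populates the cache
println(cached.count())           // answered from the cached partitions
println(cached.max())             // likewise, no recomputation of the chain
val resultCached = System.currentTimeMillis() - startCached
println(s"actions on the cached RDD took $resultCached ms")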

Summary

So, let's sum up this chapter. First, we used Spark transformations to defer computation to a later time, and then we learned which transformations should be avoided. Next, we looked at how to use reduce and reduceByKey to calculate our result both globally and per specific key. After that, we performed actions that triggered computations and learned that every action causes the data to be loaded and the chain to be evaluated again. To alleviate that problem, we learned how to reuse the same RDD for different actions.

In the next chapter, we'll be looking at the immutable design of the Spark engine.
