Hands-On Big Data Analytics with PySpark

Transformations and Actions

Transformations and actions are the main building blocks of an Apache Spark program. In this chapter, we will look at how Spark transformations defer computation, and then at which transformations should be avoided. We will use the reduce and reduceByKey methods to carry out calculations on a dataset, and then perform actions that trigger the actual computation of the directed acyclic graph (DAG). By the end of this chapter, we will also have learned how to reuse the same RDD for different actions.

In this chapter, we will cover the following topics:

  • Using Spark transformations to defer computations to a later time
  • Avoiding transformations
  • Using the reduce and reduceByKey methods to calculate the result
  • Performing actions that trigger actual computations of our Directed Acyclic Graph (DAG)
  • Reusing the same RDD for different actions
...

Using Spark transformations to defer computations to a later time

Let's first understand how Spark creates its DAG. We will execute the DAG by issuing an action, and we will defer the decision about starting the job until the last possible moment to see what this deferral gives us.

Let's have a look at the code we will be using in this section.

First, we need to initialize Spark. This initialization is the same for every test we carry out; we need to perform it before we start using Spark, as shown in the following example:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class DeferComputations extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

Then, we have the actual test. Here, the test is called should defer computation. It is simple, but it shows a very powerful abstraction in Spark. We start by creating an RDD of InputRecord, as shown in the following example...
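
The original listing is elided here; the following is a minimal sketch of such a deferred-computation test, with the shape of InputRecord (a single field) assumed purely for illustration:

// InputRecord's fields are an assumption; define the case class at the top level,
// outside the test class, so Spark can serialize its instances
case class InputRecord(value: String)

test("should defer computation") {
  //given an RDD of InputRecord - building it does not execute anything
  val input = spark.makeRDD(List(InputRecord("a"), InputRecord("b")))

  //when we apply a transformation, Spark only extends the DAG; no job starts yet
  val mapped = input.map(record => record.value.toUpperCase)

  //then the job runs only once an action (here, collect) is issued
  assert(mapped.collect().toList == List("A", "B"))
}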

Avoiding transformations

In this section, we will look at the transformations that should be avoided. Here, we will focus on one particular transformation.

We will start by understanding the groupBy API. We will then investigate how data is partitioned when using groupBy, look at what a skewed partition is, and discuss why we should avoid skewed partitions.

Here, we are creating a list of transactions. UserTransaction is another model class that includes a userId and an amount. The following code block shows a typical setup in which we create a list of five transactions:

test("should trigger computations using actions") {
//given
val input = spark.makeRDD(
List(
UserTransaction(userId = "A", amount = 1001),
UserTransaction(userId = "A", amount = 100),
UserTransaction(userId = "A", amount = 102),
UserTransaction...
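
The listing above is cut off; assuming input is the RDD[UserTransaction] being built there, a minimal sketch of why groupBy is the transformation to avoid could look like this (the printout is purely illustrative):

//when we group by userId, every value for a key is shuffled to a single partition
val grouped = input.groupBy(_.userId)

//then a hot key such as "A" above ends up with far more records than the others,
//producing a skewed partition that a single task has to process on its own
grouped.collect().foreach { case (userId, transactions) =>
  println(s"$userId -> ${transactions.size} transactions")
}

Because every value for a key must travel to one partition before any work is done, a key with many transactions creates a skewed partition. Whenever the per-key values can be combined, reduceByKey (covered in the next section) is usually preferable, as it combines values on the map side before the shuffle.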

Using the reduce and reduceByKey methods to calculate the results

In this section, we will use the reduce and reduceByKey functions to calculate our results and understand the behavior of reduce. We will then compare the reduce and reduceByKey functions to determine which of them should be used in a particular use case.

We will first focus on the reduce API. First, we need to create an input of UserTransaction. We have the user transaction A with amount 10, B with amount 1, and A with amount 101. Let's say that we want to find out the global maximum. We are not interested in the data for the specific key, but in the global data. We want to scan it, take the maximum, and return it, as shown in the following example:

test("should use reduce API") {
//given
val input = spark.makeRDD(List(
UserTransaction("A", 10),
UserTransaction("B...

Performing actions that trigger computations

Spark has many more actions that trigger the DAG, and we should be aware of all of them because they are very important. In this section, we'll understand what can be an action in Spark, walk through the available actions, and test whether those actions behave as expected.

The first action we will cover is collect. Besides that, we have already covered two actions, reduce and reduceByKey, in the previous section. Both methods are actions because they return a single result.

First, we will create the input of our transactions and then apply some transformations just for testing purposes. We will keep only the transactions whose user contains A, key them by user using keyBy(_.userId), and then take only the amount of each required transaction, as shown in the following example:

test("should trigger computations using actions") {
//given
val input...
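
The listing is elided above. A hedged reconstruction of the setup described in the preceding paragraph, with the input values and the exact chain assumed from that description, could look like this:

test("should trigger computations using actions") {
  //given
  val input = spark.makeRDD(List(
    UserTransaction("A", 1001),
    UserTransaction("B", 100)
  ))

  //when - transformations only extend the DAG; no job is started yet
  val rdd = input
    .filter(_.userId.contains("A"))
    .keyBy(_.userId)
    .map(_._2.amount)

  //then - every action below walks the whole DAG and triggers a separate job
  assert(rdd.collect().toList == List(1001))
  assert(rdd.count() == 1)
}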

Reusing the same RDD for different actions

In this section, we will reuse the same RDD for different actions. First, we will minimize execution time by reusing the RDD. We will then look at caching and run a performance test for our code.

The following example is the test from the preceding section, slightly modified: here we record the start time with currentTimeMillis() and compute the elapsed result at the end, so we are measuring how long all of the executed actions take:

//then every call to an action means that we go back up the RDD chain;
//if we are loading data from an external filesystem (e.g. HDFS), every action
//means that we need to load it from that filesystem again
val start = System.currentTimeMillis()
println(rdd.collect().toList)
println(rdd.count())
println(rdd.first())
rdd.foreach(println(_))
rdd.foreachPartition(t => t.foreach(println(_)))
println(rdd.max())
println(rdd.min())
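
As the comments above note, every one of those actions re-evaluates the whole RDD chain, and re-reads the source if it lives on an external filesystem. A minimal sketch of the optimization this section builds toward is caching the RDD so that later actions reuse the materialized partitions; the timing variables here are illustrative:

//when we cache, the first action materializes the RDD and keeps its partitions in memory
val cached = rdd.cache()

val startCached = System.currentTimeMillis()
println(cached.collect().toList)  // runs the chain once and populates the cache
println(cached.count())           // answered from the cached partitions
println(cached.max())             // likewise, no recomputation of the chain
val resultCached = System.currentTimeMillis() - startCached
println(s"actions on the cached RDD took $resultCached ms")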

Summary

So, let's sum up this chapter. First, we used Spark transformations to defer computation to a later time, and then we learned which transformations should be avoided. Next, we looked at how to use reduce and reduceByKey to calculate our result both globally and per specific key. After that, we performed actions that triggered computations and learned that every action causes the data to be loaded and the chain to be evaluated again. To alleviate that problem, we learned how to reuse the same RDD for different actions.

In the next chapter, we'll be looking at the immutable design of the Spark engine.
