The last few chapters have laid the necessary groundwork to get Spark working. Now that you know how to load and save data in different ways, it's time for the big payoff: manipulating data. The API you'll use to manipulate your RDD is similar across languages but not identical. Unlike the previous chapters, each language is covered in its own section here; if you wish, you can read only the section pertaining to the language you are interested in using.
RDD is the primary low-level abstraction in Spark. As we discussed in the last chapter, the main programming APIs are Datasets/DataFrames. However, underneath it all, the data is represented as RDDs, so understanding and working with RDDs is important. From a structural view, RDDs are just a collection of elements: elements that can be operated on in parallel.
RDD stands for Resilient Distributed Dataset: it is distributed over a set of machines, and the transformations that produced it are captured so that an RDD can be recreated in case of a machine failure or memory corruption. One important aspect of this distributed parallel data representation is that RDDs are immutable, which means that every operation generates a new RDD. Manipulating your RDD in Scala is quite simple, especially if you are familiar with Scala's collection library. Many of the standard functions are available directly on Spark's RDDs, with the primary catch being...
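The immutability contract can be illustrated without a cluster: like an RDD transformation, the operation below builds a new collection and leaves its input untouched. This is a minimal plain-Python sketch of the pattern, not Spark code; the variable names are purely illustrative.

```python
# A stand-in for an RDD: transformations never mutate it.
original = [1, 2, 3]

# A "transformation": derive a brand-new collection from the old one,
# analogous to rdd.map(lambda x: x * 2) producing a new RDD.
doubled = [x * 2 for x in original]

print(doubled)   # [2, 4, 6]
print(original)  # [1, 2, 3] -- unchanged, just as the source RDD would be
```

Because nothing is modified in place, Spark is free to recompute any lost partition of `doubled` from `original` and the recorded transformation.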
Spark's Python API is more limited than its Java and Scala APIs, but it supports most of the core functionality.
The hallmark of a MapReduce system lies in two commands: map and reduce. You've seen the map function used in the earlier chapters. The map function takes a function that operates on each individual element of the input RDD and produces a new output element. For example, to produce a new RDD where one has been added to every number, you would use rdd.map(lambda x: x+1). It's important to understand that the map function and the other Spark functions do not transform the existing elements; instead, they return a new RDD with new elements. The reduce function takes a function that operates on pairs to combine all of the data into a single value, which is returned to the calling program. If you were to sum all the elements, you would use rdd.reduce(lambda x, y: x+y). The flatMap function is a useful utility that allows you to write a function that returns an...
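To make the semantics of these three operations concrete without a running cluster, here is a plain-Python sketch of their behavior on an ordinary list. The list stands in for an RDD; the Spark calls shown in the comments are the real API, but the surrounding code is only an analogy.

```python
from functools import reduce
from itertools import chain

data = [1, 2, 3, 4]

# map: apply a function to every element, producing a new collection;
# the input is left untouched.  Mirrors rdd.map(lambda x: x + 1).
mapped = [x + 1 for x in data]

# reduce: combine elements pairwise down to a single value, which is
# returned to the driver.  Mirrors rdd.reduce(lambda x, y: x + y).
total = reduce(lambda x, y: x + y, data)

# flatMap: each element may yield zero or more output elements, and the
# results are flattened into one collection.
# Mirrors rdd.flatMap(lambda x: range(x)).
flat = list(chain.from_iterable(range(x) for x in data))

print(mapped)  # [2, 3, 4, 5]
print(total)   # 10
print(flat)    # [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
```

Note that map always produces exactly one output element per input, while flatMap can expand or shrink the collection, which is why it is the natural choice for tasks such as splitting lines of text into words.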
Some references are as follows:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaRDD
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaDoubleRDD
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
Good examples of RDD transformations: https://github.com/JerryLead/SparkLearning/tree/master/src