You're reading from Apache Spark for Data Science Cookbook

Product type Book

Published in Dec 2016

ISBN-13 9781785880100

Pages 392 pages

Edition 1st Edition

Concepts

Data Science

Author (1):

Padma Priya Chitturi

Persisting RDDs

This recipe shows how to persist an RDD. RDDs are lazily evaluated, and it is often necessary to reuse the same RDD several times. By default, Spark recomputes the RDD and all of its dependencies each time we call an action on it. This is expensive for iterative algorithms, which need the computed dataset many times. To avoid computing an RDD repeatedly, Spark provides a mechanism for persisting its data.

After the first time an action computes the RDD's contents, they can be stored in memory or on disk across the cluster. The next time an action depends on the RDD, it need not be recomputed from its dependencies.
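The difference between recomputing and reusing can be made visible with a small sketch. It is not part of the original recipe: the helper name mapCalls is hypothetical, the local[2] master is an assumption for running outside a cluster, and sc.longAccumulator assumes Spark 2.0 or later. It counts how often a map function runs when two actions are called on the same RDD, with and without cache().

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's SparkContext if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))

val words = sc.parallelize(Seq("this", "is", "a", "ball", "it", "is"))

// Hypothetical helper: runs two actions on a mapped RDD and returns
// how many times the map function was invoked in total.
def mapCalls(persistFirst: Boolean): Long = {
  val calls = sc.longAccumulator("mapCalls")
  val lengths = words.map { w => calls.add(1); (w, w.length) }
  if (persistFirst) lengths.cache()
  lengths.count()   // first action: computes every partition
  lengths.count()   // second action: recomputes unless the RDD is cached
  if (persistFirst) lengths.unpersist()
  calls.value
}

val withoutCache = mapCalls(persistFirst = false)
val withCache = mapCalls(persistFirst = true)
println(s"map calls without cache: $withoutCache, with cache: $withCache")
```

In a small local run where every partition fits in memory, the uncached variant should invoke the map function twice per element (12 calls for 6 words) and the cached variant once per element (6 calls). Accumulator updates made inside transformations can be applied more than once if tasks are re-executed, so treat the counts as an illustration rather than a guarantee.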

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. The code uses sc, the SparkContext that the Spark shell provides.

How to do it…

Let's see how to persist RDDs using the following code:

import org.apache.spark.storage.StorageLevel

// Build (word, length) and (word, count) pair RDDs from comma-separated records.
val inputRdd = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "julie,is,in,the,church"))
val wordsRdd = inputRdd.flatMap(record => record.split(","))
val wordLengthPairs = wordsRdd.map(word => (word, word.length))
val wordPairs = wordsRdd.map(word => (word, 1))
val reducedWordCountRdd = wordPairs.reduceByKey((x, y) => x + y)
val filteredWordLengthPairs = wordLengthPairs.filter { case (word, length) => length >= 3 }

// Mark both RDDs for persistence; nothing is stored until an action runs.
reducedWordCountRdd.cache()
val joinedRdd = reducedWordCountRdd.join(filteredWordLengthPairs)
joinedRdd.persist(StorageLevel.MEMORY_AND_DISK)

// count materializes each RDD; take and collect then reuse the persisted partitions.
val wordPairsCount = reducedWordCountRdd.count
val wordPairsCollection = reducedWordCountRdd.take(10)
val joinedRddCount = joinedRdd.count
val joinedPairs = joinedRdd.collect()

// Release the persisted data once it is no longer needed.
reducedWordCountRdd.unpersist()
joinedRdd.unpersist()

How it works…

The call to cache() on reducedWordCountRdd marks the RDD to be kept in memory once it has been computed. The call itself is lazy and triggers no computation. The count action computes the RDD for the first time and stores its partitions. When the take action is invoked, it reads the cached partitions instead of recomputing them from the dependencies.

Spark defines levels of persistence, or StorageLevel values, for persisting RDDs. rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). In the preceding example, joinedRdd is persisted with the storage level MEMORY_AND_DISK, which keeps partitions in memory and writes those that do not fit to disk. It is good practice to unpersist an RDD once it is no longer needed; unpersist() removes it from the cache manually instead of waiting for it to be evicted.
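The relationship between cache(), persist(), and unpersist() can be checked with getStorageLevel. The following is a minimal sketch rather than part of the original recipe; the RDD contents and the local[2] master are assumptions chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuse the shell's SparkContext if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("storage-level-sketch").setMaster("local[2]"))

val numbers = sc.parallelize(1 to 100)

// Before any persist call, the storage level is NONE.
val initialLevel = numbers.getStorageLevel

// cache() assigns MEMORY_ONLY; nothing is stored until an action runs.
numbers.cache()
val cachedLevel = numbers.getStorageLevel

// Assigning a different level while one is already set is rejected.
val changeRejected =
  try { numbers.persist(StorageLevel.MEMORY_AND_DISK); false }
  catch { case _: UnsupportedOperationException => true }

// unpersist() drops any cached blocks and resets the level to NONE,
// after which a different level can be assigned.
numbers.unpersist()
val levelAfterUnpersist = numbers.getStorageLevel
numbers.persist(StorageLevel.MEMORY_AND_DISK)
val newLevel = numbers.getStorageLevel
numbers.unpersist()

println(s"$initialLevel | $cachedLevel | rejected=$changeRejected | $newLevel")
```

One practical consequence: to switch an RDD from one storage level to another, call unpersist() first and then persist() with the new level.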

There's more…

Spark defines various levels of persistence, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_2, and so on. Deciding when to cache or persist data involves trade-offs between space and speed. If you attempt to cache more data than fits in memory, Spark evicts old partitions using a least recently used (LRU) policy. For memory-only levels, evicted partitions are recomputed the next time they are needed; for memory-and-disk levels, they are written to disk instead. In general, RDDs should be persisted when they are likely to be referenced by multiple actions and are expensive to regenerate.
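Each StorageLevel is a combination of a few flags, which can help when choosing between levels. The sketch below is an illustration, not part of the original recipe. It prints the flags for a handful of predefined levels; the selection of levels is arbitrary, and no SparkContext is needed.

```scala
import org.apache.spark.storage.StorageLevel

// A few of the predefined levels, paired with their names for printing.
val levels = Seq(
  "MEMORY_ONLY"       -> StorageLevel.MEMORY_ONLY,
  "MEMORY_ONLY_SER"   -> StorageLevel.MEMORY_ONLY_SER,
  "MEMORY_AND_DISK"   -> StorageLevel.MEMORY_AND_DISK,
  "MEMORY_AND_DISK_2" -> StorageLevel.MEMORY_AND_DISK_2,
  "DISK_ONLY"         -> StorageLevel.DISK_ONLY)

for ((name, level) <- levels) {
  // useMemory / useDisk: where partitions may be kept
  // deserialized: stored as JVM objects rather than serialized bytes
  // replication: number of cluster nodes holding a copy of each partition
  println(s"$name: memory=${level.useMemory}, disk=${level.useDisk}, " +
    s"deserialized=${level.deserialized}, replicas=${level.replication}")
}
```

The _2 suffix marks levels that replicate each partition on two nodes, and the _SER levels trade extra CPU time for a more compact in-memory representation.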