Apache Spark for Data Science Cookbook

Product type: Book
Published: Dec 2016
ISBN-13: 9781785880100
Pages: 392
Edition: 1st Edition
Author: Padma Priya Chitturi

Persisting RDDs


This recipe shows how to persist an RDD. RDDs are lazily evaluated, and the same RDD is often reused several times. By default, Spark recomputes the RDD and all of its dependencies each time an action is called on it. This is expensive for iterative algorithms, which need the computed dataset many times. To avoid the repeated computation, Spark provides a mechanism for persisting the data in an RDD.

After the first action computes the RDD's contents, they can be stored in memory or on disk across the cluster. The next time an action depends on the RDD, it does not need to be recomputed from its dependencies.
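The difference between recomputing and reusing an RDD can be observed with a small experiment. The following is a minimal sketch, not part of the original recipe: the object name RecomputeDemo, the method mapInvocations, and the local[2] master are assumptions made for illustration. It uses an accumulator to count how often the map function runs across two actions, with and without cache().

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecomputeDemo {
  // Hypothetical helper: returns how many times the map function ran
  // while two actions were executed on the same RDD.
  def mapInvocations(sc: SparkContext, persist: Boolean): Long = {
    val calls = sc.longAccumulator("mapCalls")
    val lengths = sc.parallelize(Seq("ball", "cat", "church"), 2).map { word =>
      calls.add(1) // side effect used only to observe evaluation
      word.length
    }
    if (persist) lengths.cache() // marks the RDD; nothing is computed yet
    lengths.count()              // first action: computes the RDD
    lengths.reduce(_ + _)        // second action: recomputes unless cached
    if (persist) lengths.unpersist()
    calls.value
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("recompute-demo")
    val sc = new SparkContext(conf)
    try {
      println(s"map calls without cache: ${mapInvocations(sc, persist = false)}")
      println(s"map calls with cache:    ${mapInvocations(sc, persist = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

In a local run with no task retries, the uncached variant should report twice as many map calls as the cached one (6 versus 3 for the three input words). Accumulator updates made inside transformations are not guaranteed to be applied exactly once when tasks are re-executed, so this counter is a diagnostic aid rather than a reliable metric.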

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.

How to do it…

Let's see how to persist RDDs using the following code:

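import org.apache.spark.storage.StorageLevel
// Build (word, length) and (word, count) pair RDDs from comma-separated records
val inputRdd = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "julie,is,in,the,church"))
val wordsRdd = inputRdd.flatMap(record => record.split(","))
val wordLengthPairs = wordsRdd.map(word => (word, word.length))
val wordPairs = wordsRdd.map(word => (word, 1))
val reducedWordCountRdd = wordPairs.reduceByKey((x, y) => x + y)
val filteredWordLengthPairs = wordLengthPairs.filter { case (word, length) => length >= 3 }
// Mark reducedWordCountRdd for in-memory caching (nothing is computed yet)
reducedWordCountRdd.cache()
val joinedRdd = reducedWordCountRdd.join(filteredWordLengthPairs)
// Keep joinedRdd in memory, spilling partitions that do not fit to disk
joinedRdd.persist(StorageLevel.MEMORY_AND_DISK)
// The first action on each RDD computes and stores it; later actions reuse the stored data
val wordPairsCount = reducedWordCountRdd.count
val wordPairsCollection = reducedWordCountRdd.take(10)
val joinedRddCount = joinedRdd.count
val joinedPairs = joinedRdd.collect()
// Release the stored data once it is no longer needed
reducedWordCountRdd.unpersist()
joinedRdd.unpersist()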

How it works…

The call to cache() on reducedWordCountRdd marks the RDD to be kept in memory once it has been computed. The count action computes it for the first time and stores its partitions. When the take action is then invoked, it reads the cached elements of the RDD instead of recomputing them from the dependencies.

Spark defines levels of persistence, or StorageLevel values, for persisting RDDs. rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). In the preceding example, joinedRdd is persisted with the MEMORY_AND_DISK storage level, which keeps partitions in memory and writes those that do not fit to disk. It is good practice to call unpersist() on an RDD once it is no longer needed, which removes it from the cache manually.
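The storage level assigned to an RDD can be inspected with getStorageLevel. The sketch below is an illustration rather than part of the recipe: the object name StorageLevelDemo and the method levels are assumptions. It records the level of one RDD marked with cache() and another marked with persist(StorageLevel.MEMORY_AND_DISK), then records both again after unpersist().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  // Hypothetical helper: returns the storage levels of two RDDs
  // before and after they are unpersisted.
  def levels(sc: SparkContext): Seq[StorageLevel] = {
    val cached = sc.parallelize(1 to 100).cache()
    val spilled = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)
    val before = Seq(cached.getStorageLevel, spilled.getStorageLevel)
    cached.unpersist()
    spilled.unpersist()
    val after = Seq(cached.getStorageLevel, spilled.getStorageLevel)
    before ++ after
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("storage-level-demo")
    val sc = new SparkContext(conf)
    try {
      levels(sc).foreach(level => println(level.description))
    } finally {
      sc.stop()
    }
  }
}
```

The first entry should equal StorageLevel.MEMORY_ONLY, which confirms what cache() is shorthand for, and both entries recorded after unpersist() should equal StorageLevel.NONE.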

There's more…

Spark defines various levels of persistence, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_2, and so on. Deciding when to cache or persist data can be an art, and the decision typically involves a trade-off between space and speed. If you try to cache more data than fits in memory, Spark evicts old partitions using a least recently used (LRU) policy. With memory-only levels, evicted partitions are recomputed when they are next needed; with memory-and-disk levels, they are written to disk instead. In general, RDDs should be persisted when they are likely to be referenced by multiple actions and are expensive to regenerate.
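The trade-off described above can be written down as a rough rule of thumb. The following sketch is purely illustrative: PersistencePolicy and chooseLevel are hypothetical names, not part of the Spark API, and the three Boolean inputs stand in for judgments you would make about your own job.

```scala
import org.apache.spark.storage.StorageLevel

object PersistencePolicy {
  // Hypothetical helper encoding a rule of thumb; None means "do not persist".
  def chooseLevel(reusedByMultipleActions: Boolean,
                  fitsInMemory: Boolean,
                  expensiveToRecompute: Boolean): Option[StorageLevel] =
    if (!reusedByMultipleActions) None                  // no reuse, nothing to gain
    else if (fitsInMemory) Some(StorageLevel.MEMORY_ONLY) // fastest access
    else if (expensiveToRecompute) Some(StorageLevel.MEMORY_AND_DISK) // spill instead of recomputing
    else Some(StorageLevel.MEMORY_ONLY)                 // cheap to rebuild evicted partitions
}
```

A caller could apply the result with something like chooseLevel(true, false, true).foreach(level => rdd.persist(level)), where rdd is any RDD in scope. Real decisions usually also weigh serialized and replicated levels, which this sketch leaves out.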

See also

Please refer to http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence to gain a detailed understanding of persistence in Spark.
