Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Learning Apache Spark 2

You're reading from  Learning Apache Spark 2

Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781785885136
Pages 356 pages
Edition 1st Edition
Languages

Table of Contents (18) Chapters

Learning Apache Spark 2
Credits
About the Author
About the Reviewers
www.packtpub.com
Customer Feedback
Preface
1. Architecture and Installation 2. Transformations and Actions with Spark RDDs 3. ETL with Spark 4. Spark SQL 5. Spark Streaming 6. Machine Learning with Spark 7. GraphX 8. Operating in Clustered Mode 9. Building a Recommendation System 10. Customer Churn Prediction 1. Theres More with Spark

I/O tuning


I/O is critical when you are reading data to/from disk. The major things that you need to consider are:

Data locality

The closer your computer is to your data, the better the performance for your jobs. Moving compute to your data is more optimal compared to moving data to your compute, and therein lies the concept of data locality. Process and data can be quite close to each other, or in some cases on entirely different nodes. The locality levels are defined as follows:

  • PROCESS_LOCAL: This is the best possible option where data resides in the same JVM as the process, and hence is called local to the process.
  • NODE_LOCAL: This indicates that the data is not in the same JVM, but is on the same node. This provides a fast way to access the data, despite it being slower than PROCESS_LOCAL, since the data has to be transferred from either the disk or another process.
  • RACK_LOCAL: There can be multiple servers in the RACK. This option indicates that the data is on the same rack as the current node.
  • ANY: This indicates that the data is elsewhere on the network but on the same Rack.
  • NO_PREF: No preference is given to the locality of the data and it is accessed quickly from anywhere. This is especially true in cases where Spark is unable to determine preferred locations.

Spark realises the importance of data locality and hence tries its best to schedule tasks with optimum data locality. However, this cannot be guaranteed and hence Spark offers configuration options for you to configure alternate options. Here's what Spark can do:

  • Try to schedule a task on the node where the data is resident.
  • In the case of the node being busy, wait until n number of seconds before trying to configure the task on another node with free CPU.
  • Don't wait - Simply start the task wherever it gets a chance and accepting the performance penalty of data movement.

The relevant data locality configuration options are documented at the following link http://bit.ly/2lRX60u.

Property Name

Default

Description

spark.locality.wait

3s

How long to wait to launch a data local task before giving up and launching it on a less local node. The same wait will be used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the waiting time for each level by setting spark.locality.wait.node, and so on. You should increase this setting if your tasks are long and see poor locality, but the default usually works well.

spark.locality.wait.node

spark.locality.wait

Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).

spark.locality.wait.process

spark.locality.wait

Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.

spark.locality.wait.rack

spark.locality.wait

Customize the locality wait for rack locality.

Optimum data caching can have a dramatic impact on performance. For a data set that you would need to access multiple times, instead of recreating it from scratch it is best to cache it. However, please don't go overboard with caching: you will be amazed to see how many times we see this in practice. Use this option judiciously.

Data caching goes hand in hand with serialization, so make sure you pick up the best serialization mechanism available for your objects.

lock icon The rest of the chapter is locked
arrow left Previous Chapter
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime}