I/O is critical when you are reading data from or writing data to disk. The major factor you need to consider here is data locality.
The closer your compute is to your data, the better the performance of your jobs. Moving compute to your data is preferable to moving data to your compute, and therein lies the concept of data locality. A process and its data can be quite close to each other, or in some cases on entirely different nodes. The locality levels are defined as follows:
PROCESS_LOCAL
: This is the best possible option: the data resides in the same JVM as the process, and hence is local to the process.

NODE_LOCAL
: The data is not in the same JVM, but it is on the same node. This still provides fast access to the data, although slower than PROCESS_LOCAL, since the data has to be transferred from the disk or from another process.

RACK_LOCAL
: A rack can contain multiple servers. This option indicates that the data is on a different node, but on the same rack as the current node.

ANY
: The data is elsewhere on the network and not on the same rack.

NO_PREF
: No preference is given to the locality of the data; it can be accessed equally quickly from anywhere. This applies, for example, when Spark is unable to determine preferred locations.
Spark realises the importance of data locality and hence tries its best to schedule tasks with optimal data locality. However, this cannot be guaranteed, so Spark offers configuration options that let you control the fallback behaviour. Here's what Spark can do:
- Try to schedule a task on the node where the data is resident.
- If that node is busy, wait up to a configured number of seconds before scheduling the task on another node with a free CPU.
- Don't wait: simply start the task wherever there is free capacity, and accept the performance penalty of moving the data.
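This wait-then-fall-back behaviour is commonly known as delay scheduling. The following is a minimal, illustrative sketch of the idea in plain Python; it is not Spark's actual scheduler code. The level names mirror Spark's locality levels, but the timing model and structure are simplified assumptions:

```python
# Locality levels in preference order (best first).
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]

def schedule(slot_free_at, wait_per_level, now=0.0):
    """Illustrative delay scheduling: starting from the best locality level,
    wait up to wait_per_level seconds for a slot at that level to free up
    before falling back to the next (less local) level.

    slot_free_at maps a locality level to the time at which a slot at that
    level becomes free; levels missing from the map never free up.
    Returns the chosen level and the task's start time.
    """
    deadline = now
    for level in LEVELS:
        deadline += wait_per_level.get(level, 0.0)
        free_at = slot_free_at.get(level)
        if free_at is not None and free_at <= deadline:
            # A slot at this level frees up before we give up waiting on it.
            return level, max(free_at, now)
    # Nothing freed up within the waits: run anywhere, immediately.
    return "ANY", now

# A process-local slot only frees up after 5s, but we wait at most 3s per
# level, so the task falls back to a level that is available in time.
level, start = schedule(
    slot_free_at={"PROCESS_LOCAL": 5.0, "RACK_LOCAL": 2.0},
    wait_per_level={"PROCESS_LOCAL": 3.0, "NODE_LOCAL": 3.0, "RACK_LOCAL": 3.0},
)
# level == "RACK_LOCAL"
```

The trade-off this sketch captures is exactly the one the configuration options below expose: a longer wait buys better locality at the cost of idle time.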
The relevant data locality configuration options are documented at the following link http://bit.ly/2lRX60u.
| Property Name | Default | Description |
| --- | --- | --- |
| `spark.locality.wait` | 3s | How long to wait to launch a data-local task before giving up and launching it on a less local node. The same wait is used to step through multiple locality levels (process-local, node-local, rack-local, and then any). It is also possible to customize the waiting time for each level by setting `spark.locality.wait.node`, and so on. |
| `spark.locality.wait.node` | `spark.locality.wait` | Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information). |
| `spark.locality.wait.process` | `spark.locality.wait` | Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process. |
| `spark.locality.wait.rack` | `spark.locality.wait` | Customize the locality wait for rack locality. |
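For instance, you could raise the overall locality wait for long-running tasks while skipping the node-local wait entirely. The property names below are the real Spark settings from the table above, but the values are purely illustrative, not recommendations:

```
# spark-defaults.conf (or pass each line via --conf on spark-submit)
spark.locality.wait        10s
spark.locality.wait.node   0s
spark.locality.wait.rack   5s
```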
Optimal data caching can have a dramatic impact on performance. For a data set that you need to access multiple times, it is best to cache it rather than recreate it from scratch on every use. However, please don't go overboard with caching: you would be amazed how often we see this in practice. Use this option judiciously.
Data caching goes hand in hand with serialization, so make sure you pick the best serialization mechanism available for your objects.
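As a language-agnostic illustration of both points (cache what you reuse, and mind how the cached data is serialized) here is a small Python sketch. This is not Spark's API: in Spark you would call `persist()` with an appropriate storage level and typically configure Kryo serialization; the function and cache here are hypothetical stand-ins:

```python
import json
import pickle

# Stand-in for a data set you would otherwise recreate from scratch on
# every use; in Spark this would be the result of a transformation chain.
def expensive_dataset():
    return [{"id": i, "score": i * 0.5} for i in range(10_000)]

# Cache the result once instead of recomputing it on each access.
_cache = {}
def get_dataset():
    if "data" not in _cache:
        _cache["data"] = expensive_dataset()
    return _cache["data"]

# The serialization mechanism determines the cached data's footprint:
# the same records can serialize to very different sizes and speeds.
data = get_dataset()
as_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
as_json = json.dumps(data).encode("utf-8")
print(len(as_pickle), len(as_json))
```

The same reasoning applies in Spark: a serialized storage level trades CPU (deserialization on access) for memory footprint, and an efficient serializer shrinks both the cache and any data that must move across the network.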