I/O is critical when you are reading data from or writing data to disk. The major factor you need to consider here is data locality.
The closer your compute is to your data, the better the performance of your jobs. Moving compute to your data is preferable to moving data to your compute, and therein lies the concept of data locality. A process and its data can be quite close to each other, or in some cases on entirely different nodes. The locality levels are defined as follows:
PROCESS_LOCAL
: This is the best possible option: the data resides in the same JVM as the process, and hence is local to the process.

NODE_LOCAL
: The data is not in the same JVM, but it is on the same node. This still provides fast access to the data, although slower than PROCESS_LOCAL, since the data has to be transferred from the disk or from another process.

RACK_LOCAL
: A rack can contain multiple servers. This option indicates that the data is on a different node, but on the same rack as the current node.

ANY
: The data is elsewhere on the network and not on the same rack.

NO_PREF
: No preference is given to the locality of the data; it can be accessed equally quickly from anywhere. This applies, for example, when Spark is unable to determine preferred locations.
Spark realises the importance of data locality and hence tries its best to schedule tasks with optimal data locality. However, this cannot be guaranteed, so Spark offers configuration options that let you control the fallback behaviour. Here's what Spark can do:
- Try to schedule a task on the node where the data is resident.
- If that node is busy, wait up to a configured number of seconds before scheduling the task on another node with a free CPU.
- Don't wait: simply start the task wherever there is free capacity, and accept the performance penalty of moving the data.
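This wait-then-fall-back behaviour is commonly known as delay scheduling. The following is a minimal, illustrative sketch of the idea in plain Python; it is not Spark's actual scheduler code. The level names mirror Spark's locality levels, but the timing model and structure are simplified assumptions:

```python
# Locality levels in preference order (best first).
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]

def schedule(slot_free_at, wait_per_level, now=0.0):
    """Illustrative delay scheduling: starting from the best locality level,
    wait up to wait_per_level seconds for a slot at that level to free up
    before falling back to the next (less local) level.

    slot_free_at maps a locality level to the time at which a slot at that
    level becomes free; levels missing from the map never free up.
    Returns the chosen level and the task's start time.
    """
    deadline = now
    for level in LEVELS:
        deadline += wait_per_level.get(level, 0.0)
        free_at = slot_free_at.get(level)
        if free_at is not None and free_at <= deadline:
            # A slot at this level frees up before we give up waiting on it.
            return level, max(free_at, now)
    # Nothing freed up within the waits: run anywhere, immediately.
    return "ANY", now

# A process-local slot only frees up after 5s, but we wait at most 3s per
# level, so the task falls back to a level that is available in time.
level, start = schedule(
    slot_free_at={"PROCESS_LOCAL": 5.0, "RACK_LOCAL": 2.0},
    wait_per_level={"PROCESS_LOCAL": 3.0, "NODE_LOCAL": 3.0, "RACK_LOCAL": 3.0},
)
# level == "RACK_LOCAL"
```

The trade-off this sketch captures is exactly the one the configuration options below expose: a longer wait buys better locality at the cost of idle time.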
The relevant data locality configuration options are documented at the following link http://bit.ly/2lRX60u.
| Property Name | Default | Description |
| --- | --- | --- |
| `spark.locality.wait` | 3s | How long to wait to launch a data-local task before giving up and launching it on a less local node. The same wait is used to step through multiple locality levels (process-local, node-local, rack-local, and then any). It is also possible to customize the waiting time for each level by setting `spark.locality.wait.node`, and so on. |
| `spark.locality.wait.node` | `spark.locality.wait` | Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information). |
| `spark.locality.wait.process` | `spark.locality.wait` | Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process. |
| `spark.locality.wait.rack` | `spark.locality.wait` | Customize the locality wait for rack locality. |
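For instance, you could raise the overall locality wait for long-running tasks while skipping the node-local wait entirely. The property names below are the real Spark settings from the table above, but the values are purely illustrative, not recommendations:

```
# spark-defaults.conf (or pass each line via --conf on spark-submit)
spark.locality.wait        10s
spark.locality.wait.node   0s
spark.locality.wait.rack   5s
```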
Optimal data caching can have a dramatic impact on performance. For a data set that you need to access multiple times, it is best to cache it rather than recreate it from scratch on every use. However, please don't go overboard with caching: you would be amazed how often we see this in practice. Use this option judiciously.
Data caching goes hand in hand with serialization, so make sure you pick the best serialization mechanism available for your objects.
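As a language-agnostic illustration of both points (cache what you reuse, and mind how the cached data is serialized) here is a small Python sketch. This is not Spark's API: in Spark you would call `persist()` with an appropriate storage level and typically configure Kryo serialization; the function and cache here are hypothetical stand-ins:

```python
import json
import pickle

# Stand-in for a data set you would otherwise recreate from scratch on
# every use; in Spark this would be the result of a transformation chain.
def expensive_dataset():
    return [{"id": i, "score": i * 0.5} for i in range(10_000)]

# Cache the result once instead of recomputing it on each access.
_cache = {}
def get_dataset():
    if "data" not in _cache:
        _cache["data"] = expensive_dataset()
    return _cache["data"]

# The serialization mechanism determines the cached data's footprint:
# the same records can serialize to very different sizes and speeds.
data = get_dataset()
as_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
as_json = json.dumps(data).encode("utf-8")
print(len(as_pickle), len(as_json))
```

The same reasoning applies in Spark: a serialized storage level trades CPU (deserialization on access) for memory footprint, and an efficient serializer shrinks both the cache and any data that must move across the network.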