Scala and Spark for Big Data Analytics


Spark Tuning

"Harpists spend 90 percent of their lives tuning their harps and 10 percent playing out of tune."

- Igor Stravinsky

In this chapter, we will dig deeper into Apache Spark internals and see that, while Spark is great at making us feel as if we were using just another Scala collection, we must not forget that Spark actually runs on a distributed system, which requires some extra care. In a nutshell, the following topics will be covered in this chapter:

  • Monitoring Spark jobs
  • Spark configuration
  • Common mistakes in Spark app development
  • Optimization techniques

Monitoring Spark jobs

Spark provides a web UI for monitoring all jobs running or completed on the computing nodes (driver and executors). In this section, we will briefly discuss how to monitor Spark jobs using the Spark web UI, with appropriate examples. We will see how to monitor the progress of jobs (including submitted, queued, and running jobs). All the tabs in the Spark web UI will be discussed briefly. Finally, we will discuss Spark's logging procedure for better tuning.

Spark web interface

The web UI (also known as the Spark UI) is the web interface of a running Spark application, used to monitor the execution of jobs in a web browser such as Firefox or Google Chrome. When a SparkContext launches, a web UI that displays useful...
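
As a concrete illustration, consider the following minimal sketch (the application name, the small job, and the sleep duration are illustrative assumptions, not taken from the book). Launching it starts a local application whose web UI can then be browsed, by default, at http://localhost:4040:

import org.apache.spark.sql.SparkSession

object WebUiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WebUiDemo") // the name shown at the top of the web UI
      .master("local[*]")
      .getOrCreate()

    // Run a small job so that the Jobs and Stages tabs have something to display.
    val counts = spark.sparkContext.parallelize(1 to 1000000)
      .map(_ % 10)
      .countByValue()
    println(counts)

    Thread.sleep(60000) // keep the application, and hence its UI, alive for a minute
    spark.stop()
  }
}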

Spark configuration

There are a number of ways to configure your Spark jobs. In this section, we will discuss them. More specifically, as of the Spark 2.x release, there are three locations in which to configure the system (a short sketch follows the list):

  • Spark properties
  • Environment variables
  • Logging
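
As a minimal sketch of how these locations interact (the property and value below are illustrative assumptions): a property set programmatically takes precedence over the same property passed as a flag to spark-submit, which in turn takes precedence over an entry in conf/spark-defaults.conf:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ConfigPrecedenceDemo")
  .master("local[*]")
  .config("spark.executor.memory", "2g") // wins over spark-submit flags and spark-defaults.conf
  .getOrCreate()

// Inspect the effective value at runtime.
println(spark.conf.get("spark.executor.memory")) // prints 2g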

Spark properties

As discussed previously, Spark properties control most of the application-specific parameters and can be set using Spark's SparkConf object. Alternatively, these parameters can be set through Java system properties. SparkConf allows you to configure some of the common properties, as follows (the values shown are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")              // application name
  .setMaster("local[*]")            // master URL
  .setSparkHome("/usr/local/spark") // where Spark is installed on worker nodes (placeholder path)
  .setExecutorEnv("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk") // executor environment variable (placeholder)
val sc = new SparkContext(conf)
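
To illustrate the Java system properties alternative mentioned above (a minimal sketch; the property value is an illustrative assumption): the default SparkConf constructor loads every JVM system property whose name starts with spark., so the following has the same effect as calling setAppName directly:

import org.apache.spark.SparkConf

// Set a spark.* system property before the conf is created.
System.setProperty("spark.app.name", "SysPropApp")

val confFromProps = new SparkConf() // the default constructor picks up spark.* system properties
println(confFromProps.get("spark.app.name")) // prints SysPropApp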

Common mistakes in Spark app development

Mistakes that often occur include application failure, a slow job that gets stuck due to numerous factors, mistakes in aggregations, actions, or transformations, an exception in the main thread and, of course, Out Of Memory (OOM) errors.
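
As one concrete instance of the OOM category (a hedged sketch; the dataset size and names are illustrative), calling collect() on a large RDD pulls every element into the driver's memory and is a frequent cause of driver-side OOM errors:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OomDemo").master("local[*]").getOrCreate()
val bigRdd = spark.sparkContext.parallelize(1 to 100000000)

// Risky: materializes roughly 100 million integers on the driver.
// val everything = bigRdd.collect()

// Safer: aggregate on the executors, or fetch only a small preview.
println(bigRdd.count())
println(bigRdd.take(10).mkString(", "))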

Application failure

Most of the time, application failure happens because one or more stages eventually fail. As discussed earlier in this chapter, Spark jobs comprise several stages, and stages aren't executed independently: for instance, a processing stage can't take place before the relevant input-reading stage. So, if stage 1 executes successfully but stage 2 fails, the whole application eventually fails. This can be shown...
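
To make this concrete, here is a minimal sketch (the failing record is contrived for illustration) in which an exception thrown inside a single task fails its stage, after any configured retries, and thereby aborts the whole job:

import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StageFailureDemo").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 0, 4))

try {
  // The action triggers execution; the task that hits 100 / 0 throws,
  // its stage fails, and the failure propagates back to the driver.
  rdd.map(x => 100 / x).collect()
} catch {
  case e: SparkException => println(s"Job aborted: ${e.getMessage}")
}
spark.stop()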

Optimization techniques

There are several aspects of tuning Spark applications for better performance. In this section, we will discuss how we can further optimize our Spark applications by applying data serialization and by tuning main memory for better memory management. We can also optimize performance by tuning the data structures in our Scala code while developing Spark applications. Storage, on the other hand, can be maintained well by utilizing serialized RDD storage.

One of the most important aspects is garbage collection and its tuning, if you have written your Spark application using Java or Scala. We will look at how we can tune this, too, for optimized performance. For distributed environments and cluster-based systems, a level of parallelism and data locality has to be ensured. Moreover, performance could further be improved by using broadcast...
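
As a taste of two of the techniques named above (a hedged sketch: the serializer setting is the standard Kryo switch, while the lookup data and names are illustrative assumptions), the following enables Kryo data serialization and broadcasts a small lookup table to the executors instead of shipping it inside every task closure:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OptimizationDemo")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Broadcast a small lookup table once per executor.
val lookup = spark.sparkContext.broadcast(Map(1 -> "one", 2 -> "two"))
val labeled = spark.sparkContext.parallelize(Seq(1, 2, 1, 3))
  .map(k => lookup.value.getOrElse(k, "unknown"))
println(labeled.collect().mkString(", ")) // one, two, one, unknown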

Summary

In this chapter, we discussed some advanced topics of Spark aimed at making your Spark jobs perform better. We covered some basic techniques for tuning your Spark jobs, how to monitor jobs through the Spark web UI, and how to set Spark configuration parameters. We also discussed some common mistakes made by Spark users and provided recommendations. Finally, we discussed some optimization techniques that help in tuning Spark applications.

In the next chapter, you will see how to test and debug Spark applications to solve the most common issues.
