Scala and Spark for Big Data Analytics


Spark Tuning

"Harpists spend 90 percent of their lives tuning their harps and 10 percent playing out of tune."

- Igor Stravinsky

In this chapter, we will dig deeper into Apache Spark internals and see that, while Spark is great at making us feel as if we were using just another Scala collection, we must not forget that Spark actually runs on a distributed system, which requires some extra care. In a nutshell, the following topics will be covered in this chapter:

  • Monitoring Spark jobs
  • Spark configuration
  • Common mistakes in Spark app development
  • Optimization techniques

Monitoring Spark jobs

Spark provides a web UI for monitoring all jobs running or completed on the computing nodes (driver and executors). In this section, we will briefly discuss how to monitor Spark jobs using the Spark web UI, with appropriate examples. We will see how to monitor the progress of jobs (including submitted, queued, and running jobs). All the tabs in the Spark web UI will be discussed briefly. Finally, we will discuss Spark's logging procedure for better tuning.

Spark web interface

The web UI (also known as the Spark UI) is the web interface of a running Spark application, used to monitor the execution of jobs in a web browser such as Firefox or Google Chrome. When a SparkContext launches, a web UI that displays useful...
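
As a concrete illustration, consider the following minimal sketch (the application name, the small job, and the sleep duration are illustrative assumptions, not taken from the book). Launching it starts a local application whose web UI can then be browsed, by default, at http://localhost:4040:

import org.apache.spark.sql.SparkSession

object WebUiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WebUiDemo") // the name shown at the top of the web UI
      .master("local[*]")
      .getOrCreate()

    // Run a small job so that the Jobs and Stages tabs have something to display.
    val counts = spark.sparkContext.parallelize(1 to 1000000)
      .map(_ % 10)
      .countByValue()
    println(counts)

    Thread.sleep(60000) // keep the application, and hence its UI, alive for a minute
    spark.stop()
  }
}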

Spark configuration

There are a number of ways to configure your Spark jobs. In this section, we will discuss them. More specifically, as of the Spark 2.x release, there are three locations in which to configure the system (a short sketch follows the list):

  • Spark properties
  • Environment variables
  • Logging
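
As a minimal sketch of how these locations interact (the property and value below are illustrative assumptions): a property set programmatically takes precedence over the same property passed as a flag to spark-submit, which in turn takes precedence over an entry in conf/spark-defaults.conf:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ConfigPrecedenceDemo")
  .master("local[*]")
  .config("spark.executor.memory", "2g") // wins over spark-submit flags and spark-defaults.conf
  .getOrCreate()

// Inspect the effective value at runtime.
println(spark.conf.get("spark.executor.memory")) // prints 2g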

Spark properties

As discussed previously, Spark properties control most of the application-specific parameters and can be set using Spark's SparkConf object. Alternatively, these parameters can be set through Java system properties. SparkConf allows you to configure some of the common properties, as follows (the values shown are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")              // application name
  .setMaster("local[*]")            // master URL
  .setSparkHome("/usr/local/spark") // where Spark is installed on worker nodes (placeholder path)
  .setExecutorEnv("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk") // executor environment variable (placeholder)
val sc = new SparkContext(conf)
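
To illustrate the Java system properties alternative mentioned above (a minimal sketch; the property value is an illustrative assumption): the default SparkConf constructor loads every JVM system property whose name starts with spark., so the following has the same effect as calling setAppName directly:

import org.apache.spark.SparkConf

// Set a spark.* system property before the conf is created.
System.setProperty("spark.app.name", "SysPropApp")

val confFromProps = new SparkConf() // the default constructor picks up spark.* system properties
println(confFromProps.get("spark.app.name")) // prints SysPropApp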

Common mistakes in Spark app development

Mistakes that often occur include application failure, a slow job that gets stuck due to numerous factors, mistakes in aggregations, actions, or transformations, an exception in the main thread and, of course, Out Of Memory (OOM) errors.
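
As one concrete instance of the OOM category (a hedged sketch; the dataset size and names are illustrative), calling collect() on a large RDD pulls every element into the driver's memory and is a frequent cause of driver-side OOM errors:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OomDemo").master("local[*]").getOrCreate()
val bigRdd = spark.sparkContext.parallelize(1 to 100000000)

// Risky: materializes roughly 100 million integers on the driver.
// val everything = bigRdd.collect()

// Safer: aggregate on the executors, or fetch only a small preview.
println(bigRdd.count())
println(bigRdd.take(10).mkString(", "))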

Application failure

Most of the time, application failure happens because one or more stages eventually fail. As discussed earlier in this chapter, Spark jobs comprise several stages, and stages aren't executed independently: for instance, a processing stage can't take place before the relevant input-reading stage. So, if stage 1 executes successfully but stage 2 fails, the whole application eventually fails. This can be shown...
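
To make this concrete, here is a minimal sketch (the failing record is contrived for illustration) in which an exception thrown inside a single task fails its stage, after any configured retries, and thereby aborts the whole job:

import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StageFailureDemo").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 0, 4))

try {
  // The action triggers execution; the task that hits 100 / 0 throws,
  // its stage fails, and the failure propagates back to the driver.
  rdd.map(x => 100 / x).collect()
} catch {
  case e: SparkException => println(s"Job aborted: ${e.getMessage}")
}
spark.stop()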

Optimization techniques

There are several aspects of tuning Spark applications for better performance. In this section, we will discuss how we can further optimize our Spark applications by applying data serialization and by tuning main memory for better memory management. We can also optimize performance by tuning the data structures in our Scala code while developing Spark applications. Storage, on the other hand, can be maintained well by utilizing serialized RDD storage.

One of the most important aspects is garbage collection and its tuning, if you have written your Spark application using Java or Scala. We will look at how we can tune this, too, for optimized performance. For distributed environments and cluster-based systems, a level of parallelism and data locality has to be ensured. Moreover, performance could further be improved by using broadcast...
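
As a taste of two of the techniques named above (a hedged sketch: the serializer setting is the standard Kryo switch, while the lookup data and names are illustrative assumptions), the following enables Kryo data serialization and broadcasts a small lookup table to the executors instead of shipping it inside every task closure:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OptimizationDemo")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Broadcast a small lookup table once per executor.
val lookup = spark.sparkContext.broadcast(Map(1 -> "one", 2 -> "two"))
val labeled = spark.sparkContext.parallelize(Seq(1, 2, 1, 3))
  .map(k => lookup.value.getOrElse(k, "unknown"))
println(labeled.collect().mkString(", ")) // one, two, one, unknown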

Summary

In this chapter, we discussed some advanced topics of Spark aimed at making your Spark jobs perform better. We covered some basic techniques for tuning your Spark jobs, how to monitor jobs through the Spark web UI, and how to set Spark configuration parameters. We also discussed some common mistakes made by Spark users and provided recommendations. Finally, we discussed some optimization techniques that help in tuning Spark applications.

In the next chapter, you will see how to test and debug Spark applications to solve the most common issues.
