You're reading from  Apache Spark 2.x Cookbook

Product type: Book
Published in: May 2017
Reading level: Intermediate
ISBN-13: 9781787127265
Edition: 1st Edition
Author (1)
Rishi Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects has also been named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger.

This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and providing me the freedom to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing with recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking the charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals.
I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.

Chapter 12. Optimizations and Performance Tuning

This chapter covers various optimization and performance tuning best practices when working with Spark.

The chapter is divided into the following recipes:

  • Optimizing memory
  • Leveraging speculation
  • Optimizing joins
  • Using compression to improve performance
  • Using serialization to improve performance
  • Optimizing the level of parallelism
  • Understanding project Tungsten

Optimizing memory


Spark is a complex distributed computing framework with many moving parts. Cluster resources such as memory, CPU, and network bandwidth can each become a bottleneck at different points. Because Spark is an in-memory compute framework, memory usually has the biggest impact.

Another issue is that it is common for Spark applications to use a huge amount of memory, sometimes more than 100 GB. This amount of memory usage is not common in traditional Java applications.

In Spark, there are two places where memory optimization is needed: one at the driver level and the other at the executor level. The following diagram shows the two levels (driver level and executor level) of operations in Spark:

How to do it...

  1. Set the driver memory using the spark-shell command:
     $ spark-shell --driver-memory 8g
  2. Set the driver memory using the spark-submit command:
     $ spark-submit --driver-memory 8g
  3. Set the executor memory using the spark-shell command:
     $ spark-shell --executor-memory 8g
  4. Set the...
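
One way to confirm that a driver memory setting took effect is to ask the JVM for its maximum heap. The following is a minimal sketch and not part of the recipe itself: the object name DriverMemoryCheck and its helper methods are hypothetical, and the only library call used is the standard Runtime.getRuntime.maxMemory. When pasted into spark-shell, the code runs in the driver JVM; the reported value can be somewhat lower than the configured one, depending on how the JVM accounts for its heap regions.

```scala
object DriverMemoryCheck {
  // Convert a byte count to gibibytes for readable output.
  def toGiB(bytes: Long): Double =
    bytes.toDouble / (1024L * 1024L * 1024L)

  // Maximum heap the current JVM will attempt to use.
  // Inside spark-shell, the current JVM is the driver.
  def maxHeapGiB(): Double =
    toGiB(Runtime.getRuntime.maxMemory)

  def main(args: Array[String]): Unit =
    println(f"Max JVM heap: ${maxHeapGiB()}%.2f GiB")
}
```

In a shell started with a driver memory of 8g, calling DriverMemoryCheck.main(Array()) should print a figure close to 8 GiB.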

Leveraging speculation


Like MapReduce, Spark uses speculation: if it suspects that a task is running on a straggler node, it launches an additional copy of that task. Speculation helps in the familiar situation where 95 percent or 99 percent of a job finishes really fast and then the rest gets stuck (we have all been there).

How to do it...

There are a few settings you can use to control speculation. The examples are provided only to show how to change values. Mostly, just turning on speculation is good enough:

  1. Setting spark.speculation (the default is false):
     $ spark-shell --conf spark.speculation=true
  2. Setting spark.speculation.interval (the default is 100 milliseconds), which denotes how often Spark examines tasks to see whether speculation is needed:
     $ spark-shell --conf spark.speculation.interval=200
  3. Setting spark.speculation.multiplier (the default is 1.5), which denotes how many times slower than the median a task has to be to become a candidate for speculation:
     $ spark-shell --conf spark.speculation.multiplier=1.5
  4. Setting spark...
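
To make the role of spark.speculation.multiplier concrete, the following sketch models the rule described above in plain Scala. It is a simplified illustration under stated assumptions, not Spark's scheduler code: the object SpeculationSketch and its methods are hypothetical names, durations are in milliseconds, and the real scheduler applies further conditions before it launches a duplicate task.

```scala
object SpeculationSketch {
  // Median of a non-empty sequence of finished-task durations (ms).
  def median(durationsMs: Seq[Long]): Double = {
    require(durationsMs.nonEmpty, "need at least one finished task")
    val sorted = durationsMs.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid).toDouble
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  // A running task is a speculation candidate when it has already run
  // longer than multiplier times the median of the finished tasks.
  def isCandidate(runningMs: Long,
                  finishedMs: Seq[Long],
                  multiplier: Double = 1.5): Boolean =
    runningMs > multiplier * median(finishedMs)
}
```

With finished tasks of 100, 110, 120, and 130 ms, the median is 115 ms and the threshold at the default multiplier is 172.5 ms, so a task that has run for 200 ms qualifies while one at 150 ms does not.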

Optimizing joins


This topic was covered briefly when discussing Spark SQL, but it is worth revisiting here, as joins are responsible for many optimization challenges.

There are primarily three types of joins in Spark:

  • Shuffle hash join (default):
    • Classic map-reduce type join
    • Shuffle both datasets based on output key
    • During reduce, join the datasets for same output key
  • Broadcast hash join:
    • When one dataset is small enough to fit in memory
  • Cartesian join:
    • When every row of one table is joined with every row of the other table

The easiest optimization is this: if one of the datasets is small enough to fit in memory, broadcast it to every compute node (a broadcast join). This case is very common, because data frequently needs to be combined with side data such as a dictionary.

Joins are slow mostly because too much data is shuffled over the network.
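
The idea behind a broadcast hash join can be sketched in plain Scala collections. This is a conceptual illustration only, not Spark's implementation: the object JoinSketch and the method broadcastHashJoin are hypothetical names. The small dataset is turned into an in-memory hash map once (in Spark, each node would hold such a copy), and the large dataset is scanned row by row and probed against the map, so the large side never needs to be shuffled.

```scala
object JoinSketch {
  // Inner join of a large dataset with a small one that fits in memory.
  def broadcastHashJoin[K, A, B](large: Seq[(K, A)],
                                 small: Seq[(K, B)]): Seq[(K, (A, B))] = {
    // Build phase: hash the small side by key.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    // Probe phase: stream the large side and look up matches.
    for {
      (k, a) <- large
      b <- lookup.getOrElse(k, Seq.empty)
    } yield (k, (a, b))
  }
}
```

For example, joining orders keyed by country code with a small dictionary of country names keeps only the orders whose code appears in the dictionary, and the dictionary is the only side that has to be copied around.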

How to do it...

You can check which execution strategy is being used with explain:

scala> mydf.explain
scala> mydf.queryExecution...

Using compression to improve performance


Data compression involves encoding information using fewer bits than the original representation. Compression has an important role to play in big data technologies. It makes both storage and transport of data more efficient.

When data is compressed, it becomes smaller, so both disk I/O and network I/O become faster. It also saves storage space. Every optimization has a cost, and the cost of compression comes in the form of added CPU cycles to compress and decompress data.
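
The trade-off can be observed with a small round trip in plain Scala. The sketch below is an illustration under stated assumptions: it uses gzip from the JDK (java.util.zip) as a stand-in codec because it needs no extra libraries, and the object name CompressionSketch is hypothetical. The compressed array is smaller for repetitive input, and both directions cost CPU time.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object CompressionSketch {
  // Compress a byte array with gzip.
  def compress(data: Array[Byte]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    try gzip.write(data) finally gzip.close()
    buffer.toByteArray
  }

  // Decompress a gzip byte array back to the original bytes.
  def decompress(data: Array[Byte]): Array[Byte] = {
    val gzip = new GZIPInputStream(new ByteArrayInputStream(data))
    try {
      val out = new ByteArrayOutputStream()
      val chunk = new Array[Byte](4096)
      var n = gzip.read(chunk)
      while (n != -1) {
        out.write(chunk, 0, n)
        n = gzip.read(chunk)
      }
      out.toByteArray
    } finally gzip.close()
  }
}
```

Compressing a highly repetitive input, such as the same word repeated thousands of times, yields an array far smaller than the original, and decompress restores the input byte for byte.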

Hadoop needs to split data to put it into blocks, irrespective of whether the data is compressed or not. Only a few compression formats are splittable.

The two most popular compression formats for big data loads are Lempel-Ziv-Oberhumer (LZO) and Snappy. Snappy is not splittable, while LZO is. Snappy, on the other hand, is a much faster format.

If the compression format is splittable like LZO, the input file is first split into blocks and then compressed. Since compression happened...

Using serialization to improve performance


Serialization plays an important part in distributed computing. There are two persistence (storage) levels that support serializing RDDs:

  • MEMORY_ONLY_SER: This stores RDDs as serialized objects. It will create one byte array per partition.
  • MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that do not fit in memory to disk.

How to do it...

The following are the steps to add appropriate persistence levels:

  1. Start the Spark shell:
     $ spark-shell
  2. Import the members of the StorageLevel object, which enumerates the persistence levels:
     scala> import org.apache.spark.storage.StorageLevel._
  3. Create a dataset:
     scala> val words = spark.read.textFile("words")
  4. Persist the dataset:
     scala> words.persist(MEMORY_ONLY_SER)

Though serialization reduces the memory footprint substantially, it adds extra CPU cycles due to deserialization.
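
What "stored as a serialized byte array" means can be sketched with plain Java serialization from the JDK. This is an illustration only, not how Spark manages its storage internally: the object SerializationSketch is a hypothetical name, and a Vector[String] stands in for the contents of one partition. The data lives as a single byte array until it is read, at which point deserialization rebuilds the objects and spends the extra CPU cycles mentioned above.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object SerializationSketch {
  // Turn the records of one (pretend) partition into a single byte array.
  def serialize(partition: Vector[String]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(partition) finally out.close()
    buffer.toByteArray
  }

  // Rebuild the records from the byte array; this is the step that
  // costs CPU every time the cached data is accessed.
  def deserialize(bytes: Array[Byte]): Vector[String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[Vector[String]] finally in.close()
  }
}
```

A round trip through serialize and deserialize returns a Vector equal to the original.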

Note

By default, Spark uses Java's serialization. Since the Java serialization is slow...

Optimizing the level of parallelism


Optimizing the level of parallelism is very important to fully utilize the cluster capacity. When reading from HDFS, the number of partitions is the same as the number of input splits, which is mostly the same as the number of blocks. The default block size in HDFS is 128 MB, and that works well for Spark too.

In this recipe, we will cover different ways to optimize the number of partitions.

How to do it...

Specify the number of partitions when loading a file into an RDD with the following steps:

  1. Start the Spark shell:
     $ spark-shell
  2. Load the RDD, passing the desired number of partitions as the second parameter (Spark treats it as a minimum):
     scala> sc.textFile("hdfs://localhost:9000/user/hduser/words",10)

Another approach is to change the default parallelism by performing the following steps:

  1. Start the Spark shell with the new value of default parallelism:
$ spark-shell --conf spark.default.parallelism=10

Note

Have the number of partitions two to three times the number of cores to...
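
The arithmetic behind these numbers can be sketched in plain Scala. The helper below is a hypothetical illustration, not a Spark API: the object PartitionSizing and its methods are invented names, inputSplits assumes one split per 128 MB HDFS block, and suggestedPartitions simply applies the two-to-three-times-cores rule of thumb from the note above.

```scala
object PartitionSizing {
  // Default HDFS block size of 128 MB, in bytes.
  val HdfsBlockBytes: Long = 128L * 1024 * 1024

  // Number of blocks (and so, mostly, input splits) for a file:
  // file size divided by block size, rounded up.
  def inputSplits(fileBytes: Long, blockBytes: Long = HdfsBlockBytes): Long =
    (fileBytes + blockBytes - 1) / blockBytes

  // Rule-of-thumb range: two to three partitions per available core.
  def suggestedPartitions(totalCores: Int): Range.Inclusive =
    (2 * totalCores) to (3 * totalCores)
}
```

Under these assumptions, a 1 GB file maps to 8 splits, while a cluster with 16 cores suggests somewhere between 32 and 48 partitions, so such a file would benefit from a higher partition count than its block count provides.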

Understanding project Tungsten


Project Tungsten, which started with Spark version 1.4, is the initiative to bring Spark closer to bare metal; it has since become a first-class, integral feature. The goal of this project is to substantially improve the memory and CPU efficiency of Spark applications and push the limits of the underlying hardware.

In distributed systems, conventional wisdom has been to always optimize network I/O, as that has been the scarcest and most bottlenecked resource. This trend has changed in the last few years. Network bandwidth in the last 5 years has increased from 1 gigabit per second to 10 gigabits per second. In fact, Amazon Web Services is poised to make 40 Gbps standard, and there are already instances available at 20 Gbps.

Along similar lines, disk bandwidth has increased from 50 MB/s to 500 MB/s, and solid state drives (SSDs) are being deployed more and more. Pruning unneeded input data and predicate push-down have made the effective speed gains even larger....
