Apache Spark 2.x Cookbook

Product type: Book
Published in: May 2017
Reading level: Intermediate
ISBN-13: 9781787127265
Edition: 1st Edition
Languages: Scala
Tools: Apache Spark
Concepts: Data Processing
Author: Rishi Yadav

Chapter 12. Optimizations and Performance Tuning

This chapter covers various optimization and performance tuning best practices when working with Spark.

The chapter is divided into the following recipes:

Optimizing memory
Leveraging speculation
Optimizing joins
Using compression to improve performance
Using serialization to improve performance
Optimizing the level of parallelism
Understanding project Tungsten

Optimizing memory

Spark is a complex distributed computing framework with many moving parts. Cluster resources such as memory, CPU, and network bandwidth can each become a bottleneck at different points. Because Spark is an in-memory compute framework, memory has the biggest impact.

Another issue is that it is common for Spark applications to use a huge amount of memory, sometimes more than 100 GB. This amount of memory usage is not common in traditional Java applications.

In Spark, there are two places where memory optimization is needed: one at the driver level and the other at the executor level. The following diagram shows the two levels (driver level and executor level) of operations in Spark:

How to do it...

Set the driver memory using the spark-shell command:

$ spark-shell --driver-memory 8g

Set the driver memory using the spark-submit command:

$ spark-submit --driver-memory 8g

Set the executor memory using the spark-shell command:

$ spark-shell --executor-memory 8g

Set the...

Leveraging speculation

Like MapReduce, Spark uses speculation to launch additional copies of a task if it suspects the task is running on a straggler node. A typical case is a job in which 95 or 99 percent of the work finishes really fast and the rest then gets stuck (we have all been there).

How to do it...

There are a few settings you can use to control speculation. The examples are provided only to show how to change values. In most cases, simply turning speculation on is good enough:

Setting spark.speculation (the default is false):

$ spark-shell --conf spark.speculation=true

Setting spark.speculation.interval (the default is 100 milliseconds), which is how often Spark examines tasks to see whether speculation is needed:

$ spark-shell --conf spark.speculation.interval=200

Setting spark.speculation.multiplier (the default is 1.5), which is how many times slower than the median a task has to be to become a candidate for speculation:

$ spark-shell --conf spark.speculation.multiplier=1.5

Setting spark...
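
To make the role of the multiplier concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes the simplified rule described above: a running task becomes a candidate once its elapsed time exceeds the multiplier times the median duration of the tasks that have already finished. The object and method names are hypothetical, and Spark's actual scheduler applies further conditions that this sketch leaves out.

```scala
// Illustrative only: a simplified model of the speculation check,
// not Spark's scheduler code. All names here are hypothetical.
object SpeculationSketch {
  // Median of the durations (in milliseconds) of tasks that have finished.
  def median(durationsMs: Seq[Long]): Double = {
    require(durationsMs.nonEmpty, "need at least one finished task")
    val sorted = durationsMs.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid).toDouble
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  // A running task is a candidate when it has already run longer than
  // multiplier * median of the finished tasks.
  def isCandidate(elapsedMs: Long,
                  finishedMs: Seq[Long],
                  multiplier: Double = 1.5): Boolean =
    elapsedMs > multiplier * median(finishedMs)
}
```

With finished durations of 900, 1000, 1100, and 1200 ms, the median is 1050 ms, so with the default multiplier a task is flagged only after it has been running for more than 1575 ms.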

Optimizing joins

This topic was covered briefly when discussing Spark SQL, but it is worth revisiting here because joins are a major source of optimization challenges.

There are primarily three types of joins in Spark:

Shuffle hash join (default):
- Classic map-reduce type join
- Shuffle both datasets based on output key
- During reduce, join the datasets for same output key
Broadcast hash join:
- When one dataset is small enough to fit in memory
Cartesian join:
- When every row of one table is joined with every row of the other table

The easiest optimization is that if one of the datasets is small enough to fit in memory, it should be broadcast (broadcast join) to every compute node. This use case is very common as data needs to be combined with side data like a dictionary all the time.

Mostly, joins are slow due to too much data being shuffled over the network.
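
The idea behind the broadcast hash join described above can be sketched with plain Scala collections standing in for distributed datasets. This is a conceptual illustration under stated assumptions, not Spark's implementation: the object name, the method name, and the representation of partitions as nested sequences are all hypothetical, and the small side is assumed to have unique keys, as a dictionary would.

```scala
// Illustrative only: plain Scala collections stand in for distributed
// datasets. All names here are hypothetical.
object BroadcastJoinSketch {
  // Each inner Seq plays the role of one partition of the large dataset.
  // The small side is turned into an in-memory hash map once and reused
  // for every partition, so no rows of the large side are shuffled.
  def broadcastHashJoin[K, A, B](
      largePartitions: Seq[Seq[(K, A)]],
      small: Seq[(K, B)]): Seq[(K, A, B)] = {
    // The "broadcast" copy of the small side (assumes unique keys).
    val lookup: Map[K, B] = small.toMap
    largePartitions.flatMap { partition =>
      // Inner join: keep only rows whose key exists in the small side.
      partition.flatMap { case (k, a) => lookup.get(k).map(b => (k, a, b)) }
    }
  }
}
```

Each partition is joined locally against the same lookup table, which is why this strategy avoids the network shuffle that a shuffle hash join needs. In Spark SQL itself, the broadcast function in org.apache.spark.sql.functions can be used to mark the small side of a join for broadcasting.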

How to do it...

You can check which execution strategy is being used with explain:

scala> mydf.explain
scala> mydf.queryExecution...

Using compression to improve performance

Data compression involves encoding information using fewer bits than the original representation. Compression has an important role to play in big data technologies. It makes both storage and transport of data more efficient.

When data is compressed, it becomes smaller, so both disk I/O and network I/O become faster. It also saves storage space. Every optimization has a cost, and the cost of compression comes in the form of added CPU cycles to compress and decompress data.
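
The trade-off between size and CPU work can be tried out with a small round trip in plain Scala. This sketch uses the gzip classes that ship with the JDK purely because they need no extra libraries; gzip is a different codec from LZO and Snappy, and the object and method names are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Illustrative only: gzip from the JDK stands in for a big data codec.
object CompressionSketch {
  // Compressing costs CPU cycles but yields fewer bytes to store or send.
  def compress(data: Array[Byte]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    try gzip.write(data) finally gzip.close() // close() writes the gzip trailer
    buffer.toByteArray
  }

  // Decompressing costs CPU cycles again before the data can be used.
  def decompress(data: Array[Byte]): Array[Byte] = {
    val gzip = new GZIPInputStream(new ByteArrayInputStream(data))
    try {
      val out = new ByteArrayOutputStream()
      val chunk = new Array[Byte](4096)
      var n = gzip.read(chunk)
      while (n != -1) {
        out.write(chunk, 0, n)
        n = gzip.read(chunk)
      }
      out.toByteArray
    } finally gzip.close()
  }
}
```

Highly repetitive input, such as the same word repeated thousands of times, shrinks considerably, while decompressing it returns exactly the original bytes. How much is saved, and at what CPU cost, depends on the data and on the codec.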

Hadoop needs to split data into blocks, irrespective of whether the data is compressed or not. Only a few compression formats are splittable.

The two most popular compression formats for big data loads are Lempel-Ziv-Oberhumer (LZO) and Snappy. Snappy is not splittable, while LZO is. Snappy, on the other hand, is a much faster format.

If the compression format is splittable like LZO, the input file is first split into blocks and then compressed. Since compression happened...

Using serialization to improve performance

Serialization plays an important part in distributed computing. There are two persistence (storage) levels that support serializing RDDs:

MEMORY_ONLY_SER: This stores RDDs as serialized objects. It will create one byte array per partition.
MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that do not fit in memory to disk.

How to do it...

The following are the steps to add appropriate persistence levels:

Start the Spark shell:

$ spark-shell

Import the members of the StorageLevel object, which enumerates the persistence levels:

scala> import org.apache.spark.storage.StorageLevel._

Create a dataset:

scala> val words = spark.read.textFile("words")

Persist the dataset:

scala> words.persist(MEMORY_ONLY_SER)

Though serialization reduces the memory footprint substantially, it adds extra CPU cycles due to deserialization.
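
The "one byte array per partition" idea, and the deserialization work it implies on every read, can be sketched in plain Scala with Java serialization. This is a conceptual illustration, not Spark's storage code: an Array[String] stands in for the records of one partition, and the object and method names are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Illustrative only: an Array[String] stands in for one partition.
object SerializedPartitionSketch {
  // Serialize all records of the partition into a single byte array.
  def serialize(partition: Array[String]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(partition) finally out.close()
    buffer.toByteArray
  }

  // Reading the records back requires deserializing the whole array,
  // which is where the extra CPU cycles are spent.
  def deserialize(bytes: Array[Byte]): Array[String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[Array[String]] finally in.close()
  }
}
```

Every access to the cached data has to go through a step like deserialize, whereas deserialized storage levels keep the objects ready to use at the price of more memory.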

Note

By default, Spark uses Java's serialization. Since the Java serialization is slow...

Optimizing the level of parallelism

Optimizing the level of parallelism is very important to fully utilize the cluster capacity. When data is read from HDFS, the number of partitions is the same as the number of input splits, which is mostly the same as the number of blocks. The default block size in HDFS is 128 MB, and that works well for Spark too.

In this recipe, we will cover different ways to optimize the number of partitions.

How to do it...

Specify the number of partitions when loading a file into an RDD with the following steps:

Start the Spark shell:

$ spark-shell

Load the RDD with a custom number of partitions as a second parameter:

scala> sc.textFile("hdfs://localhost:9000/user/hduser/words",10)

Another approach is to change the default parallelism by performing the following steps:

Start the Spark shell with the new value of default parallelism:

$ spark-shell --conf spark.default.parallelism=10

Note

Have the number of partitions two to three times the number of cores to...
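
The two rules of thumb in this recipe, one input split per HDFS block and two to three partitions per core, reduce to simple arithmetic. The sketch below is plain Scala with hypothetical names; the split count is an approximation that ignores the details of how Hadoop input formats compute splits.

```scala
// Illustrative only: back-of-the-envelope sizing with hypothetical names.
object PartitionSizingSketch {
  // 128 MB, the default HDFS block size mentioned in this recipe.
  val DefaultBlockSizeBytes: Long = 128L * 1024 * 1024

  // Approximate number of input splits: one per block, rounding up.
  def inputSplits(fileSizeBytes: Long,
                  blockSizeBytes: Long = DefaultBlockSizeBytes): Long =
    if (fileSizeBytes <= 0) 0L
    else (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes

  // Lower and upper bound following the "two to three times the number
  // of cores" guideline.
  def recommendedPartitions(totalCores: Int): (Int, Int) =
    (2 * totalCores, 3 * totalCores)
}
```

For example, a 1 GB file with 128 MB blocks yields about 8 splits, while a cluster with 16 cores would call for roughly 32 to 48 partitions, so in that case it may be worth passing a higher partition count when loading the file.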

Understanding project Tungsten

Project Tungsten, which started with Spark version 1.4, is the initiative to bring Spark closer to bare metal; it has since become a first-class, integral part of Spark. The goal of the project is to substantially improve the memory and CPU efficiency of Spark applications and to push the limits of the underlying hardware.

In distributed systems, conventional wisdom has been to always optimize network I/O, as it has been the scarcest and most bottlenecked resource. This trend has changed in the last few years. Network bandwidth has grown from 1 gigabit per second to 10 gigabits per second over the last 5 years. In fact, Amazon Web Services is poised to make 40 Gbps standard, and there are already instances available at 20 Gbps.

Along similar lines, disk bandwidth has increased from 50 MB/s to 500 MB/s, and solid state drives (SSDs) are being deployed more and more. Pruning unneeded input data and predicate push-down have effectively made the speed gains even larger...
