You're reading from Hands-On Big Data Analytics with PySpark

Product type: Book
Published in: Mar 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781838644130
Edition: 1st
Authors (2):

Rudy Lai

Colibri Digital is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to better make sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails for prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance, key analytics that all feed back into how its AI generates content. Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye. In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn about reinforcement learning and supervised learning in depth and in a commercial setting.

Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List, and received awards such as the Deutsche Bank Artificial Intelligence prize.

Bartłomiej Potaczek

Bartłomiej Potaczek is a software engineer working for Schibsted Tech Polska and programming mostly in JavaScript. He is a big fan of everything related to the React world, functional programming, and data visualization. He founded and created InitLearn, a portal that allows users to learn to program in a pair-programming fashion. He was also involved in the InitLearn frontend, which is built on React-Redux technologies. Besides programming, he enjoys football and CrossFit. Currently, he is working on rewriting the frontend for tv.nu, Sweden's most complete TV guide, with over 200 channels. He has also recently worked on technologies including React, React Router, and Redux.

Avoiding Shuffle and Reducing Operational Expenses

In this chapter, we will learn how to avoid shuffle and reduce the operational expense of our jobs, starting with detecting a shuffle in a process. We will then test operations that cause a shuffle in Apache Spark to find out when we should be very careful and which operations we should avoid. Next, we will learn how to change the design of jobs with wide dependencies. After that, we will use the keyBy() operation to reduce shuffle and, in the last section of this chapter, we'll see how we can use custom partitioning to reduce the shuffle of our data.

In this chapter, we will cover the following topics:

  • Detecting a shuffle in a process
  • Testing operations that cause a shuffle in Apache Spark
  • Changing the design of jobs with wide dependencies
  • Using keyBy() operations to reduce shuffle
  • Using a custom partitioner to reduce shuffle

Detecting a shuffle in a process

In this section, we will learn how to detect a shuffle in a process.

In this section, we will cover the following topics:

  • Loading randomly partitioned data
  • Issuing repartition using a meaningful partition key
  • Understanding how shuffle occurs by explaining a query

We will load randomly partitioned data to see how and where the data is loaded. Next, we will issue a repartition using a meaningful partition key, which sends the data to the proper executors using a deterministic key. In the end, we will explain our queries using the explain() method and understand where the shuffle occurs. Here, we have a very simple test.

We will create a DataFrame with some data. For example, we create an InputRecord with a random UID for user_1, another record with a random ID for user_1, and a last record for user_2. Let's imagine that...
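The effect of a meaningful partition key can be sketched outside Spark. The helper below is a hypothetical, plain-Python stand-in for the hash partitioning that Spark applies during a repartition; it is only an illustration of the key-to-partition mapping, not Spark's actual implementation:

```python
# A plain-Python sketch of hash partitioning: every record with the
# same key is routed to the same partition, which is what makes a
# repartition key "meaningful" for later joins and aggregations.

def hash_partition(key: str, num_partitions: int) -> int:
    # Spark uses its own hash function internally; the builtin hash()
    # here only illustrates the deterministic key -> partition mapping.
    return hash(key) % num_partitions

records = [("user_1", "uid-a"), ("user_1", "uid-b"), ("user_2", "uid-c")]

by_partition = {}
for user_id, uid in records:
    by_partition.setdefault(hash_partition(user_id, 4), []).append(uid)

# Both user_1 records now share a partition, so a later lookup or
# join on user_id touches one partition instead of scanning them all.
print(by_partition)
```

Once the data is partitioned like this, calling explain() on a query grouped or joined by that key shows no extra exchange step, because records with equal keys are already co-located.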

Testing operations that cause a shuffle in Apache Spark

In this section, we will test the operations that cause a shuffle in Apache Spark. We will cover the following topics:

  • Using join for two DataFrames
  • Using two DataFrames that are partitioned differently
  • Testing a join that causes a shuffle

A join is a specific operation that causes a shuffle, and we will use it to join our two DataFrames. We will first check whether it causes a shuffle, and then we will check how to avoid it. To understand this, we will use two DataFrames that are partitioned differently and examine what happens when we join two datasets or DataFrames that are unpartitioned or partitioned randomly. This causes a shuffle, because there is no way to join records that share a partition key while they reside on different physical machines.

Before we join the datasets, we need to send them to the same physical machine...
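Why differently partitioned data forces a shuffle can be illustrated with a small, hypothetical model in plain Python (not Spark code): when the two sides use different partitioning schemes, matching keys generally sit in different partitions, so one side's records must move across the network before the join can run.

```python
# Hypothetical model: side A was hash-partitioned by key, while side B
# was partitioned round-robin (effectively at random), so records with
# the same join key do not line up partition-by-partition.

NUM_PARTITIONS = 4

def hash_partition(key, n=NUM_PARTITIONS):
    return hash(key) % n

keys = [f"user_{i}" for i in range(100)]

side_a = {k: hash_partition(k) for k in keys}                  # by key
side_b = {k: i % NUM_PARTITIONS for i, k in enumerate(keys)}   # round-robin

# Every key whose partitions differ must cross the network (a shuffle)
# before the matching records can be joined on one machine.
must_move = [k for k in keys if side_a[k] != side_b[k]]
print(f"{len(must_move)} of {len(keys)} keys need to cross the network")
```

In real Spark, this movement is the Exchange step that shows up in the physical plan when you call explain() on such a join.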

Changing the design of jobs with wide dependencies

In this section, we will change the job that was performing the join on non-partitioned data. We'll be changing the design of jobs with wide dependencies.

In this section, we will cover the following topics:

  • Repartitioning DataFrames using a common partition key
  • Understanding a join with pre-partitioned data
  • Understanding that we avoided shuffle

We will use the repartition method on the DataFrame with a common partition key. We saw that when a join is issued, repartitioning happens underneath. But often, when using Spark, we want to execute multiple operations on the same DataFrame, so every time we join it with another dataset, hashPartitioning has to be executed once again. If we partition the data once, at the point where it is loaded, we avoid repartitioning again later.
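The saving can be sketched with a small, hypothetical model in plain Python (illustrative only, not Spark code): once both sides are partitioned up front with the same partitioner and the same common key, every join key already lines up, so the join itself moves nothing.

```python
# If both DataFrames are repartitioned at load time with the same
# partitioner and the same key, records that will join already live
# in matching partitions, so the join triggers no further shuffle.

NUM_PARTITIONS = 4

def hash_partition(key, n=NUM_PARTITIONS):
    return hash(key) % n

keys = [f"user_{i}" for i in range(100)]

# Both sides pre-partitioned by the common key at the beginning.
side_a = {k: hash_partition(k) for k in keys}
side_b = {k: hash_partition(k) for k in keys}

# No key changes partition between the two sides: nothing to move.
must_move = [k for k in keys if side_a[k] != side_b[k]]
assert must_move == []
```

This is the design change the section is about: pay the partitioning cost once when loading, instead of on every wide operation that follows.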

Here, we have our example test case,...

Using keyBy() operations to reduce shuffle

In this section, we will use keyBy() operations to reduce shuffle. We will cover the following topics:

  • Loading randomly partitioned data
  • Trying to pre-partition data in a meaningful way
  • Leveraging the keyBy() function

We will load randomly partitioned data, but this time using the RDD API. We will repartition the data in a meaningful way and examine what is going on underneath, similar to the DataFrame and Dataset APIs. We will learn how to leverage the keyBy() function to give our data some structure and to cause pre-partitioning in the RDD API.
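What keyBy() contributes can be sketched in plain Python (a hypothetical stand-in mirroring the shape of RDD.keyBy, not the Spark API itself): it turns each record into a (key, record) pair, and it is that key the partitioner then routes on.

```python
# keyBy-style transformation: derive a key from each record so the
# partitioner has something deterministic to route on.

def key_by(records, key_func):
    # Mirrors the shape of RDD.keyBy: each element x becomes (f(x), x).
    return [(key_func(r), r) for r in records]

records = [
    {"user_id": "user_1", "amount": 100},
    {"user_id": "user_1", "amount": 300},
    {"user_id": "user_2", "amount": 200},
]

keyed = key_by(records, lambda r: r["user_id"])

# With a key in front of each record, hash partitioning groups all of
# user_1's records together: exactly the pre-partitioning we want.
partitions = {}
for key, record in keyed:
    partitions.setdefault(hash(key) % 2, []).append(record)
```

Without the key, the RDD is just a bag of opaque records and Spark has nothing meaningful to partition by; keyBy() is what gives the data that structure.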

Here is the test we will be using in this section. We are creating three input records: the first has a random user ID for user_1, the second has a random user ID for user_1, and the third has a random user ID for user_2:

test("Should use keyBy to distribute...

Using a custom partitioner to reduce shuffle

In this section, we will use a custom partitioner to reduce shuffle. We will cover the following topics:

  • Implementing a custom partitioner
  • Using the partitioner with the partitionBy method on Spark
  • Validating that our data was partitioned properly

We will implement a custom partitioner with our own logic, which will partition the data and inform Spark where each record should land and on which executor. We will use the partitionBy method in Spark. In the end, we will validate that our data was partitioned properly. For the purposes of this test, we assume that we have two executors:

import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.FunSuite
import org.scalatest.Matchers._

class CustomPartitioner extends FunSuite {
val...
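The logic such a partitioner encodes can be sketched in plain Python. The class below is hypothetical; it only mirrors the numPartitions/getPartition shape of Spark's Partitioner contract so the routing rule and its validation are easy to see:

```python
# A hypothetical custom partitioner for two executors: it pins every
# record for a given user to one of two partitions, mirroring the
# numPartitions / getPartition contract of Spark's Partitioner class.

class TwoExecutorPartitioner:
    num_partitions = 2

    def get_partition(self, key: str) -> int:
        # Custom logic: user_1 always lands on partition 0,
        # everything else on partition 1.
        return 0 if key == "user_1" else 1

partitioner = TwoExecutorPartitioner()
transactions = [("user_1", 100), ("user_2", 200), ("user_1", 300)]

placed = {0: [], 1: []}
for user_id, amount in transactions:
    placed[partitioner.get_partition(user_id)].append((user_id, amount))

# Validation step: all of user_1's transactions share one partition,
# so later per-user operations need no shuffle.
assert placed[0] == [("user_1", 100), ("user_1", 300)]
assert placed[1] == [("user_2", 200)]
```

Because the routing is deterministic, validating the placement is just a matter of checking that every record landed in the partition the rule prescribes, which is what the test in this section does against real executors.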

Summary

In this chapter, we learned how to detect a shuffle in a process and covered testing the operations that cause a shuffle in Apache Spark. We also learned how to employ partitioning with the RDD API. Because RDDs are still widely used, it is important to know how to pre-partition data there, which is why we used the keyBy() operation to reduce shuffle. Finally, we learned how to use a custom partitioner to reduce shuffle.

In the next chapter, we'll learn how to save data in the correct format using the Spark API.

