You're reading from Hands-On Big Data Analytics with PySpark
1st Edition | Packt, March 2019 | ISBN-13: 9781838644130 | Reading level: Expert
Authors (2):

Rudy Lai

Colibri Digital is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to better make sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails for prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance, key analytics that all feed back into how the AI generates content. Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye. In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn about reinforcement learning and supervised learning in depth and in a commercial setting. Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List and received awards such as the Deutsche Bank Artificial Intelligence prize.

Bartłomiej Potaczek

Bartłomiej Potaczek is a software engineer working for Schibsted Tech Polska, programming mostly in JavaScript. He is a big fan of everything related to the React world, functional programming, and data visualization. He founded and created InitLearn, a portal that allows users to learn to program in a pair-programming fashion. He was also involved in building the InitLearn frontend, which is based on React-Redux technologies. Besides programming, he enjoys football and CrossFit. Currently, he is working on rewriting the frontend for tv.nu, Sweden's most complete TV guide, with over 200 channels. He has also recently worked with technologies including React, React Router, and Redux.


Immutable Design

In this chapter, we will look at the immutable design of Apache Spark. We will delve into the Spark RDD's parent/child chain and use RDDs in an immutable way. We will then use DataFrame operations to transform data and discuss immutability in a highly concurrent environment. By the end of this chapter, we will have used the Dataset API in an immutable way.

In this chapter, we will cover the following topics:

  • Delving into the Spark RDD's parent/child chain
  • Using RDD in an immutable way
  • Using DataFrame operations to transform
  • Immutability in a highly concurrent environment
  • Using the Dataset API in an immutable way

Delving into the Spark RDD's parent/child chain

In this section, we will implement our own RDD that inherits properties from its parent RDD.

We will go through the following topics:

  • Extending an RDD
  • Chaining a new RDD with the parent
  • Testing our custom RDD

Extending an RDD

This is a simple test, but it hides a lot of complexity. Let's start by creating a list of records, as shown in the following code block:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class InheritanceRdd extends FunSuite {
  val spark: SparkContext = SparkSession
    .builder().master("local[2]").getOrCreate().sparkContext

  test("use extended RDD") {
    //given
    val rdd = spark.makeRDD(List(Record(1, "d1")))

The Record is just a case class that...
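The listing above stops short, but the chaining idea can be sketched on its own. The following is a minimal, hypothetical example, not the book's exact code: the MultipliedRDD name and the amount and description fields on Record are assumptions. A custom RDD declares its parent, reuses the parent's partitioning scheme, and lazily derives every value from the parent's iterator.

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, TaskContext}

// Hypothetical record type: a numeric amount plus a label, matching Record(1, "d1") above.
case class Record(amount: Int, description: String)

// A child RDD chained onto its parent: one-to-one dependency, same partitioning.
class MultipliedRDD(parent: RDD[Record], multiplier: Double)
  extends RDD[Double](parent) {

  // Derive each partition lazily from the corresponding parent partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Double] =
    firstParent[Record].iterator(split, context).map(_.amount * multiplier)

  // Reuse exactly the parent's partitioning scheme.
  override protected def getPartitions: Array[Partition] =
    firstParent[Record].partitions
}

Because compute only wraps the parent's iterator, nothing is materialized until an action is called on the child, and the parent RDD itself is never modified.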

Using RDD in an immutable way

Now that we know how to create a chain of execution using RDD inheritance, let's learn how to use RDD in an immutable way.

In this section, we will go through the following topics:

  • Understanding DAG immutability
  • Creating two leaves from the one root RDD
  • Examining results from both leaves

Let's first understand directed acyclic graph (DAG) immutability and what it gives us. We will then create two leaves from one root RDD and check whether both leaves behave completely independently when we apply a transformation to one of them. Finally, we will examine the results from both leaves and verify that no transformation on either leaf changes or impacts the root RDD. It is imperative to work like this; otherwise, we would not be able to create yet another leaf from the root RDD, because the root RDD would...
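As a rough sketch of what such a test might look like (the class name, input values, and assertions are illustrative assumptions, not the book's exact listing), we can build one root RDD, derive two leaves with independent map transformations, and check that neither leaf affects the other or the root:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class ImmutableRDDTest extends FunSuite {
  val spark: SparkContext = SparkSession
    .builder().master("local[2]").getOrCreate().sparkContext

  test("transformations on leaves should not mutate the root RDD") {
    //given: one root RDD
    val root = spark.makeRDD(List(1, 2, 3, 4, 5))

    //when: two leaves are derived from the same root
    val leafA = root.map(_ * 10)
    val leafB = root.map(_ + 1)

    //then: each leaf is independent and the root is unchanged
    assert(leafA.collect().toList == List(10, 20, 30, 40, 50))
    assert(leafB.collect().toList == List(2, 3, 4, 5, 6))
    assert(root.collect().toList == List(1, 2, 3, 4, 5))
  }
}

Each map call returns a brand-new RDD that only records its dependency on root in the DAG; collecting a leaf triggers computation without ever rewriting the root's data.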

Using DataFrame operations to transform

The DataFrame API has an RDD underneath it, so there is no way that a DataFrame could be mutable. With DataFrames, immutability is even more convenient, because we can add and remove columns dynamically without changing the source dataset.

In this section, we will cover the following topics:

  • Understanding DataFrame immutability
  • Creating two leaves from the one root DataFrame
  • Adding a new column by issuing a transformation

We will start by using DataFrame operations to transform our data. First, we need to understand DataFrame immutability, and then we will create two leaves, this time from one root DataFrame. We will then issue a transformation that is a bit different from the RDD case: it will add a new column to the resulting DataFrame, because that is how we manipulate data in the DataFrame API. If we want to map data,...
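A minimal sketch of this idea, with illustrative column names and values rather than the book's exact listing, might look as follows; withColumn returns a new DataFrame, so neither leaf touches the root:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.scalatest.FunSuite

class ImmutableDataFrameTest extends FunSuite {
  val spark: SparkSession = SparkSession
    .builder().master("local[2]").getOrCreate()

  test("adding a column should not mutate the root DataFrame") {
    import spark.implicits._
    //given: one root DataFrame with two columns
    val root = List(("a", "1"), ("b", "2")).toDF("userId", "data")

    //when: two leaves are derived, each adding its own column
    val leafA = root.withColumn("tag", lit("A"))
    val leafB = root.withColumn("tag", lit("B"))

    //then: the root still has only its original columns
    assert(root.columns.toList == List("userId", "data"))
    assert(leafA.columns.contains("tag"))
    assert(leafB.columns.contains("tag"))
  }
}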

Immutability in a highly concurrent environment

We have seen how immutability affects the creation and design of programs; now, we will look at how it is useful.

In this section, we will cover the following topics:

  • The cons of mutable collections
  • Creating two threads that simultaneously modify a mutable collection
  • Reasoning about a concurrent program

Let's first understand the cons of mutable collections. To do this, we will create two threads that simultaneously modify a mutable collection. We will use the following code for our test. First, we will create a ListBuffer, which is a mutable list, so we can add and delete elements without creating a new list for every modification. We will then create an executor service with two threads; we need two threads that start simultaneously in order to modify the shared state. Later, we will use a CountDownLatch construct from java.util.concurrent...
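The following is a sketch of such a test under the stated assumptions: two threads share one ListBuffer and are released at the same moment by a CountDownLatch; the class name and values are illustrative. Because ListBuffer is not thread-safe, the outcome is non-deterministic, and appends can be lost or reordered.

import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable.ListBuffer
import org.scalatest.FunSuite

class MutableCollectionRaceTest extends FunSuite {

  test("two threads mutating a shared ListBuffer race with each other") {
    //given: a mutable, non-thread-safe collection shared by both threads
    val listMutable = new ListBuffer[String]()
    val executors = Executors.newFixedThreadPool(2)
    val latch = new CountDownLatch(2)

    //when: both threads wait on the latch and then append concurrently
    executors.submit(new Runnable {
      override def run(): Unit = {
        latch.countDown()
        latch.await()
        listMutable += "A"
      }
    })
    executors.submit(new Runnable {
      override def run(): Unit = {
        latch.countDown()
        latch.await()
        listMutable += "B"
      }
    })
    executors.shutdown()
    executors.awaitTermination(5, TimeUnit.SECONDS)

    //then: a strict assertion such as size == 2 would be flaky, because an
    //append can be lost when both threads race on the unsynchronized buffer
    assert(listMutable.size <= 2)
  }
}

Reasoning about such a program requires thinking about every possible interleaving of the two threads, which is exactly the burden that Spark's immutable abstractions remove.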

Using the Dataset API in an immutable way

In this section, we will use the Dataset API in an immutable way. We will cover the following topics:

  • Dataset immutability
  • Creating two leaves from the one root dataset
  • Adding a new column by issuing a transformation

The test case for the Dataset is quite similar, but we need to call toDS() on our data for it to be type safe. The element type of the Dataset is UserData, as shown in the following example:

import com.tomekl007.UserData
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class ImmutableDataSet extends FunSuite {
  val spark: SparkSession = SparkSession
    .builder().master("local[2]").getOrCreate()

  test("Should use immutable DF API") {
    import spark.sqlContext.implicits._
    //given
    val userData =
      spark.sparkContext.makeRDD(List(
        UserData("a", "1"),
        UserData("b", "2"),
        UserData...
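The listing above is truncated. As a self-contained sketch of how such a test might continue (the third record, the extra column, and the assertions are illustrative assumptions, and UserData here is a local stand-in for com.tomekl007.UserData), we convert the RDD to a typed Dataset with toDS() and derive leaves without touching the root:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.scalatest.FunSuite

// Hypothetical stand-in for com.tomekl007.UserData.
case class UserData(userId: String, data: String)

class ImmutableDataSetSketch extends FunSuite {
  val spark: SparkSession = SparkSession
    .builder().master("local[2]").getOrCreate()

  test("Should use immutable Dataset API") {
    import spark.sqlContext.implicits._
    //given: a typed Dataset built from an RDD of UserData
    val userData = spark.sparkContext.makeRDD(List(
      UserData("a", "1"),
      UserData("b", "2"),
      UserData("d", "200")
    )).toDS()

    //when: two leaves are derived from the same root Dataset
    val leafA = userData.withColumn("extra", lit(1))
    val leafB = userData.filter(_.userId == "a")

    //then: the root Dataset keeps its original schema and row count
    assert(userData.columns.toList == List("userId", "data"))
    assert(userData.count() == 3)
    assert(leafB.count() == 1)
  }
}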

Summary

In this chapter, we delved into the Spark RDD parent/child chain and created a multiplier RDD that was able to calculate everything based on its parent RDD, as well as on the parent's partitioning scheme. We used RDDs in an immutable way and saw that modifying a leaf created from the parent did not modify the parent itself. We also learned about a better abstraction, the DataFrame, and saw that we can apply transformations there as well; however, every transformation just adds another column; nothing is modified in place. Next, we examined immutability in a highly concurrent environment and saw how mutable state causes problems when it is accessed from multiple threads. Finally, we saw that the Dataset API is also designed in an immutable way and that we can leverage the same techniques there.

In the next chapter, we'll look at how to...
