Testing Apache Spark Jobs

In this chapter, we will test Apache Spark jobs and learn how to separate logic from the Spark engine.

We will first cover unit testing of our code, which will then be reused by an integration test built on SparkSession. Later, we will mock data sources using partial functions, and then learn how to leverage ScalaCheck for property-based testing of both built-in and custom types in Scala. By the end of this chapter, we will have run our tests against different versions of Spark.

In this chapter, we will be covering the following topics:

  • Separating logic from the Spark engine - unit testing
  • Integration testing using SparkSession
  • Mocking data sources using partial functions
  • Using ScalaCheck for property-based testing
  • Testing in different versions of Spark

Separating logic from the Spark engine - unit testing

Let's start by separating logic from the Spark engine.

In this section, we will cover the following topics:

  • Creating a component with logic
  • Unit testing of that component
  • Using a case class from the model package for our domain logic

Let's look at the logic first and then the simple test.

So, we have a BonusVerifier object that has only one method, qualifyForBonus, which takes our UserTransaction model class. According to our logic in the following code, we load user transactions and filter all users that are qualified for a bonus. If we tested this through Spark, we would first need to create a SparkSession, create data for mocking an RDD or DataFrame, and then exercise the whole Spark API. Since what we want to verify here is pure logic, we will test it in isolation instead. The logic is as follows:

package com.tomekl007.chapter_6...
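A minimal sketch of such a component, together with a plain unit test for it, might look as follows. The UserTransaction fields and the bonus threshold are illustrative assumptions, not the book's exact rule:

 import org.scalatest.FunSuite

 // Assumed model class; the actual fields may differ in the book's code.
 case class UserTransaction(userId: String, amount: Int)

 // Pure domain logic: no SparkSession and no RDDs, so it can be unit tested directly.
 object BonusVerifier {
   private val Threshold = 100 // illustrative qualification rule

   def qualifyForBonus(transaction: UserTransaction): Boolean =
     transaction.amount >= Threshold
 }

 class BonusVerifierTest extends FunSuite {
   test("should qualify users whose transaction amount meets the threshold") {
     assert(BonusVerifier.qualifyForBonus(UserTransaction("a", 150)))
     assert(!BonusVerifier.qualifyForBonus(UserTransaction("b", 50)))
   }
 }

Because BonusVerifier never touches the Spark API, this test runs in milliseconds and needs no cluster or session setup.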

Integration testing using SparkSession

Let's now learn about integration testing using SparkSession.

In this section, we will cover the following topics:

  • Leveraging SparkSession for integration testing
  • Using a unit tested component

Here, we are creating the Spark engine. The following line is crucial for the integration test:

 val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

This is not just a simple line that creates a lightweight object. SparkSession is a really heavy object, and constructing it from scratch is an expensive operation in terms of both resources and time. Tests that create a SparkSession will therefore take more time than the unit tests from the previous section.

For the same reason, we should use unit tests often to cover all edge cases, and use integration testing only for the smaller part of...
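As a rough sketch, such an integration test could reuse the unit-tested BonusVerifier from the previous section and only exercise the RDD plumbing around it; the details below are illustrative rather than the book's exact listing:

 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SparkSession
 import org.scalatest.FunSuite

 class BonusVerifierIntegrationTest extends FunSuite {
   // Heavyweight: constructing the SparkSession dominates the test's runtime.
   val spark: SparkContext =
     SparkSession.builder().master("local[2]").getOrCreate().sparkContext

   test("should filter qualifying transactions on an RDD") {
     // given
     val transactions = spark.makeRDD(
       List(UserTransaction("a", 150), UserTransaction("b", 50)))

     // when: the already unit-tested logic is plugged into Spark's filter
     val qualified = transactions.filter(BonusVerifier.qualifyForBonus).collect().toList

     // then
     assert(qualified == List(UserTransaction("a", 150)))
   }
 }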

Mocking data sources using partial functions

In this section, we will cover the following topics:

  • Creating a Spark component that reads data from Hive
  • Mocking the component
  • Testing the mock component

Let's assume that the following code is our production code:

 ignore("loading data on prod from hive") {
UserDataLogic.loadAndGetAmount(spark, HiveDataLoader.loadUserTransactions)
}

Here, we are using the UserDataLogic.loadAndGetAmount function, which loads our user transaction data and gets the amount of each transaction. This method takes two arguments: the first is a SparkSession, and the second is a provider, that is, a function that takes a SparkSession and returns a DataFrame, as shown in the following example:

 object UserDataLogic {
   def loadAndGetAmount(sparkSession: SparkSession, provider: SparkSession => DataFrame): DataFrame = {
...
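Because the data source is just a function parameter, a test can substitute an in-memory provider for the Hive loader. A minimal sketch, assuming that loadAndGetAmount keeps the amount column of the provided DataFrame:

 import org.apache.spark.sql.{DataFrame, SparkSession}
 import org.scalatest.FunSuite

 class MockingDataSourcesTest extends FunSuite {
   val spark: SparkSession =
     SparkSession.builder().master("local[2]").getOrCreate()

   test("should compute the amount without touching Hive") {
     // given: a provider that builds the DataFrame in memory instead of reading Hive
     val mockProvider: SparkSession => DataFrame = session => {
       import session.implicits._
       List(UserTransaction("a", 100), UserTransaction("b", 200)).toDF()
     }

     // when
     val result = UserDataLogic.loadAndGetAmount(spark, mockProvider)

     // then: assumes the logic exposes an amount column
     assert(result.collect().map(_.getAs[Int]("amount")).sorted.toList == List(100, 200))
   }
 }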

Using ScalaCheck for property-based testing

In this section, we will cover the following topics:

  • Property-based testing
  • Creating a property-based test

Let's look at a simple property-based test. Before we can define any properties, we need to add a dependency on the ScalaCheck library, which provides property-based testing, and import it.

In the previous sections, every test extended FunSuite. Those were example-based tests, so we had to provide the input arguments explicitly. In this example, we're extending Properties from the ScalaCheck library and testing a StringType, as follows:

object PropertyBasedTesting extends Properties("StringType")

ScalaCheck will generate random strings for us. If we create a property-based test for a custom type, however, that type is not known to ScalaCheck, so we need to provide a generator that will generate instances of that...
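A minimal sketch of both cases follows: a property over ScalaCheck's built-in String generator, and a hand-written Gen for the UserTransaction type assumed earlier in this chapter. The monotonicity property is an illustrative example, not the book's exact test:

 import org.scalacheck.{Gen, Prop, Properties}

 object PropertyBasedTesting extends Properties("StringType") {
   // ScalaCheck generates random strings for this property automatically.
   property("concatenation preserves total length") =
     Prop.forAll { (a: String, b: String) =>
       (a + b).length == a.length + b.length
     }

   // For a custom type, we must tell ScalaCheck how to build instances.
   val transactionGen: Gen[UserTransaction] = for {
     userId <- Gen.alphaStr
     amount <- Gen.choose(0, 1000)
   } yield UserTransaction(userId, amount)

   // Illustrative property: raising the amount never disqualifies a user.
   property("qualification is monotonic in the amount") =
     Prop.forAll(transactionGen) { t =>
       !BonusVerifier.qualifyForBonus(t) ||
         BonusVerifier.qualifyForBonus(t.copy(amount = t.amount + 1))
     }
 }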

Testing in different versions of Spark

In this section, we will cover the following topics:

  • Changing the component to work with Spark pre-2.x
  • Mock testing pre-2.x
  • RDD mock testing

Let's start with the mocked data sources from the third section of this chapter, Mocking data sources using partial functions.

There, we were testing UserDataLogic.loadAndGetAmount, and everything operated on DataFrames, so we had a SparkSession and a DataFrame.

Now, let's compare this to Spark pre-2.x, where we are unable to use DataFrames. Let's assume that the following example shows our logic adapted to those earlier versions of Spark:

test("mock loading data from hive"){
//given
import spark.sqlContext.implicits._
val df = spark.sparkContext
.makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
.toDF()
.rdd...
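In other words, the pre-2.x version of the component has to accept an RDD provider rather than a DataFrame provider. A sketch of what that signature change might look like, with the Row column access assumed for illustration:

 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Row

 object UserDataLogicPre2 {
   // Pre-2.x variant: the provider yields an RDD[Row] instead of a DataFrame,
   // so the logic extracts the amount column from each Row by hand.
   def loadAndGetAmount(sc: SparkContext, provider: SparkContext => RDD[Row]): RDD[Int] =
     provider(sc).map(_.getAs[Int]("amount"))
 }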

Summary

In this chapter, we first learned how to separate logic from the Spark engine. We then looked at a component that was well tested in isolation from the Spark engine, and carried out integration testing using SparkSession. For this, we created a SparkSession test that reused the component that was already well tested. By doing that, we did not have to cover all edge cases in the integration test, and our tests were much faster. We then learned how to leverage partial functions to supply mocked data at the testing phase. We also covered ScalaCheck for property-based testing. By the end of this chapter, we had tested our code against different versions of Spark and learned how to change our DataFrame mock test to an RDD-based one.

In the next chapter, we will learn how to leverage the Spark GraphX API.
