
You're reading from Data Engineering with Scala and Spark

Product type: Book
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781804612583
Edition: 1st Edition
Authors (3):

Eric Tome

Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in mathematics and currently works as a senior solutions architect at Databricks, helping customers solve their data and AI challenges.

Rupam Bhattacharjee

Rupam Bhattacharjee works as a lead data engineer at IBM. He has architected and developed data pipelines, processing massive structured and unstructured data using Spark and Scala for on-premises Hadoop and K8s clusters on the public cloud. He has a degree in electrical engineering.

David Radford

David Radford has worked in big data for over 10 years, with a focus on cloud technologies. He led consulting teams for several years, completing a migration from legacy systems to modern data stacks. He holds a master's degree in computer science and works as a senior solutions architect at Databricks.

Data Profiling and Data Quality

As we work with multiple sources of data, it is quite easy for some bad data to pass through if there are no checks in place. This can lead to serious issues in downstream systems that rely on the accuracy of upstream data to build models, run business-critical applications, and so on. To make our data pipelines resilient, it is imperative that we have data quality checks in place to ensure the data being processed meets the requirements imposed by both the business and downstream applications.

Six primary data quality dimensions can be measured individually and used to improve overall data quality:

  • Completeness: Does your customer dataset that you plan to use for an upcoming marketing campaign have all of the required attributes filled in?
  • Accuracy: Are the email addresses and phone numbers accurate for your customer records?
  • Consistency: Is customer data consistent across systems?
  • Validity: Do your customer records have valid...

Technical requirements

If you have not done so already, we recommend you follow the steps covered in Chapter 2 to install VS Code or IntelliJ along with the required plugins to run Scala programs. You also need to add the following dependency to work with Deequ:

libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.4-spark-3.3"

Though you can use data of your choice as you follow along, we will be using the tables we loaded into the MySQL database covered in Chapter 4. If you have not done so already, you may want to review the steps covered there.

Understanding components of Deequ

Deequ provides a lot of features to make data quality checks easy. The following diagram shows the major components:

Figure 7.1 – Components of Deequ

We can observe the following components:

  • Metrics computation: Deequ calculates data quality metrics such as completeness, maximum, and so on. You can directly access the raw metrics computed on the data.
  • Constraint verification: Given a set of data quality constraints that you define, Deequ derives the metrics it needs to compute on the data in order to verify those constraints.
  • Constraint suggestion: You can use Deequ’s automated constraint suggestion methods to infer useful constraints, or you can define your own customized data quality constraints.

In the background, Deequ uses Apache Spark for metrics computation, and thus it is fast and efficient. In the upcoming sections, we are going to cover these features...

Performing data analysis

Deequ can generate statistics, called metrics, on data. For example, we can use Deequ to provide us with the number of records in a dataset, tell us whether a particular column is unique, give us the degree of correlation between two columns, and so on. Deequ offers this functionality through case classes such as ApproxCountDistinct, Completeness, Correlation, and so on, defined in the com.amazon.deequ.analyzers package. For a complete list of metrics along with their definitions, please refer to https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/.

In the following example, we will be using the flight data that we loaded into a MySQL table named flights. We analyze the flights data to compute the record count, check whether the airline column contains any NULL values, obtain an approximate distinct count of origin_airport, and so on. The result set is then converted into a dataframe and printed on the screen.
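The following is a minimal sketch of that analysis. It reuses the Spark.initSparkSession helper from the book’s utils package, and the JDBC options shown are placeholders standing in for the database interface built in Chapter 4:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Size}
import org.apache.spark.sql.{DataFrame, SparkSession}
object AnalyzeFlights extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  // Placeholder JDBC read of the flights table loaded in Chapter 4.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // Compute the record count, the completeness of the airline column,
  // and an approximate distinct count of origin_airport.
  val analysisResult = AnalysisRunner
    .onData(flightsDF)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("airline"))
    .addAnalyzer(ApproxCountDistinct("origin_airport"))
    .run()
  // Convert the computed metrics into a dataframe and print it on the screen.
  AnalyzerContext
    .successMetricsAsDataFrame(session, analysisResult)
    .show(truncate = false)
}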

Leveraging automatic constraint suggestion

Deequ provides a powerful feature whereby it can analyze the data and suggest constraints that can be applied as checks. To see how it works, we will use the flights data once again. In Chapter 4, we defined an interface for working with databases, which we will use here to create a dataframe. We will then pass the dataframe to ConstraintSuggestionRunner so that Deequ can suggest constraints.

Here is the complete code for it:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.suggestions.ConstraintSuggestionResult
import com.amazon.deequ.suggestions.ConstraintSuggestionRunner
import com.amazon.deequ.suggestions.Rules
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
object ConstraintSuggestion extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  val...
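  // A hedged continuation sketch: the dataframe is built with a placeholder
  // JDBC read standing in for the Chapter 4 database interface, and Deequ's
  // default rule set drives the suggestions.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // Run the constraint suggestion engine against the flights dataframe.
  val suggestionResult: ConstraintSuggestionResult = ConstraintSuggestionRunner()
    .onData(flightsDF)
    .addConstraintRules(Rules.DEFAULT)
    .run()
  // Print each suggested constraint along with the column it applies to and
  // the Scala code Deequ generates for it.
  suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
    suggestions.foreach { suggestion =>
      println(s"Constraint suggestion for '$column': ${suggestion.description}")
      println(s"Corresponding code: ${suggestion.codeForConstraint}")
    }
  }
}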

Defining constraints

In the previous sections, we looked at how Deequ can gather various metrics on the data as well as automatically suggest constraints. We will now define the actual constraints that we expect the dataframe to pass. Specifically, we expect the flights data to satisfy the following constraints:

  • The airline column should not contain any NULL values
  • The flight_number column should not contain any NULL values
  • The cancelled column should contain only 0 or 1
  • The distance column should not contain any negative value
  • The cancellation_reason column should contain only A, B, C, or D

If all of the checks pass, we print data looks good on the console; otherwise, we print each failing constraint along with its result status.

Here is the code for it.

As a first step, we will create a dataframe using the flights table we loaded in MySQL:

  val session = Spark.initSparkSession("de-with-scala...
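The remaining steps are sketched below in a self-contained listing. The JDBC options are placeholders for the Chapter 4 database interface, and the five constraints listed earlier are expressed with Deequ’s Check API:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus
import org.apache.spark.sql.{DataFrame, SparkSession}
object DefineConstraints extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  // Placeholder JDBC read of the flights table loaded in Chapter 4.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // The five constraints described above, expressed as a single check.
  val flightChecks = Check(CheckLevel.Error, "flights data checks")
    .isComplete("airline")                        // no NULLs in airline
    .isComplete("flight_number")                  // no NULLs in flight_number
    .isContainedIn("cancelled", Array("0", "1"))  // only 0 or 1
    .isNonNegative("distance")                    // no negative distances
    .isContainedIn("cancellation_reason", Array("A", "B", "C", "D"))
  val verificationResult: VerificationResult = VerificationSuite()
    .onData(flightsDF)
    .addCheck(flightChecks)
    .run()
  if (verificationResult.status == CheckStatus.Success) {
    println("data looks good")
  } else {
    // Print every failing constraint along with its result status.
    verificationResult.checkResults
      .flatMap { case (_, checkResult) => checkResult.constraintResults }
      .filter(_.status != ConstraintStatus.Success)
      .foreach(result => println(s"${result.constraint}: ${result.status}"))
  }
}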

Storing metrics using MetricsRepository

Deequ allows us to store the metrics we calculate on a dataframe using MetricsRepository. Deequ provides facilities to create both in-memory and file-based repositories. File-based repositories support local filesystems, Simple Storage Service (S3), and Hadoop Distributed File System (HDFS). Persisting data quality metrics allows us to analyze trends and spot any volatility in the data.

Creating an in-memory repository is simple, as the next example shows:

val inMemoryRepo = new InMemoryMetricsRepository()

Example 7.6

Similarly, we can create a file-based repository as follows:

val fileRepo = FileSystemMetricsRepository(sparkSession, filePath)

Example 7.7

The metrics for each run are stored using a key of type ResultKey. ResultKey is defined as a case class with the following signature:

case class ResultKey(dataSetDate: Long, tags: Map[String, String] = Map.empty)

Example 7.8

Here is an example key of...
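For illustration, the sketch below builds such a key from the current timestamp and a couple of hypothetical tags, then attaches an in-memory repository to an analysis run so that the computed metrics are stored under that key (flightsDF stands for the dataframe used in the earlier examples):

import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.{Completeness, Size}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val repository = new InMemoryMetricsRepository()
// Key the metrics by run timestamp plus free-form tags.
val resultKey = ResultKey(
  System.currentTimeMillis(),
  Map("source" -> "flights", "environment" -> "dev")
)
// Compute metrics on flightsDF and persist them in the repository under the key.
AnalysisRunner
  .onData(flightsDF)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("airline"))
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .run()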

Detecting anomalies

Deequ supports anomaly detection in data by using metrics stored in MetricsRepository, which we covered in the previous section. For example, we can create a rule to check whether the number of records has increased by more than 50% compared to the previous run. If it has, the check will fail.

To show how it works, we will use a fictitious scenario where we receive a batch of products to be added to the inventory each day. We want to check whether the number of products we receive on any given day has increased by more than 50% compared to the previous run. For this example, we will use an in-memory repository to store the metrics. As we have done earlier, let’s define the dataframes we will use in this example:

  val session = Spark.initSparkSession("de-with-scala")
  import session.implicits._
  val yesterdayDF = Seq((1, "Product 1", 100), (2, "Product 2", 50)).toDF(
  "product_id",...

Summary

We began this chapter by outlining why it is imperative to have data quality checks in place for any data pipeline. We then introduced the Deequ library developed by Amazon and its various components. Deequ uses Spark at its core, thereby leveraging the distributed processing that comes with it. We then took a deep dive into the various functionalities offered by Deequ, such as the automatic suggestion of constraints, defining constraints, metrics repositories, and so on.

In the next chapter, we are going to look at code health and maintainability, along with test-driven development (TDD), which is vital for a scalable and easily maintainable code base.
