
You're reading from Data Engineering with Scala and Spark

Product type: Book
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781804612583
Edition: 1st Edition
Authors (3):

Eric Tome

Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in mathematics and currently works as a senior solutions architect at Databricks, helping customers solve their data and AI challenges.

Rupam Bhattacharjee

Rupam Bhattacharjee works as a lead data engineer at IBM. He has architected and developed data pipelines, processing massive structured and unstructured data using Spark and Scala for on-premises Hadoop and K8s clusters on the public cloud. He has a degree in electrical engineering.

David Radford

David Radford has worked in big data for over 10 years, with a focus on cloud technologies. He led consulting teams for several years, completing a migration from legacy systems to modern data stacks. He holds a master's degree in computer science and works as a senior solutions architect at Databricks.

Data Profiling and Data Quality

As we work with multiple sources of data, it is quite easy for some bad data to pass through if there are no checks in place. This can lead to serious issues in downstream systems that rely on the accuracy of upstream data to build models, run business-critical applications, and so on. To make our data pipelines resilient, it is imperative that we have data quality checks in place to ensure the data being processed meets the requirements imposed by both the business and downstream applications.

Six primary data quality dimensions can be measured individually and used to improve overall data quality:

  • Completeness: Does your customer dataset that you plan to use for an upcoming marketing campaign have all of the required attributes filled in?
  • Accuracy: Are the email addresses and phone numbers accurate for your customer records?
  • Consistency: Is customer data consistent across systems?
  • Validity: Do your customer records have valid...

Technical requirements

If you have not done so already, we recommend you follow the steps covered in Chapter 2 to install VS Code or IntelliJ along with the required plugins to run Scala programs. You also need to add the following dependency to work with Deequ:

libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.4-spark-3.3"

Though you can use data of your choice as you follow along, we will be using the tables we loaded into the MySQL database covered in Chapter 4. If you have not done so already, you may want to review the steps covered there.

Understanding components of Deequ

Deequ provides a lot of features to make data quality checks easy. The following diagram shows the major components:

Figure 7.1 – Components of Deequ

We can observe the following components:

  • Metrics computation: Deequ calculates data quality metrics such as completeness, maximum, and so on. You can directly access the raw metrics computed on the data.
  • Constraint verification: Given a set of data quality constraints that you define, Deequ derives the metrics it needs to compute on the data in order to verify those constraints.
  • Constraint suggestion: You can use Deequ’s automated constraint suggestion methods to infer useful constraints, or you can define your own customized data quality constraints.

In the background, Deequ uses Apache Spark for metrics computation, and thus it is fast and efficient. In the upcoming sections, we are going to cover these features...

Performing data analysis

Deequ can generate statistics, called metrics, on data. For example, we can use Deequ to provide us with the number of records in a dataset, tell us whether a particular column is unique, give us the degree of correlation between two columns, and so on. Deequ offers this functionality through case classes such as ApproxCountDistinct, Completeness, Correlation, and so on, defined in the com.amazon.deequ.analyzers package. For a complete list of metrics along with their definitions, please refer to https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/.

In the following example, we will be using the flight data that we loaded into a MySQL table named flights. We analyze the flights data to compute the record count, check whether the airline column contains any NULL values, obtain an approximate distinct count of origin_airport, and so on. The result set is then converted into a dataframe and printed on the screen.
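The following is a minimal sketch of that analysis. It reuses the Spark.initSparkSession helper from the book’s utils package, and the JDBC options shown are placeholders standing in for the database interface built in Chapter 4:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Size}
import org.apache.spark.sql.{DataFrame, SparkSession}
object AnalyzeFlights extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  // Placeholder JDBC read of the flights table loaded in Chapter 4.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // Compute the record count, the completeness of the airline column,
  // and an approximate distinct count of origin_airport.
  val analysisResult = AnalysisRunner
    .onData(flightsDF)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("airline"))
    .addAnalyzer(ApproxCountDistinct("origin_airport"))
    .run()
  // Convert the computed metrics into a dataframe and print it on the screen.
  AnalyzerContext
    .successMetricsAsDataFrame(session, analysisResult)
    .show(truncate = false)
}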

Leveraging automatic constraint suggestion

Deequ provides a powerful feature whereby it can analyze the data and suggest constraints that can be applied as checks. To see how it works, we will use the flights data once again. In Chapter 4, we defined an interface for working with databases, which we will use here to create a dataframe. We will then pass the dataframe to ConstraintSuggestionRunner so that Deequ can suggest constraints.

Here is the complete code for it:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.suggestions.ConstraintSuggestionResult
import com.amazon.deequ.suggestions.ConstraintSuggestionRunner
import com.amazon.deequ.suggestions.Rules
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
object ConstraintSuggestion extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  val...
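  // A hedged continuation sketch: the dataframe is built with a placeholder
  // JDBC read standing in for the Chapter 4 database interface, and Deequ's
  // default rule set drives the suggestions.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // Run the constraint suggestion engine against the flights dataframe.
  val suggestionResult: ConstraintSuggestionResult = ConstraintSuggestionRunner()
    .onData(flightsDF)
    .addConstraintRules(Rules.DEFAULT)
    .run()
  // Print each suggested constraint along with the column it applies to and
  // the Scala code Deequ generates for it.
  suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
    suggestions.foreach { suggestion =>
      println(s"Constraint suggestion for '$column': ${suggestion.description}")
      println(s"Corresponding code: ${suggestion.codeForConstraint}")
    }
  }
}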

Defining constraints

In the previous sections, we looked at how Deequ can gather various metrics on the data as well as automatically suggest constraints. We will now define the actual constraints that we expect the dataframe to pass. Specifically, we expect the flights data to satisfy the following constraints:

  • The airline column should not contain any NULL values
  • The flight_number column should not contain any NULL values
  • The cancelled column should contain only 0 or 1
  • The distance column should not contain any negative value
  • The cancellation_reason column should contain only A, B, C, or D

If all of the checks pass, we print data looks good on the console; otherwise, we print each failing constraint along with its result status.

Here is the code for it.

As a first step, we will create a dataframe using the flights table we loaded in MySQL:

  val session = Spark.initSparkSession("de-with-scala...
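The remaining steps are sketched below in a self-contained listing. The JDBC options are placeholders for the Chapter 4 database interface, and the five constraints listed earlier are expressed with Deequ’s Check API:

package com.packt.dewithscala.chapter7
import com.packt.dewithscala.utils._
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus
import org.apache.spark.sql.{DataFrame, SparkSession}
object DefineConstraints extends App {
  val session: SparkSession = Spark.initSparkSession("de-with-scala")
  // Placeholder JDBC read of the flights table loaded in Chapter 4.
  val flightsDF: DataFrame = session.read
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<database>")
    .option("dbtable", "flights")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
  // The five constraints described above, expressed as a single check.
  val flightChecks = Check(CheckLevel.Error, "flights data checks")
    .isComplete("airline")                        // no NULLs in airline
    .isComplete("flight_number")                  // no NULLs in flight_number
    .isContainedIn("cancelled", Array("0", "1"))  // only 0 or 1
    .isNonNegative("distance")                    // no negative distances
    .isContainedIn("cancellation_reason", Array("A", "B", "C", "D"))
  val verificationResult: VerificationResult = VerificationSuite()
    .onData(flightsDF)
    .addCheck(flightChecks)
    .run()
  if (verificationResult.status == CheckStatus.Success) {
    println("data looks good")
  } else {
    // Print every failing constraint along with its result status.
    verificationResult.checkResults
      .flatMap { case (_, checkResult) => checkResult.constraintResults }
      .filter(_.status != ConstraintStatus.Success)
      .foreach(result => println(s"${result.constraint}: ${result.status}"))
  }
}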

Storing metrics using MetricsRepository

Deequ allows us to store the metrics we calculate on a dataframe using MetricsRepository. Deequ provides facilities to create both in-memory and file-based repositories. File-based repositories support local filesystems, Simple Storage Service (S3), and Hadoop Distributed File System (HDFS). Persisting data quality metrics allows us to analyze trends and spot any volatility in the data.

Creating an in-memory repository is simple, as the next example shows:

val inMemoryRepo = new InMemoryMetricsRepository()

Example 7.6

Similarly, we can create a file-based repository as follows:

val fileRepo = FileSystemMetricsRepository(sparkSession, filePath)

Example 7.7

The metrics for each run are stored using a key of type ResultKey. ResultKey is defined as a case class with the following signature:

case class ResultKey(dataSetDate: Long, tags: Map[String, String] = Map.empty)

Example 7.8

Here is an example key of...
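For illustration, the sketch below builds such a key from the current timestamp and a couple of hypothetical tags, then attaches an in-memory repository to an analysis run so that the computed metrics are stored under that key (flightsDF stands for the dataframe used in the earlier examples):

import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.{Completeness, Size}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val repository = new InMemoryMetricsRepository()
// Key the metrics by run timestamp plus free-form tags.
val resultKey = ResultKey(
  System.currentTimeMillis(),
  Map("source" -> "flights", "environment" -> "dev")
)
// Compute metrics on flightsDF and persist them in the repository under the key.
AnalysisRunner
  .onData(flightsDF)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("airline"))
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .run()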

Detecting anomalies

Deequ supports anomaly detection in data by using metrics stored in MetricsRepository, which we covered in the previous section. For example, we can create a rule to check whether the number of records has increased by more than 50% compared to the previous run. If it has, the check will fail.

To show how it works, we will use a fictitious scenario where we receive a batch of products to be added to the inventory each day. We want to check whether the number of products we receive on any given day has increased by more than 50% compared to the previous run. For this example, we will use an in-memory repository to store the metrics. As we have done earlier, let’s define the dataframes we will use in this example:

  val session = Spark.initSparkSession("de-with-scala")
  import session.implicits._
  val yesterdayDF = Seq((1, "Product 1", 100), (2, "Product 2", 50)).toDF(
  "product_id",...

Summary

We began this chapter by outlining why it is imperative to have data quality checks in place for any data pipeline. We then introduced the Deequ library developed by Amazon and its various components. Deequ uses Spark at its core, thereby leveraging the distributed processing that comes with it. We then took a deep dive into the various functionalities offered by Deequ, such as the automatic suggestion of constraints, defining constraints, metrics repositories, and so on.

In the next chapter, we are going to look at code health and maintainability, along with test-driven development (TDD), which is vital for a scalable and easily maintainable code base.
