Serverless ETL and Analytics with AWS Glue

Chapter 6: Data Management

In the previous chapter, you learned how to optimize your data layout to accelerate query engine performance and manage data efficiently to reduce costs. That is an important topic, but it is only one aspect of a data lake. As the volume of data grows, a data lake comes to be used by many different stakeholders – not only data engineers and software engineers but also data analysts, data scientists, and sales and marketing representatives. The original data is often not easy for these stakeholders to use because the raw data may not be well structured. To make business decisions quickly and effectively, it is important to manage, clean up, and enrich the data so that every stakeholder can understand it correctly, find insights without confusion, correlate datasets, and drive their business based on data.

In this chapter, you will learn how to manage, clean up, and enrich the data in typical data requirements...

Technical requirements

For this chapter, you will need the following resources:

  • An AWS account
  • An AWS IAM role
  • An Amazon S3 bucket

All the sample code needs to be executed in a Glue runtime (for example, the Glue job system, Glue Interactive Sessions, a Glue Studio notebook, a Glue Docker container, and so on). If you do not have a preference, we recommend using a Glue Studio notebook so that you can start writing code easily. To use a Glue Studio notebook, follow these steps:

  1. Open the AWS Glue console.
  2. Click AWS Glue Studio.
  3. Click Jobs.
  4. Under Create job, click Jupyter Notebook, then Create.
  5. For Job name, enter your preferred job name.
  6. For IAM Role, choose an IAM role that has sufficient permissions.
  7. Click Start notebook job.
  8. Wait for the notebook to start.
  9. Write the necessary code and run the cells in the notebook (a minimal first cell is sketched below).
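
If you are working in a Glue Studio notebook, a minimal first cell might look like the following sketch. The magics and imports are similar to what Glue Studio pre-populates by default; the session settings (Glue version, worker type, and worker count) are assumptions you can adjust:

%glue_version 3.0
%idle_timeout 60
%worker_type G.1X
%number_of_workers 2

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Spark and Glue contexts that the later snippets in this chapter rely on
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)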

Let’s begin!

Normalizing data

Data normalization is a technique for cleaning data. There are different techniques for normalizing data that make it easy to understand and analyze. This section covers the following techniques and use cases:

  • Casting data types and mapping column names
  • Inferring schemas
  • Computing schemas on the fly
  • Enforcing schemas
  • Flattening nested schemas
  • Normalizing scale
  • Handling missing values and outliers
  • Normalizing date and time values
  • Handling error records

Let’s dive in!

Casting data types and mapping column names

In the context of data lakes, there can be many different data sources, which may cause inconsistencies in data types or column names. For example, when you join multiple tables that have such inconsistencies, you can run into query errors or invalid calculations. To avoid these issues and make further analytics easier, it is a good approach to cast the data types and apply column mapping to the data during the...
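
A common way to do both in Glue is the ApplyMapping transform on a DynamicFrame. The following is a minimal sketch; the database, table, and column names, as well as the target types, are hypothetical:

from awsglue.transforms import ApplyMapping

# Load the source table from the Glue Data Catalog (names are placeholders)
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sample_db",
    table_name="customers",
)

# Rename columns and cast their types in a single pass
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("CustomerID", "string", "customer_id", "long"),
        ("Name", "string", "customer_name", "string"),
        ("RegisteredAt", "string", "registered_at", "timestamp"),
    ],
)
mapped_dyf.printSchema()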

Deduplicating records

When you start analyzing the business data, you may find that it is incorrect and that the same record appears in multiple different notations.

The following example table contains duplicates:

Figure 6.3 – Customer table with duplicates

As you may have noticed, there are only four unique records in the preceding table. Two of the records each appear in two different notations, which causes duplication. If you analyze data that contains these kinds of duplicates, the result may include unexpected bias and lead to incorrect conclusions.

With AWS Glue, you can use the FindMatches transform to find duplicated records. FindMatches is one of the ETL transforms provided in the Glue ETL library. With FindMatches, you can match records and identify and remove duplicates based on a machine learning (ML) model.

Let’s look at the end-to-end matching process:

  1. Register a table definition for your data in AWS Glue Data...
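
Once the FindMatches ML transform has been created and trained (the later steps in this process), applying it from a Glue job takes only a few lines. The following is a minimal sketch; the transform ID, database, and table names are placeholders:

from awsglueml.transforms import FindMatches

# Load the table that contains potential duplicates (names are placeholders)
customers_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sample_db",
    table_name="customers",
)

# Apply the trained FindMatches transform to label records that it considers matches
matched_dyf = FindMatches.apply(
    frame=customers_dyf,
    transformId="tfm-0123456789abcdef",  # placeholder ML transform ID
)
matched_dyf.toDF().show()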

Denormalizing tables

In this section, we will look at an example use case. There is a fictional e-commerce company that sells products and has a website that allows people to buy these products. There are three tables stored in the web system – two dimension tables, product and customer, and one fact table, sales. The product table stores the product’s name, category, and price. The customer table stores individual customer names, email addresses, and phone numbers. These email addresses and phone numbers are sensitive pieces of information that need to be handled carefully. When a customer buys a product, that activity is recorded in the sales table. One new record is inserted into the sales table every time a customer buys a product.

The following is the product dimension table:

Figure 6.5 – Product table

The following code can be used to populate the preceding sample data in a Spark DataFrame:

df_product = spark.createDataFrame...
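
The snippet above is truncated in this excerpt. As a rough, self-contained sketch, populating the three tables with hypothetical values (the real values appear in the book's figures) and then denormalizing the sales fact table by joining the dimension tables might look like this:

# Hypothetical sample rows; the actual values come from the figures in this section
df_product = spark.createDataFrame(
    [(1, "Laptop", "Computers", 1200.0), (2, "Headphones", "Audio", 150.0)],
    ["product_id", "product_name", "category", "price"],
)
df_customer = spark.createDataFrame(
    [(10, "Alice", "alice@example.com", "555-0100")],
    ["customer_id", "customer_name", "email", "phone"],
)
df_sales = spark.createDataFrame(
    [(100, 10, 1, "2022-08-01")],
    ["sales_id", "customer_id", "product_id", "sales_date"],
)

# Denormalize: join the fact table with both dimension tables
df_denormalized = (
    df_sales
    .join(df_product, on="product_id", how="left")
    .join(df_customer, on="customer_id", how="left")
)
df_denormalized.show()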

Securing data content

In the context of a data lake, security is a “job zero” priority. In Chapter 8, Data Security, we will dive deep into security. In this section, we cover basic ETL operations that secure data. The following common techniques can be used to hide confidential values from data:

  • Masking values
  • Hashing values

In this section, you will learn how to mask/hash values that are included in your data.

Masking values

In business data lakes, the data can contain sensitive information, such as people's names, phone numbers, credit card numbers, and so on. Data security is an important aspect of data lakes, and there are different approaches to handling such data securely. If you won't use the sensitive data in analytics, it is a good idea to simply drop it when you collect the data from its sources. It is also common to manage access permissions on certain columns or records of the data. Another approach is to mask the...
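
As a minimal sketch of both techniques using standard Spark SQL functions (the DataFrame and column names are hypothetical, reusing the customer table from the previous section), masking can be done with regexp_replace and hashing with sha2:

from pyspark.sql import functions as F

# Mask all but the last four digits of the phone number
df_masked = df_customer.withColumn(
    "phone",
    F.regexp_replace("phone", r"\d(?=\d{4})", "*"),
)

# Replace the email address with its SHA-256 hash so it can still be used as a join key
df_secured = df_masked.withColumn("email", F.sha2(F.col("email"), 256))
df_secured.show(truncate=False)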

Managing data quality

When you build a modern data architecture from different data sources, the incoming data may contain incorrect, missing, or malformed records. This can make data applications fail, and it can also lead to incorrect business decisions based on inaccurate aggregations. However, it is hard to evaluate data quality without an automated mechanism. Today, it is important to manage data quality by applying predefined rules and verifying whether the data meets those criteria.

Different frameworks can be used to monitor data quality. In this section, we will introduce two mechanisms: AWS Glue DataBrew data quality rules and Deequ.
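
For Deequ, a minimal sketch using the PyDeequ wrapper might look like the following. It assumes the pydeequ package and its Deequ JAR are available in your Glue environment, and the DataFrame and column names are hypothetical (reusing the denormalized table from the earlier sketch):

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# Define rules: customer_id must be complete, sales_id unique, and price non-negative
check = Check(spark, CheckLevel.Error, "basic data quality checks")

result = (
    VerificationSuite(spark)
    .onData(df_denormalized)
    .addCheck(
        check.isComplete("customer_id")
        .isUnique("sales_id")
        .isNonNegative("price")
    )
    .run()
)

# Inspect which constraints passed or failed
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)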

AWS Glue DataBrew data quality rules

Glue DataBrew data quality rules allow you to manage data quality and easily detect typical data issues. In this section, we will use a human resources dataset (https://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/).

Follow these steps to manage...

Summary

In this chapter, you learned how to manage, clean up, and enrich your data using various functionalities available in AWS Glue and Apache Spark. To normalize data, you looked at several techniques, including schema enforcement, timestamp handling, and others. To deduplicate records, you experimented with ML transforms on a sample dataset, and to denormalize tables, you joined multiple tables and enriched the data to optimize analytic workloads. When learning about masking and hashing values, you performed basic ETL to improve security. You also learned that Glue PII Detection helps you detect confidential columns dynamically. Finally, you learned how to manage data quality with Glue DataBrew data quality rules and Deequ.

In the next chapter, you will learn about the best practices for managing metadata on data lakes.
