You're reading from Serverless ETL and Analytics with AWS Glue

Product type: Book
Published in: Aug 2022
Reading level: Expert
Publisher: Packt
ISBN-13: 9781800564985
Edition: 1st Edition

Authors (6):
Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Before joining AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney, specializing in AWS Glue. He obtained his Bachelor of Engineering degree, specializing in Information Science & Engineering, from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015, and his Master of Information Technology degree, specializing in Internetworking, from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes using AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and DataStage. He currently works at AWS as a senior big data cloud engineer and is an SME on AWS Glue. He is responsible for helping customers build large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.


Chapter 6: Data Management

In the previous chapter, you learned how to optimize your data layout to accelerate performance in query engines and manage data optimally to reduce costs. That is an important topic, but it is only one aspect of a data lake. As the volume of data grows, a data lake comes to be used by many different stakeholders – not only data engineers and software engineers but also data analysts, data scientists, and sales and marketing representatives. The original data is often hard for these stakeholders to use because raw data may not be well structured. To make business decisions quickly and effectively, it is important to manage, clean up, and enrich the data so that these stakeholders can understand it correctly, derive insights from it without confusion, correlate datasets, and drive their business based on data.

In this chapter, you will learn how to manage, clean up, and enrich the data in typical data requirements...

Technical requirements

For this chapter, you will need the following resources:

  • An AWS account
  • An AWS IAM role
  • An Amazon S3 bucket

All the sample code needs to be executed in a Glue runtime (for example, the Glue job system, Glue Interactive Sessions, a Glue Studio notebook, or a Glue Docker container). If you have no particular preference, we recommend using a Glue Studio notebook so that you can start writing code easily. To use a Glue Studio notebook, follow these steps:

  1. Open the AWS Glue console.
  2. Click AWS Glue Studio.
  3. Click Jobs.
  4. Under Create job, click Jupyter Notebook, then Create.
  5. For Job name, enter your preferred job name.
  6. For IAM Role, choose an IAM role that has sufficient permissions.
  7. Click Start notebook job.
  8. Wait for the notebook to start.
  9. Write the necessary code and run the cells in the notebook (a minimal starter cell is sketched below).
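
The first cell of a Glue Studio notebook typically sets up the Glue and Spark contexts before any transforms run. The following is a minimal sketch of that starter cell (session magics such as %idle_timeout or %glue_version are configured separately in the notebook):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Reuse the SparkContext provided by the Glue session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session  # SparkSession for DataFrame operations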

Let’s begin!

Normalizing data

Data normalization is a technique for cleaning data. There are different techniques for normalizing data that make it easy to understand and analyze. This section covers the following techniques and use cases:

  • Casting data types and mapping column names
  • Inferring schemas
  • Computing schemas on the fly
  • Enforcing schemas
  • Flattening nested schemas
  • Normalizing scale
  • Handling missing values and outliers
  • Normalizing date and time values
  • Handling error records

Let’s dive in!

Casting data types and mapping column names

A data lake typically ingests data from many different sources, which can cause inconsistencies in data types or column names. For example, when you join multiple tables that are inconsistent with each other, the mismatch can cause query errors or invalid calculations. To avoid such issues and make further analytics easier, it is a good approach to cast the data types and apply mapping to the data during the...
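
To give a concrete flavor of this, Glue's ApplyMapping transform casts types and renames columns in a single pass. The following is a minimal sketch; the DynamicFrame dyf and the column names are hypothetical:

from awsglue.transforms import ApplyMapping

# Each mapping is (source column, source type, target column, target type)
mapped_dyf = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("custId", "string", "customer_id", "long"),
        ("regDate", "string", "registration_date", "timestamp"),
    ],
)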

Deduplicating records

When you start analyzing your business data, you may find incorrect records – for example, multiple different notations of the same record.

The following example table contains duplicates:

Figure 6.3 – Customer table with duplicates

As you may have noticed, there are only four unique records in the preceding table. Two of the records each appear in two different notations, which causes duplication. If you analyze data containing these kinds of duplicated records, the result may include unexpected bias, giving you an incorrect result.

With AWS Glue, you can use the FindMatches transform to find duplicated records. FindMatches is one of the ETL transforms provided in the Glue ETL library. With the FindMatches transform, you can match records and identify and remove duplicates based on a machine learning (ML) model.
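
As a sketch of how the transform is applied in code, once you have created and trained a FindMatches ML transform in the Glue console, you can call it on a DynamicFrame. The frame name and transform ID below are placeholders:

from awsglueml.transforms import FindMatches

# Labels records that the trained model considers the same entity
matched_dyf = FindMatches.apply(
    frame=customers_dyf,
    transformId="tfm-0123456789abcdef",  # placeholder for your trained transform's ID
)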

Let’s look at the end-to-end matching process:

  1. Register a table definition for your data in AWS Glue Data...

Denormalizing tables

In this section, we will look at an example use case. A fictional e-commerce company sells products through a website that allows people to buy them. The web system stores three tables – two dimension tables, product and customer, and one fact table, sales. The product table stores each product's name, category, and price. The customer table stores individual customer names, email addresses, and phone numbers. These email addresses and phone numbers are sensitive pieces of information that need to be handled carefully. Every time a customer buys a product, that activity is recorded by inserting a new record into the sales table.

The following is the product dimension table:

Figure 6.5 – Product table

The following code can be used to populate the preceding sample data in a Spark DataFrame:

df_product = spark.createDataFrame...
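
The actual values come from Figure 6.5; the following sketch uses hypothetical rows and column names to show the shape of the code, together with the joins that denormalize the fact table (df_customer and df_sales are assumed to be built the same way):

df_product = spark.createDataFrame(
    [(1, "Laptop", "Computers", 1200.0),
     (2, "Monitor", "Computers", 300.0)],
    ["product_id", "product_name", "category", "price"])

# Denormalize: enrich each sales record with product and customer attributes
df_denormalized = (df_sales
    .join(df_product, "product_id")
    .join(df_customer, "customer_id"))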

Securing data content

In the context of a data lake, security is a “job zero” priority. In Chapter 8, Data Security, we will dive deep into security. In this section, we cover basic ETL operations that secure data. The following common techniques can be used to hide confidential values from data:

  • Masking values
  • Hashing values

In this section, you will learn how to mask/hash values that are included in your data.

Masking values

In business data lakes, the data can contain sensitive information, such as people's names, phone numbers, credit card numbers, and so on. Data security is an important aspect of data lakes, and there are different approaches to handling such data securely. If you won't use the sensitive fields in analytics, it is a good idea to simply drop them when you collect the data from the data sources. It is also common to manage access permissions on certain columns or records of the data. Another approach is to mask the...
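
To give a flavor of these operations in Spark (not necessarily the exact code used later in this section), masking can be done by redacting part of a column, and hashing by applying a one-way function such as SHA-256; the column names here are hypothetical:

from pyspark.sql import functions as F

# Mask all but the last four digits of a phone number
df_masked = df.withColumn(
    "phone", F.regexp_replace("phone", r"\d(?=\d{4})", "*"))

# Replace the email column with its SHA-256 hash (one-way, not reversible)
df_hashed = df_masked.withColumn("email", F.sha2(F.col("email"), 256))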

Managing data quality

When you build a modern data architecture from different data sources, the incoming data may contain incorrect, missing, or malformed records. This can make data applications fail, and it can also result in incorrect business decisions due to incorrect data aggregations. However, it is hard to evaluate the quality of the data without an automated mechanism. Today, it is important to manage data quality by applying predefined rules and verifying whether the data meets those criteria.

Different frameworks can be used to monitor data quality. In this section, we will introduce two mechanisms: AWS Glue DataBrew data quality rules and Deequ.

AWS Glue DataBrew data quality rules

Glue DataBrew data quality rules allow you to detect typical data issues easily. In this section, we will use a human resources dataset (https://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/).

Follow these steps to manage...
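
The DataBrew walkthrough continues with console steps. For the second mechanism, Deequ (via its Python wrapper, PyDeequ) expresses quality rules as checks that run directly on a Spark DataFrame. The following is a minimal sketch, assuming the pydeequ package and the Deequ JAR are available in your Spark environment and that df has an employee_id column:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = Check(spark, CheckLevel.Error, "hr dataset checks")
result = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check.isComplete("employee_id")   # no null values allowed
                   .isUnique("employee_id"))    # no duplicate values allowed
    .run())

# One row per constraint, with its status and any failure message
VerificationResult.checkResultsAsDataFrame(spark, result).show()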

Summary

In this chapter, you learned how to manage, clean up, and enrich your data using various functionalities available in AWS Glue and Apache Spark. In terms of normalizing data, you looked at several techniques, including schema enforcement, timestamp handling, and others. To deduplicate records, you experimented with ML transforms on a sample dataset, while to denormalize tables, you joined multiple tables and enriched the data to optimize the analytic workload. When learning about masking and hashing values, you performed basic ETL to improve security. Moreover, you learned that Glue PII Detection helps you choose confidential columns dynamically. Finally, you learned how to manage data quality with Glue DataBrew data quality rules and Deequ.

In the next chapter, you will learn about the best practices for managing metadata on data lakes.
