Serverless ETL and Analytics with AWS Glue


Product type: Book
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781800564985
Pages: 434
Edition: 1st
Authors (6):
Vishal Pathak
Subramanya Vajiraya
Noritaka Sekiyama
Tomohiro Tanaka
Albert Quiroga
Ishan Gaur

Table of Contents (20 chapters)

Preface

Section 1 – Introduction, Concepts, and the Basics of AWS Glue
Chapter 1: Data Management – Introduction and Concepts
Chapter 2: Introduction to Important AWS Glue Features
Chapter 3: Data Ingestion

Section 2 – Data Preparation, Management, and Security
Chapter 4: Data Preparation
Chapter 5: Data Layouts
Chapter 6: Data Management
Chapter 7: Metadata Management
Chapter 8: Data Security
Chapter 9: Data Sharing
Chapter 10: Data Pipeline Management

Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
Chapter 11: Monitoring
Chapter 12: Tuning, Debugging, and Troubleshooting
Chapter 13: Data Analysis
Chapter 14: Machine Learning Integration
Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases

Other Books You May Enjoy

Chapter 4: Data Preparation

In the previous chapter, we explored the fundamental concepts surrounding data ingestion and how we can leverage AWS Glue to ingest data from various sources, such as file/object stores, JDBC data stores, streaming data sources, and SaaS data stores. We also discussed, with practical examples, features of AWS Glue ETL such as schema flexibility, schema conflict resolution, advanced ETL transformations and extensions, incremental data ingestion using job bookmarks, grouping, and workload partitioning using bounded execution. Doing so showed how each of these features can be used to ingest data from data stores in specific use cases.

In this chapter, we will introduce the fundamental concepts of data preparation, strategies for choosing the right service/tool for a specific use case, and both visual and programmatic data preparation using AWS Glue.

Upon completing this chapter...

Technical requirements

Please refer to the Technical requirements section in Chapter 3, Data Ingestion; the requirements for this chapter are the same.

In the upcoming sections, we will discuss the fundamental concepts of data preparation, why it matters, and how we can prepare data using the different tools/services in AWS Glue.

Introduction to data preparation

Data preparation can be defined as the process of sanitizing and normalizing the dataset using a combination of transformations to prepare the data for downstream consumers. In a typical data integration workflow, prepared data is consumed by analytics applications, visualization tools, and machine learning pipelines. It is not uncommon for the prepared data to be ingested by other data processing pipelines, depending on the requirements of the consuming entity.
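To make the sanitize-and-normalize idea concrete, here is a minimal plain-Python sketch of a preparation pass; the record layout, field names, and rules are hypothetical examples, not AWS Glue API code:

```python
# Minimal sketch of a data preparation pass: sanitize, then normalize.
# The record layout and rules here are hypothetical examples.

def sanitize(record):
    """Drop records missing required fields and strip stray whitespace."""
    if record.get("customer_id") is None:
        return None  # reject unusable records
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize(record):
    """Coerce values into a consistent shape for downstream consumers."""
    record["email"] = record["email"].lower()
    record["country"] = {"USA": "US", "U.S.": "US"}.get(record["country"], record["country"])
    return record

raw = [
    {"customer_id": 1, "email": " Alice@Example.COM ", "country": "USA"},
    {"customer_id": None, "email": "bad@example.com", "country": "US"},  # dropped
]

prepared = [normalize(r) for r in (sanitize(r) for r in raw) if r is not None]
print(prepared)
# → [{'customer_id': 1, 'email': 'alice@example.com', 'country': 'US'}]
```

In a real pipeline, the same two-stage shape applies, only at scale: the sanitize and normalize steps become transformations applied across a distributed dataset rather than a Python list.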

When we consider a typical data integration workflow, quite often, data preparation is one of the more challenging and time-consuming tasks. It is important to ensure the data is prepared correctly according to the requirements as this impacts the subsequent steps in the data integration workflow significantly.

The complexity of the data preparation process depends on several factors, such as the schema of the source data, schema drift, the volume of data, the transformations to be applied...

Data preparation using AWS Glue

Data naturally grows over time in both volume and complexity, given the huge number of applications and devices generating data in a typical organization. Ingesting and preparing this ever-growing data demands tremendous resources – both people and compute.

AWS Glue makes it easy for individuals with varying levels of skill to collaborate on data preparation tasks. For instance, novice users with no programming skills can take advantage of AWS Glue DataBrew (https://aws.amazon.com/glue/features/databrew/), a visual data preparation tool that allows data engineers/analysts/scientists to interact with and prepare the data using a variety of pre-built transformations and filtering mechanisms without writing any code.

While AWS Glue DataBrew is a great tool for preparing data using a graphical user interface (GUI), there are some use cases where the built...

Selecting the right service/tool

In the previous sections, we looked at the different features, transformations, and extensions/APIs available in AWS Glue DataBrew, AWS Glue Studio, and AWS Glue ETL for preparing data. With all the choices available and the varying feature sets of these tools, how do we pick one for our use case? There is no hard-and-fast rule; the choice depends on several factors that must be weighed for each use case.

As discussed earlier in this chapter, AWS Glue DataBrew empowers data analysts and data scientists to prepare data without writing source code. AWS Glue ETL, on the other hand, has a steeper learning curve and requires Python/Scala programming knowledge and a fundamental understanding of Apache Spark. So, if the individuals preparing the data are not skilled in AWS Glue/Spark ETL programming, they can use AWS Glue DataBrew.

One of the important factors to consider while...

Summary

In this chapter, we discussed the fundamental concepts and importance of data preparation within a data integration workflow. We explored how we can prepare data in AWS Glue using both visual interfaces and source code.

We explored different features of AWS Glue DataBrew and saw how we can implement profile jobs to profile the data and gather insights about the dataset being processed. We also saw how to enrich the data profile with a DQ Ruleset, detect and redact PII, and perform column encryption using deterministic and probabilistic encryption. Finally, we discussed how to apply transformations, build a recipe from those transformations, create a job using that recipe, and run the job.
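The deterministic/probabilistic distinction is worth making concrete. The sketch below uses only the Python standard library and a keyed hash as a stand-in for real encryption (it is one-way, unlike DataBrew's actual ciphers), purely to show the behavioral difference: deterministic protection maps the same value to the same output, so joins and grouping still work, while probabilistic protection randomizes each output, so values cannot be correlated. The key and values are hypothetical.

```python
# Sketch of deterministic vs. probabilistic column protection using only
# the standard library. Real services use proper ciphers (e.g. AES-GCM);
# a keyed hash is used here only to illustrate the behavioral difference.
import hashlib
import hmac
import os

KEY = b"hypothetical-secret-key"

def deterministic(value: str) -> str:
    # Same input + same key -> same output, so equality joins still work.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

def probabilistic(value: str) -> str:
    # A fresh random salt makes each output unique, so two protected copies
    # of the same value cannot be matched against each other.
    salt = os.urandom(8)
    digest = hmac.new(KEY, salt + value.encode(), hashlib.sha256).hexdigest()
    return salt.hex() + ":" + digest

print(deterministic("alice@example.com") == deterministic("alice@example.com"))  # True
print(probabilistic("alice@example.com") == probabilistic("alice@example.com"))  # False
```

This trade-off is the practical reason for choosing one mode over the other: deterministic protection preserves analytical utility (joins, group-bys, deduplication) at the cost of leaking equality patterns, whereas probabilistic protection hides even those.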

Then, we discussed source code-based ETL development using AWS Glue ETL jobs and the different features of AWS Glue Studio before exploring some of the popular transformations and extensions available in AWS Glue ETL. We saw how these transformations can be used in specific...
