Serverless ETL and Analytics with AWS Glue


Product type: Book
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781800564985
Pages: 434
Edition: 1st
Authors (6):
Vishal Pathak
Subramanya Vajiraya
Noritaka Sekiyama
Tomohiro Tanaka
Albert Quiroga
Ishan Gaur

Table of Contents (20 chapters)

Preface

Section 1 – Introduction, Concepts, and the Basics of AWS Glue
Chapter 1: Data Management – Introduction and Concepts
Chapter 2: Introduction to Important AWS Glue Features
Chapter 3: Data Ingestion

Section 2 – Data Preparation, Management, and Security
Chapter 4: Data Preparation
Chapter 5: Data Layouts
Chapter 6: Data Management
Chapter 7: Metadata Management
Chapter 8: Data Security
Chapter 9: Data Sharing
Chapter 10: Data Pipeline Management

Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
Chapter 11: Monitoring
Chapter 12: Tuning, Debugging, and Troubleshooting
Chapter 13: Data Analysis
Chapter 14: Machine Learning Integration
Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases

Other Books You May Enjoy

Chapter 4: Data Preparation

In the previous chapter, we explored the fundamental concepts surrounding data ingestion and how we can leverage AWS Glue to ingest data from various sources, such as file/object stores, JDBC data stores, streaming data sources, and SaaS data stores. We also discussed, with practical examples, features of AWS Glue ETL such as schema flexibility, schema conflict resolution, advanced ETL transformations and extensions, incremental data ingestion using job bookmarks, grouping, and workload partitioning using bounded execution. Doing so showed how each of these features can be used to ingest data from data stores in specific use cases.

In this chapter, we will introduce the fundamental concepts of data preparation, strategies for choosing the right service/tool for a specific use case, and both visual and programmatic data preparation using AWS Glue.

Upon completing this chapter...

Technical requirements

Please refer to the Technical requirements section in Chapter 3, Data Ingestion; the requirements for this chapter are the same.

In the upcoming sections, we will discuss the fundamental concepts of data preparation, why it matters, and how we can prepare data using the different tools/services in AWS Glue.

Introduction to data preparation

Data preparation can be defined as the process of sanitizing and normalizing the dataset using a combination of transformations to prepare the data for downstream consumers. In a typical data integration workflow, prepared data is consumed by analytics applications, visualization tools, and machine learning pipelines. It is not uncommon for the prepared data to be ingested by other data processing pipelines, depending on the requirements of the consuming entity.
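To make the sanitize-and-normalize idea concrete, here is a minimal plain-Python sketch of a preparation pass; the record layout, field names, and rules are hypothetical examples, not AWS Glue API code:

```python
# Minimal sketch of a data preparation pass: sanitize, then normalize.
# The record layout and rules here are hypothetical examples.

def sanitize(record):
    """Drop records missing required fields and strip stray whitespace."""
    if record.get("customer_id") is None:
        return None  # reject unusable records
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize(record):
    """Coerce values into a consistent shape for downstream consumers."""
    record["email"] = record["email"].lower()
    record["country"] = {"USA": "US", "U.S.": "US"}.get(record["country"], record["country"])
    return record

raw = [
    {"customer_id": 1, "email": " Alice@Example.COM ", "country": "USA"},
    {"customer_id": None, "email": "bad@example.com", "country": "US"},  # dropped
]

prepared = [normalize(r) for r in (sanitize(r) for r in raw) if r is not None]
print(prepared)
# → [{'customer_id': 1, 'email': 'alice@example.com', 'country': 'US'}]
```

In a real pipeline, the same two-stage shape applies, only at scale: the sanitize and normalize steps become transformations applied across a distributed dataset rather than a Python list.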

When we consider a typical data integration workflow, quite often, data preparation is one of the more challenging and time-consuming tasks. It is important to ensure the data is prepared correctly according to the requirements as this impacts the subsequent steps in the data integration workflow significantly.

The complexity of the data preparation process depends on several factors, such as the schema of the source data, schema drift, the volume of data, the transformations to be applied...

Data preparation using AWS Glue

Data naturally grows over time in both volume and complexity, given the huge number of applications and devices generating data in a typical organization. Ingesting and preparing this ever-growing data demands tremendous resources – both people and compute.

AWS Glue makes it easy for individuals with varying levels of skill to collaborate on data preparation tasks. For instance, novice users with no programming skills can take advantage of AWS Glue DataBrew (https://aws.amazon.com/glue/features/databrew/), a visual data preparation tool that allows data engineers/analysts/scientists to interact with and prepare the data using a variety of pre-built transformations and filtering mechanisms without writing any code.

While AWS Glue DataBrew is a great tool for preparing data using a graphical user interface (GUI), there are some use cases where the built...

Selecting the right service/tool

In the previous sections, we looked at the different features, transformations, and extensions/APIs available in AWS Glue DataBrew, AWS Glue Studio, and AWS Glue ETL for preparing data. With all the choices available and the varying feature sets of these tools, how do we pick one for our use case? There is no hard-and-fast rule; the choice depends on several factors that must be weighed for each use case.

As discussed earlier in this chapter, AWS Glue DataBrew empowers data analysts and data scientists to prepare data without writing source code. AWS Glue ETL, on the other hand, has a steeper learning curve and requires Python/Scala programming knowledge and a fundamental understanding of Apache Spark. So, if the individuals preparing the data are not skilled in AWS Glue/Spark ETL programming, they can use AWS Glue DataBrew.

One of the important factors to consider while...

Summary

In this chapter, we discussed the fundamental concepts and importance of data preparation within a data integration workflow. We explored how we can prepare data in AWS Glue using both visual interfaces and source code.

We explored different features of AWS Glue DataBrew and saw how we can implement profile jobs to profile the data and gather insights about the dataset being processed. We also saw how to enrich the data profile with a DQ Ruleset, detect and redact PII, and perform column encryption using deterministic and probabilistic encryption. Finally, we discussed how to apply transformations, build a recipe from those transformations, create a job using that recipe, and run the job.
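The deterministic/probabilistic distinction is worth making concrete. The sketch below uses only the Python standard library and a keyed hash as a stand-in for real encryption (it is one-way, unlike DataBrew's actual ciphers), purely to show the behavioral difference: deterministic protection maps the same value to the same output, so joins and grouping still work, while probabilistic protection randomizes each output, so values cannot be correlated. The key and values are hypothetical.

```python
# Sketch of deterministic vs. probabilistic column protection using only
# the standard library. Real services use proper ciphers (e.g. AES-GCM);
# a keyed hash is used here only to illustrate the behavioral difference.
import hashlib
import hmac
import os

KEY = b"hypothetical-secret-key"

def deterministic(value: str) -> str:
    # Same input + same key -> same output, so equality joins still work.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

def probabilistic(value: str) -> str:
    # A fresh random salt makes each output unique, so two protected copies
    # of the same value cannot be matched against each other.
    salt = os.urandom(8)
    digest = hmac.new(KEY, salt + value.encode(), hashlib.sha256).hexdigest()
    return salt.hex() + ":" + digest

print(deterministic("alice@example.com") == deterministic("alice@example.com"))  # True
print(probabilistic("alice@example.com") == probabilistic("alice@example.com"))  # False
```

This trade-off is the practical reason for choosing one mode over the other: deterministic protection preserves analytical utility (joins, group-bys, deduplication) at the cost of leaking equality patterns, whereas probabilistic protection hides even those.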

Then, we discussed source code-based ETL development using AWS Glue ETL jobs and the different features of AWS Glue Studio before exploring some of the popular transformations and extensions available in AWS Glue ETL. We saw how these transformations can be used in specific...
