
Transforming Data to Optimize for Analytics

In previous chapters, we covered how to architect a data pipeline and common ways of ingesting data into a data lake. We now turn to the process of transforming raw data to optimize it for analytics, enabling an organization to efficiently gain new insights from its data.

Transforming data to optimize it for analytics and create value for an organization is one of the key tasks of a data engineer, and there are many different types of transformations. Some transformations are common and can be applied generically to a dataset, such as converting raw files to Parquet format and partitioning the dataset. Other transformations embed business logic and vary based on the contents of the data and the specific business requirements.
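As a minimal sketch of this kind of generic transformation (the bucket paths and the order_date column are illustrative assumptions, not taken from this chapter's exercises), the following PySpark snippet converts raw CSV files to Parquet and partitions the output by year and month:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read raw CSV files from the landing zone (bucket and path are placeholders)
df = spark.read.csv("s3://my-landing-zone/sales/", header=True, inferSchema=True)

# Derive partition columns from a timestamp column assumed to exist in the data
df = (df
      .withColumn("year", F.year("order_date"))
      .withColumn("month", F.month("order_date")))

# Write the data out as Parquet, partitioned so queries can prune by date
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("s3://my-clean-zone/sales/"))

Partitioning by columns that frequently appear in query filters means that engines such as Amazon Athena can skip reading irrelevant files entirely.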

In this chapter, we review some of the engines that are available in AWS for performing data transformations, and we also discuss some of the more common transformations...

Technical requirements

For the hands-on tasks in this chapter, you need access to the AWS Glue service, including AWS Glue Studio. You also need to be able to create a new S3 bucket and new IAM policies.
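If you prefer to script that setup rather than use the console, a new S3 bucket can be created with a short boto3 call (the bucket name and region below are placeholders; bucket names must be globally unique):

import boto3

# Region and bucket name are examples only; choose your own unique name
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-datalake-curated-zone-example")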

You can find the code files for this chapter in the GitHub repository at the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter07

Overview of how transformations can create value

As we have discussed in various places throughout this book, data can be one of the most valuable assets that an organization owns. However, raw, siloed data has limited value on its own, and we unlock the real value of an organization’s data when we combine various raw datasets and transform that data through an analytics pipeline.

Cooking, baking, and data transformations

Look at the following list of food items and consider whether you enjoy eating them:

  • Sugar
  • Butter
  • Eggs
  • Milk

For many people, these are pretty standard food items, and some (like the eggs and milk) may be consumed on their own, while others (like the sugar and the butter) are generally consumed with something else, such as adding sugar to your coffee or tea or spreading butter on bread.

But, if you take those items and add a few more (like flour and baking powder) and combine all the items in just the right...

Types of data transformation tools

As we covered in Chapter 3, The AWS Data Engineer’s Toolkit, there are a number of AWS services that can be used for data transformation. We reviewed several of these services in that chapter, so it is worth revisiting, but in this section, we look more broadly at the different types of data transformation engines.

Apache Spark

Apache Spark is an in-memory engine for working with large datasets, providing a mechanism to split a dataset across multiple nodes in a cluster for efficient processing. Spark is an extremely popular engine for processing and transforming big datasets, and there are multiple ways to run Spark jobs within AWS.

With Apache Spark, you can either process data in batches (such as on a daily basis or every few hours) or process near real-time streaming data using Spark Streaming. In addition, you can use Spark SQL to process data using standard SQL, and Spark ML for applying machine learning...
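As a brief, hypothetical illustration of the Spark SQL interface (the dataset path, view name, and column names are assumptions made for the example), a DataFrame can be registered as a temporary view and queried with standard SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load a Parquet dataset (path is a placeholder) and register it as a view
orders = spark.read.parquet("s3://my-clean-zone/orders/")
orders.createOrReplaceTempView("orders")

# Aggregate the data using standard SQL
daily_totals = spark.sql("""
    SELECT order_date, SUM(order_total) AS total_sales
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_totals.show()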

Common data preparation transformations

The first set of transformations that we look at are those that help prepare the data for further transformations later in the pipeline. These transformations are designed to apply relatively generic optimizations to individual datasets that we are ingesting into the data lake. For these optimizations, you may need some understanding of the source data system and context, but, generally, you do not need to understand the ultimate business use case for the dataset.

Protecting PII data

Often, the datasets that we ingest may contain personally identifiable information (PII), and there may be governance restrictions on which PII can be stored in the data lake. As a result, we need a process that protects PII as soon as possible after it is ingested.

There are a number of common approaches that can be used here (such as tokenization or hashing), each with its own advantages and disadvantages, as we discussed in...
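To make the hashing approach concrete, here is a minimal PySpark sketch that replaces an email column with a salted SHA-256 hash before the data is written downstream (the column name and inline salt are illustrative; in practice, you would retrieve the salt from a service such as AWS Secrets Manager):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii-hashing").getOrCreate()

SALT = "example-salt"  # Illustrative only; never hardcode a real salt

# Replace the raw email value with a salted SHA-256 hash
customers = spark.read.parquet("s3://my-landing-zone/customers/")
customers = customers.withColumn(
    "email",
    F.sha2(F.concat(F.lit(SALT), F.col("email")), 256)
)
customers.write.mode("overwrite").parquet("s3://my-clean-zone/customers/")

Note that, unlike tokenization, hashing is one-way: the original email address cannot be recovered from the hash, although the same input always produces the same hash, so records can still be joined on the hashed value.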

Common business use case transformations

In a data lake environment, you generally ingest data from many different source systems into a landing, or raw, zone. You then optimize the file format and partition the dataset, as well as applying cleansing rules to the data, potentially now storing the data in a different zone, often referred to as the clean zone. At this point, you may also apply updates to the dataset with CDC-type data and create the latest view of the data, which we examine in the next section.

The initial transformations we covered in the previous section could be completed without needing to understand much about how the business will ultimately use the data. At that point, we were still working on individual datasets that downstream transformation pipelines will use to prepare the data for business analytics.

But at some point, you, or another data engineer working for a line of business, are going to need to use a variety of...

Working with Change Data Capture (CDC) data

One of the most challenging aspects of working in a data lake environment is processing updates to existing data, such as Change Data Capture (CDC) data. We have discussed CDC data previously, but as a reminder, this is data that contains updates to an existing dataset.

A good example of this is data that comes from a relational database system. After the initial loading of data into the data lake is complete, a system (such as AWS Database Migration Service, or AWS DMS) can read the database transaction logs and write all future database updates to Amazon S3. For each row written to Amazon S3, the first column of the CDC file contains one of the following characters (see the section on AWS DMS in Chapter 3, The AWS Data Engineer’s Toolkit, for an example of a CDC file generated by AWS DMS):

  • I – Insert: This indicates that this row contains data that was newly inserted into the table
  • U – Update: This indicates...
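A common pattern for applying a CDC file like this to an existing dataset is to union the new records with the current data, keep only the most recent record for each primary key, and drop any key whose latest operation was a delete. The following PySpark sketch assumes hypothetical column names (customer_id as the primary key, op as the I/U/D flag, and a last_updated timestamp present in both datasets):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("apply-cdc").getOrCreate()

existing = spark.read.parquet("s3://my-clean-zone/customers/")
cdc = spark.read.csv("s3://my-landing-zone/cdc/customers/",
                     header=True, inferSchema=True)

# Tag existing rows as inserts so both DataFrames share the same columns
existing = existing.withColumn("op", F.lit("I"))
combined = existing.unionByName(cdc)

# Keep only the most recent record for each primary key
latest = Window.partitionBy("customer_id").orderBy(F.col("last_updated").desc())
current_view = (combined
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .filter("op != 'D'")  # Drop keys whose latest operation was a delete
    .drop("rn", "op"))

current_view.write.mode("overwrite").parquet("s3://my-clean-zone/customers-current/")

Open table formats such as Apache Iceberg and Apache Hudi, covered later in this book, handle much of this merge logic for you.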

Hands-on – joining datasets with AWS Glue Studio

For our hands-on exercise in this chapter, we are going to use AWS Glue Studio to create an Apache Spark job that joins streaming data with data we migrated from our MySQL database in the previous chapter.

Creating a new data lake zone – the curated zone

As discussed in Chapter 2, Data Management Architectures for Analytics, it is common to have multiple zones in a data lake, containing different copies of our data as it gets transformed. So far, we have ingested raw data into the landing zone, converted some of those datasets into Parquet format, and written the files out to the clean zone. In this chapter, we will be joining multiple datasets together and will write out the new dataset to the curated zone of our data lake. The curated zone is intended to store data that has been transformed and is ready for consumption by data consumers. We created an Amazon S3 bucket for the curated zone in a previous...
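Behind its visual interface, AWS Glue Studio generates a PySpark script. Conceptually, the join we will build resembles the following sketch (the paths, dataset names, and join key are placeholder assumptions, not the exact script produced in this exercise):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-to-curated").getOrCreate()

# Load the two cleaned datasets (paths are placeholders)
streaming_events = spark.read.parquet("s3://my-clean-zone/streaming-events/")
mysql_data = spark.read.parquet("s3://my-clean-zone/mysql-migrated/")

# Enrich each streaming event with attributes from the migrated database data
curated = streaming_events.join(mysql_data, on="film_id", how="left")

# Write the joined dataset to the curated zone, ready for data consumers
curated.write.mode("overwrite").parquet("s3://my-curated-zone/streaming-enriched/")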

Summary

In this chapter, we reviewed a number of common transformations that can be applied to raw datasets, covering both generic transformations used to optimize data for analytics and business transformations used to enrich and denormalize datasets.

This chapter built on previous chapters in this book. We started by looking at how to architect a data pipeline, then reviewed ways to ingest different data types into a data lake, and in this chapter, we reviewed common data transformations.

In the next chapter, we will look at common types of data consumers and learn more about how they want to access data in different ways and with different tools.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask the author questions, and learn about new releases – use the following link:

https://discord.gg/9s5mHNyECd
