
Transforming Data to Optimize for Analytics

In previous chapters, we covered how to architect a data pipeline and common ways of ingesting data into a data lake. We now turn to the process of transforming raw data to optimize it for analytics, enabling an organization to efficiently gain new insights from its data.

Transforming data to optimize it for analytics and create value for an organization is one of the key tasks of a data engineer, and there are many different types of transformations. Some transformations are common and can be applied generically to a dataset, such as converting raw files to Parquet format and partitioning the dataset. Other transformations embed business logic and vary based on the contents of the data and the specific business requirements.
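As a minimal sketch of this kind of generic transformation (the bucket paths and the order_date column are illustrative assumptions, not taken from this chapter's exercises), the following PySpark snippet converts raw CSV files to Parquet and partitions the output by year and month:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read raw CSV files from the landing zone (bucket and path are placeholders)
df = spark.read.csv("s3://my-landing-zone/sales/", header=True, inferSchema=True)

# Derive partition columns from a timestamp column assumed to exist in the data
df = (df
      .withColumn("year", F.year("order_date"))
      .withColumn("month", F.month("order_date")))

# Write the data out as Parquet, partitioned so queries can prune by date
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("s3://my-clean-zone/sales/"))

Partitioning by columns that frequently appear in query filters means that engines such as Amazon Athena can skip reading irrelevant files entirely.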

In this chapter, we review some of the engines that are available in AWS for performing data transformations, and we also discuss some of the more common transformations...

Technical requirements

For the hands-on tasks in this chapter, you need access to the AWS Glue service, including AWS Glue Studio. You also need to be able to create a new S3 bucket and new IAM policies.
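If you prefer to script that setup rather than use the console, a new S3 bucket can be created with a short boto3 call (the bucket name and region below are placeholders; bucket names must be globally unique):

import boto3

# Region and bucket name are examples only; choose your own unique name
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-datalake-curated-zone-example")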

You can find the code files for this chapter in the GitHub repository at the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter07

Overview of how transformations can create value

As we have discussed in various places throughout this book, data can be one of the most valuable assets that an organization owns. However, raw, siloed data has limited value on its own, and we unlock the real value of an organization’s data when we combine various raw datasets and transform that data through an analytics pipeline.

Cooking, baking, and data transformations

Look at the following list of food items and consider whether you enjoy eating them:

  • Sugar
  • Butter
  • Eggs
  • Milk

For many people, these are pretty standard food items, and some (like the eggs and milk) may be consumed on their own, while others (like the sugar and the butter) are generally consumed with something else, such as adding sugar to your coffee or tea or spreading butter on bread.

But, if you take those items and add a few more (like flour and baking powder) and combine all the items in just the right...

Types of data transformation tools

As we covered in Chapter 3, The AWS Data Engineer’s Toolkit, there are a number of AWS services that can be used for data transformation. We reviewed several of these services in that chapter, so it is worth revisiting, but in this section, we look more broadly at the different types of data transformation engines.

Apache Spark

Apache Spark is an in-memory engine for working with large datasets, providing a mechanism to split a dataset across multiple nodes in a cluster for efficient processing. Spark is an extremely popular engine for processing and transforming big datasets, and there are multiple ways to run Spark jobs within AWS.

With Apache Spark, you can either process data in batches (such as on a daily basis or every few hours) or process near real-time streaming data using Spark Streaming. In addition, you can use Spark SQL to process data using standard SQL, and Spark ML for applying machine learning...
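As a brief, hypothetical illustration of the Spark SQL interface (the dataset path, view name, and column names are assumptions made for the example), a DataFrame can be registered as a temporary view and queried with standard SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load a Parquet dataset (path is a placeholder) and register it as a view
orders = spark.read.parquet("s3://my-clean-zone/orders/")
orders.createOrReplaceTempView("orders")

# Aggregate the data using standard SQL
daily_totals = spark.sql("""
    SELECT order_date, SUM(order_total) AS total_sales
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_totals.show()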

Common data preparation transformations

The first set of transformations that we look at are those that help prepare the data for further transformations later in the pipeline. These transformations are designed to apply relatively generic optimizations to individual datasets that we are ingesting into the data lake. For these optimizations, you may need some understanding of the source data system and context, but, generally, you do not need to understand the ultimate business use case for the dataset.

Protecting PII data

Often, the datasets that we ingest may contain personally identifiable information (PII), and there may be governance restrictions on which PII can be stored in the data lake. As a result, we need a process that protects PII as soon as possible after it is ingested.

There are a number of common approaches that can be used here (such as tokenization or hashing), each with its own advantages and disadvantages, as we discussed in...
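To make the hashing approach concrete, here is a minimal PySpark sketch that replaces an email column with a salted SHA-256 hash before the data is written downstream (the column name and inline salt are illustrative; in practice, you would retrieve the salt from a service such as AWS Secrets Manager):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii-hashing").getOrCreate()

SALT = "example-salt"  # Illustrative only; never hardcode a real salt

# Replace the raw email value with a salted SHA-256 hash
customers = spark.read.parquet("s3://my-landing-zone/customers/")
customers = customers.withColumn(
    "email",
    F.sha2(F.concat(F.lit(SALT), F.col("email")), 256)
)
customers.write.mode("overwrite").parquet("s3://my-clean-zone/customers/")

Note that, unlike tokenization, hashing is one-way: the original email address cannot be recovered from the hash, although the same input always produces the same hash, so records can still be joined on the hashed value.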

Common business use case transformations

In a data lake environment, you generally ingest data from many different source systems into a landing, or raw, zone. You then optimize the file format and partition the dataset, as well as applying cleansing rules to the data, potentially now storing the data in a different zone, often referred to as the clean zone. At this point, you may also apply updates to the dataset with CDC-type data and create the latest view of the data, which we examine in the next section.

The initial transformations we covered in the previous section could be completed without needing to understand much about how the business will ultimately use the data. At that point, we were still working on individual datasets that downstream transformation pipelines will use to prepare the data for business analytics.

But at some point, you, or another data engineer working for a line of business, are going to need to use a variety of...

Working with Change Data Capture (CDC) data

One of the most challenging aspects of working in a data lake environment is processing updates to existing data, such as Change Data Capture (CDC) data. We have discussed CDC data previously, but as a reminder, this is data that contains updates to an existing dataset.

A good example of this is data that comes from a relational database system. After the initial loading of data into the data lake is complete, a system (such as AWS Database Migration Service, or AWS DMS) can read the database transaction logs and write all future database updates to Amazon S3. For each row written to Amazon S3, the first column of the CDC file contains one of the following characters (see the section on AWS DMS in Chapter 3, The AWS Data Engineer’s Toolkit, for an example of a CDC file generated by AWS DMS):

  • I – Insert: This indicates that this row contains data that was newly inserted into the table
  • U – Update: This indicates...
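A common pattern for applying a CDC file like this to an existing dataset is to union the new records with the current data, keep only the most recent record for each primary key, and drop any key whose latest operation was a delete. The following PySpark sketch assumes hypothetical column names (customer_id as the primary key, op as the I/U/D flag, and a last_updated timestamp present in both datasets):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("apply-cdc").getOrCreate()

existing = spark.read.parquet("s3://my-clean-zone/customers/")
cdc = spark.read.csv("s3://my-landing-zone/cdc/customers/",
                     header=True, inferSchema=True)

# Tag existing rows as inserts so both DataFrames share the same columns
existing = existing.withColumn("op", F.lit("I"))
combined = existing.unionByName(cdc)

# Keep only the most recent record for each primary key
latest = Window.partitionBy("customer_id").orderBy(F.col("last_updated").desc())
current_view = (combined
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .filter("op != 'D'")  # Drop keys whose latest operation was a delete
    .drop("rn", "op"))

current_view.write.mode("overwrite").parquet("s3://my-clean-zone/customers-current/")

Open table formats such as Apache Iceberg and Apache Hudi, covered later in this book, handle much of this merge logic for you.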

Hands-on – joining datasets with AWS Glue Studio

For our hands-on exercise in this chapter, we are going to use AWS Glue Studio to create an Apache Spark job that joins streaming data with data we migrated from our MySQL database in the previous chapter.

Creating a new data lake zone – the curated zone

As discussed in Chapter 2, Data Management Architectures for Analytics, it is common to have multiple zones in a data lake, containing different copies of our data as it gets transformed. So far, we have ingested raw data into the landing zone, converted some of those datasets into Parquet format, and written the files out to the clean zone. In this chapter, we will be joining multiple datasets together and will write out the new dataset to the curated zone of our data lake. The curated zone is intended to store data that has been transformed and is ready for consumption by data consumers. We created an Amazon S3 bucket for the curated zone in a previous...
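Behind its visual interface, AWS Glue Studio generates a PySpark script. Conceptually, the join we will build resembles the following sketch (the paths, dataset names, and join key are placeholder assumptions, not the exact script produced in this exercise):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-to-curated").getOrCreate()

# Load the two cleaned datasets (paths are placeholders)
streaming_events = spark.read.parquet("s3://my-clean-zone/streaming-events/")
mysql_data = spark.read.parquet("s3://my-clean-zone/mysql-migrated/")

# Enrich each streaming event with attributes from the migrated database data
curated = streaming_events.join(mysql_data, on="film_id", how="left")

# Write the joined dataset to the curated zone, ready for data consumers
curated.write.mode("overwrite").parquet("s3://my-curated-zone/streaming-enriched/")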

Summary

In this chapter, we reviewed a number of common transformations that can be applied to raw datasets, covering both generic transformations used to optimize data for analytics and business transformations used to enrich and denormalize datasets.

This chapter built on previous chapters in this book. We started by looking at how to architect a data pipeline, then reviewed ways to ingest different data types into a data lake, and in this chapter, we reviewed common data transformations.

In the next chapter, we will look at common types of data consumers and learn more about how they want to access data in different ways and with different tools.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask the author questions, and learn about new releases – use the following link:

https://discord.gg/9s5mHNyECd
