Orchestrating the Data Pipeline

Throughout this book, we have discussed various services that data engineers can use to ingest and transform data, as well as make it available to consumers. We looked at how we could ingest data via Amazon Kinesis Data Firehose and AWS Database Migration Service (DMS), and how we could run AWS Lambda functions and AWS Glue jobs to transform our data. We also discussed the importance of updating a data catalog as new datasets are added to a data lake, and how we can load subsets of data into a data mart or data warehouse for specific use cases.

For the hands-on exercises, we made use of various services, but for the most part, we triggered these services manually. However, in a real production environment, it would not be acceptable to have to manually trigger these tasks, so we need a way to automate various data engineering tasks. This is where data pipeline orchestration tools come in.

Modern-day ETL applications are designed with a modular...

Technical requirements

To complete the hands-on exercises in this chapter, you will need an AWS account where you have access to a user with administrator privileges (as covered in Chapter 1, An Introduction to Data Engineering). We will make use of various AWS services, including AWS Lambda, AWS Step Functions, and Amazon Simple Notification Service (SNS).

You can find the code files for this chapter in the GitHub repository at the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter10
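
Before you begin, it can be useful to confirm which AWS identity your credentials resolve to. The following is a minimal sketch using boto3 (it is not part of the chapter's code files); it simply calls AWS STS and prints the account and caller ARN, without creating or modifying any resources:

    import boto3

    # Confirm which AWS identity the current credentials resolve to.
    # This is a read-only call to AWS STS; it changes nothing in the account.
    sts = boto3.client("sts")
    identity = sts.get_caller_identity()

    print(f"Account: {identity['Account']}")
    print(f"Caller ARN: {identity['Arn']}")

If the ARN printed is not the administrator user you set up in Chapter 1, reconfigure your credentials before continuing.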

Understanding the core concepts for pipeline orchestration

In Chapter 5, Architecting Data Engineering Pipelines, we designed a high-level architecture for a data pipeline. We examined potential data sources, discussed the types of data transformations that may be required, and looked at how we could make transformed data available to our data consumers.

In the subsequent chapters, we then examined data ingestion, data transformation, and loading transformed data into data marts in more detail. As we discussed previously, these steps are often referred to as an Extract, Transform, Load (ETL) process.

We have now come to the part where we need to combine the individual steps involved in our ETL processes to operationalize and automate how we process data. But before we look more closely at the AWS services that enable this, let's examine some of the key concepts around pipeline orchestration.

What is a data pipeline, and how do you orchestrate it?

A simple...

Examining the options for orchestrating pipelines in AWS

As you will have noticed throughout this book, AWS offers many different building blocks for architecting solutions. When it comes to pipeline orchestration, AWS provides native serverless orchestration engines with AWS Data Pipeline and AWS Step Functions, a managed open-source project with Amazon Managed Workflows for Apache Airflow (MWAA), and service-specific orchestration with AWS Glue workflows.

There are pros and cons to each of these solutions, depending on your use case. When deciding which one to use, there are multiple factors to consider, such as the level of management effort, the ease of integration with your target ETL engine, logging, error-handling mechanisms, cost, and platform independence.

In this section, we’ll examine each of the four pipeline orchestration options.

AWS Data Pipeline (now in maintenance mode)

AWS Data Pipeline is one of the oldest services that AWS has for creating...

Hands-on – orchestrating a data pipeline using AWS Step Functions

In this section, we will get hands-on with the AWS Step Functions service, which can be used to orchestrate data pipelines. The pipeline we’re going to orchestrate is relatively simple, but Step Functions can also be used to orchestrate far more complex pipelines with many steps. To keep things simple, we will only use Lambda functions to process our data, but you could replace Lambda functions with Glue jobs in production pipelines that need to process large amounts of data.

For our Step Functions state machine, let's start by running a Lambda function that checks the extension of an incoming file to determine the file type. Once determined, we'll pass that information on to the next state, which is a Choice state. If it is a file type we support, we'll call a Lambda function to process the file, but if it's not, we'll send out a notification, indicating that...
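
As a rough sketch of that first state, the following Lambda handler inspects the extension of the incoming object and returns the result for the Choice state to branch on. Note that this is a minimal illustration, not the exact code from the chapter's repository: the event shape (a bucket and key passed in the state machine input), the set of supported extensions, and the output field names are all assumptions made here for the example:

    import os

    # Hypothetical set of extensions this pipeline knows how to process.
    SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}

    def lambda_handler(event, context):
        # Assumes the state machine input looks like:
        #   {"bucket": "my-bucket", "key": "landing/orders.csv"}
        key = event["key"]
        extension = os.path.splitext(key)[1].lower()

        # Pass everything the downstream states need in the output.
        return {
            "bucket": event["bucket"],
            "key": key,
            "fileType": extension.lstrip("."),
            "isSupported": extension in SUPPORTED_EXTENSIONS,
        }

The Choice state can then test a field from this output (for example, a BooleanEquals rule on $.isSupported in this sketch) to decide whether to invoke the processing Lambda function or to publish a failure notification to the SNS topic.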

Summary

In this chapter, we looked at a critical part of a data engineer's job: designing and orchestrating data pipelines. First, we examined some of the core concepts around data pipelines, such as scheduled and event-based pipelines, and how to handle failures and retries.

We then looked at four different AWS services that can be used for creating and orchestrating data pipelines. This included AWS Data Pipeline (now in maintenance mode), AWS Glue workflows, Amazon MWAA, and AWS Step Functions.

In the hands-on section of this chapter, we then built an event-driven pipeline. We used two AWS Lambda functions for processing, and an Amazon SNS topic for sending out notifications about failures. We put these pieces of our data pipeline together into a state machine orchestrated by AWS Step Functions, and also looked at how to handle errors.
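
For reference, error handling in a Step Functions state machine is typically declared on the Task state itself. The sketch below shows what a Retry and Catch configuration might look like, expressed as a Python dictionary that can be serialized to Amazon States Language with json.dumps. The state names, the function name, and the specific errors retried are illustrative assumptions, not the exact definition used in the hands-on exercise:

    import json

    # Sketch of a Task state that retries transient Lambda errors and
    # routes any remaining failure to a notification state.
    # "ProcessFile", "NotifyFailure", "Success", and the function name
    # are hypothetical names for this example.
    process_file_state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName": "ProcessFileFunction",
            "Payload.$": "$",
        },
        "Retry": [
            {
                # Retry transient Lambda errors with exponential backoff.
                "ErrorEquals": [
                    "Lambda.ServiceException",
                    "Lambda.TooManyRequestsException",
                ],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [
            {
                # Any unhandled error moves to the state that publishes to SNS.
                "ErrorEquals": ["States.ALL"],
                "Next": "NotifyFailure",
            }
        ],
        "Next": "Success",
    }

    print(json.dumps(process_file_state, indent=2))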

So far, we have looked at how to design the high-level architecture for a data pipeline and examined services for...
