You're reading from  Serverless Analytics with Amazon Athena

Product type: Book
Published in: Nov 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800562349
Edition: 1st Edition
Authors (3):
Anthony Virtuoso

Anthony Virtuoso works as a Principal Engineer at Amazon and holds multiple patents in distributed systems, software defined networks, and security. In his eight years at Amazon, he has helped launch several Amazon Web Services, the most recent of which was Amazon Managed Blockchain. As one of the original authors of Athena Query Federation, you'll often find him lurking on the Athena Federation GitHub repository answering questions and shipping bug fixes. When not at work, Anthony obsesses over a different set of customers, namely his wife and two little boys, aged 2 and 5. His kids enjoy doing science experiments with dad, like 3D printing toys, building with Lego, or searching the local pond for tardigrades.

Mert Turkay Hocanin

Mert Turkay Hocanin is a Principal Big Data Architect at Amazon Web Services within the AWS Glue and AWS Lake Formation services and has previously worked for several other services, including Amazon Athena, Amazon EMR, and Amazon Managed Blockchain. During his time at AWS, he worked with several Fortune 500 companies on some of the largest data lakes in the world and was involved with the launching of three Amazon Web Services. Prior to being a Big Data Architect, he was a Senior Software Developer within Amazon's retail systems organization, building one of the earliest data lakes in the company in 2013. When he is not helping customers build data lakes, he enjoys spending time with his wife, Subrina, and son, Tristan, and exploring New York City.

Aaron Wishnick

Aaron Wishnick works as a Senior Software Engineer at Amazon, where he has been for 7 years. During that time, he has worked on Amazon's payment systems and financial intelligence systems, as well as on Athena and AWS Proton for AWS. When not at work, Aaron and his fiancée, Alyssa, are on a quest to determine just how much dog fur is too much, with their husky and malamute, Mina and Wally.


Chapter 9: Serverless ETL Pipelines

In the previous chapter, you learned how to tame unstructured or loosely structured data using Athena to manipulate logs, JavaScript Object Notation (JSON), and other types of machine-generated data. In this chapter, we'll continue with the theme of controlling chaos by using automation to normalize newly arrived data through a process known as extract, transform, load (ETL). We start with a brief explanation of ETL, and once we've established a basic understanding of ETL processes, we will move on to best practices and common pitfalls of using Athena for ETL.

As with most of the chapters in this book, we'll then get hands-on by designing and implementing a serverless ETL pipeline. More precisely, we'll implement the serverless ETL pipeline discussed in Chapter 2, Introduction to Amazon Athena. In that chapter, we described a fictional hedge fund with a propensity for trading widely shorted meme stocks. Their equally fictional...

Technical requirements

Wherever possible, we will provide samples or instructions to guide you through the setup. However, to complete the activities in this chapter, you will need to ensure you have the following prerequisites available. Our command-line examples will be executed using Ubuntu, but most Linux flavors should work without modification, including Ubuntu on Windows Subsystem for Linux (WSL).

You will need internet access to GitHub, S3, and the Amazon Web Services (AWS) console.

You will also require a computer with the following installed:

  • Chrome, Safari, or Microsoft Edge browser
  • The AWS Command-Line Interface (CLI)

This chapter also requires you to have an AWS account and an accompanying Identity and Access Management (IAM) user (or role) with sufficient privileges to complete this chapter's activities. Throughout this book, we will provide detailed IAM policies that attempt to honor the age-old best practice of "least privilege...

Understanding the uses of ETL

In the most literal terms, ETL refers to a procedure with three conceptual phases that begin with reading data from a source system and end with a derivative of the original data being stored into a target system. In between these deceptively simple steps sits the most important facet of ETL, the transformation from the source system's semantic and physical schema to the domain model expected by the target system. In this step, we are essentially integrating source and target systems that may represent data differently.
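To make the three phases concrete, here is a minimal sketch in plain Python. It is not taken from the book: the CSV column names (`ticker`, `qty`, `px`), the target field names, and the choice of JSON lines as the target format are all assumptions made purely for illustration.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: read rows in the source system's format (CSV in this sketch)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: map the source schema onto the target's domain model.

    The source column names (ticker, qty, px) and the target field names
    are hypothetical and exist only for this example.
    """
    return [
        {
            "symbol": row["ticker"].strip().upper(),
            "quantity": int(row["qty"]),
            "price": float(row["px"]),
        }
        for row in rows
    ]


def load(records):
    """Load: serialize records in the target's format (JSON lines here)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    source = "ticker,qty,px\n gme ,10,42.5\n"
    print(load(transform(extract(source))))
```

Most of the interesting decisions live in `transform`: it is the only step that has to know how both systems represent a trade.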

Much of the academic literature on ETL points to the expansion of data warehousing concepts in the 1970s as its origin. It was a time when businesses rapidly adopted databases and found themselves with multiple data repositories, often using incompatible formats. Sounds familiar? Fast forward to today, and not much has changed aside from the date. The ability to integrate data from siloed or incompatible systems continues to be...

Deciding whether to ETL or query in place

The distinction between ETL and querying in place is blurred when using a service such as Athena. In the preceding sections, we reviewed common ETL use cases. In this section, we'll unpack the details that should go into deciding when the downsides of querying in place tilt the scale in favor of ETL. You might be curious why we've deliberately framed the choice as defaulting to querying in place. The reason is simple and comes to us courtesy of John Gall, who in 1975 theorized, "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." In many ways, querying the data in place can be viewed as the most straightforward starting point. Athena's scalability reduces the need to heavily curate your data model to match your access patterns. In Chapter...

Designing ETL queries for Athena

This section highlights workload traits and design considerations that Athena customers sometimes overlook when creating ETL pipelines. Many of the items we are about to discuss are not specific to Athena. We'll be sure to note the ones that do stem from idiosyncrasies in the way Athena works. Generally speaking, there are no differences between regular Athena queries and those intended for use in an ETL pipeline. All of the performance suggestions covered in Chapter 2, Introduction to Amazon Athena, apply, and all the same Athena features are applicable across ad hoc analytics, ETL, and other use cases.

Don't forget about performance

Since ETL is not expected to be an interactive process, it allows us to run more time-consuming operations than we might otherwise. Just because ETL is typically viewed as an offline or asynchronous process that doesn't have a human sitting at a screen waiting for a response doesn't mean you can ignore...

Using Lambda as an orchestrator

An AWS Lambda function is an ideal orchestrator for simple ETL processes that run for 15 minutes or less and can be triggered by an event stream. If the number of steps, dependencies, or runtime grows, you'll want to consider a more fully featured orchestrator, such as Amazon Managed Workflows for Apache Airflow (MWAA). Putting that aside, building your own, simpler, serverless ETL pipeline with Lambda as an orchestrator is a great way to learn what to look for in a good orchestrator.

In this section, we'll do precisely that. Imagine we work for a fictitious hedge fund that is reeling from the great meme stock uprising of early 2021. Due to recent market volatility, the firm's risk management department is requiring trading desks across the company to report their recent trades on an hourly basis. Unfortunately, each trading desk uses different specialized trading software with no common interface for data extraction. Luckily, the trading...
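As a rough illustration of what a Lambda orchestrator for Athena involves, the sketch below starts a query and polls until it finishes, staying inside the function's remaining run time. It is not the book's implementation. The database name, SQL text, and S3 output location are hypothetical placeholders, and the 10-second safety margin is an arbitrary choice. The calls `start_query_execution` and `get_query_execution` are the boto3 Athena client operations, and `get_remaining_time_in_millis` is provided by the Lambda context object.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(athena, sql, database, output_location,
                     timeout_seconds=600.0, poll_seconds=2.0, sleep=time.sleep):
    """Start one Athena query and poll until it reaches a terminal state.

    `athena` is a boto3 Athena client (or any object with the same two methods).
    Elapsed time is approximated by summing the poll intervals.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    waited = 0.0
    while waited <= timeout_seconds:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            if status["State"] != "SUCCEEDED":
                reason = status.get("StateChangeReason", "no reason given")
                raise RuntimeError(f"Query {query_id} {status['State']}: {reason}")
            return query_id
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Query {query_id} still running after {timeout_seconds}s")


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be tested without AWS access.
    import boto3

    # Leave a small margin before Lambda's own timeout (at most 15 minutes).
    budget = context.get_remaining_time_in_millis() / 1000.0 - 10.0
    query_id = run_athena_query(
        boto3.client("athena"),
        sql="SELECT 1",  # hypothetical placeholder for the real ETL statement
        database="example_database",  # hypothetical
        output_location="s3://example-bucket/athena-results/",  # hypothetical
        timeout_seconds=budget,
    )
    return {"query_execution_id": query_id}
```

Passing the client in as a parameter keeps the polling logic independent of AWS, which makes it easy to exercise with a stub. A pipeline with more steps than this is usually the point at which a dedicated orchestrator starts to pay for itself.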

Triggering ETL queries with S3 notifications

Due to its low cost, high reliability, and seemingly infinite scalability, Amazon S3 sits at the center of many cloud architectures. In 2014, this led the S3 team to add the ability to trigger events for operations on your objects. These events can be filtered by bucket, prefix, and operation type, and their possible destinations include Simple Queue Service (SQS), Simple Notification Service (SNS), and Lambda. You may also be interested to know that S3 does not charge for this feature. You'll only pay for the associated SQS, SNS, or Lambda usage for processing the events.

As we said earlier, we want our ETL process to react to the arrival of new data without the need to wait or poll. This reduces latency and increases data freshness for time-sensitive workloads such as our trade summary reports. The integration between S3 events and AWS Lambda also automatically handles re-driving failed events, simplifying our error handling...
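To show what reacting to an S3 notification can look like inside a Lambda function, here is a small sketch that pulls the bucket and key of each newly created object out of the event. It relies on the documented S3 notification record layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`). The function name, the prefix and suffix parameters, and the sample bucket and keys are hypothetical.

```python
from urllib.parse import unquote_plus


def new_objects(event, required_prefix="", required_suffix=""):
    """Return (bucket, key) pairs for objects created in an S3 notification event.

    Records for other operations (for example deletes) are skipped, as are keys
    that do not match the optional prefix and suffix.
    """
    found = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in notifications, with spaces sent as '+'.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(required_prefix) and key.endswith(required_suffix):
            found.append((bucket, key))
    return found


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "example-trades-bucket"},
                    "object": {"key": "incoming/desk+a/trades.csv"},
                },
            }
        ]
    }
    for bucket, key in new_objects(sample_event, required_prefix="incoming/"):
        print(f"new data at s3://{bucket}/{key}")
```

Filtering again inside the function is optional when the notification itself is already scoped by prefix and operation type, but it is a cheap guard if the configuration later changes.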

Summary

In this chapter, you learned about common usages of the ETL pattern, including integration, aggregation, modularization, and performance. The integration patterns offer a lowest-common-denominator approach to connecting disparate systems, even if they have no native support for integrating with each other. ETL for aggregations helps produce a single source of truth (SSOT) for getting a view of data across your estate. This is a common pattern for creating data lakes that work with services such as Athena. Modularization is an approach for using ETL to break up monolithic processes that are difficult to maintain or operationally prone to failure. Lastly, ETL for performance is a technique that moves expensive or time-consuming processing out of the live query path by either creating materialized views or running other pre-computations of anticipated workloads.

Armed with this knowledge of ETL design patterns, you reviewed key criteria for designing ETL queries for use with...

