You're reading from Serverless Analytics with Amazon Athena

Product type: Book

Published in: Nov 2021

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781800562349

Edition: 1st Edition

Languages: Python

Tools: Amazon Athena

Concepts: Data Processing

Authors (3):

Anthony Virtuoso

Mert Turkay Hocanin

Aaron Wishnick

Chapter 9: Serverless ETL Pipelines

In the previous chapter, you learned how to tame unstructured or loosely structured data using Athena to manipulate logs, JavaScript Object Notation (JSON), and other types of machine-generated data. In this chapter, we'll continue with the theme of controlling chaos by using automation to normalize newly arrived data through a process known as extract, transform, load (ETL). We start with a brief explanation of ETL, and once we've established a basic understanding of ETL processes, we will move on to best practices and common pitfalls of using Athena for ETL.

As with most of the chapters in this book, we'll then get hands-on by designing and implementing a serverless ETL pipeline. More precisely, we'll implement the serverless ETL pipeline discussed in Chapter 2, Introduction to Amazon Athena. In that chapter, we described a fictional hedge fund with a propensity for trading widely shorted meme stocks. Their equally fictional...

Technical requirements

Wherever possible, we will provide samples or instructions to guide you through the setup. However, to complete the activities in this chapter, you will need to ensure you have the following prerequisites available. Our command-line examples will be executed using Ubuntu, but most Linux flavors should work without modification, including Ubuntu on Windows Subsystem for Linux (WSL).

You will need internet access to GitHub, S3, and the Amazon Web Services (AWS) console.

You will also require a computer with the following installed:

A Chrome, Safari, or Microsoft Edge browser
The AWS Command-Line Interface (CLI)

This chapter also requires you to have an AWS account and an accompanying Identity and Access Management (IAM) user (or role) with sufficient privileges to complete this chapter's activities. Throughout this book, we will provide detailed IAM policies that attempt to honor the age-old best practice of "least privilege...

Understanding the uses of ETL

In the most literal terms, ETL refers to a procedure with three conceptual phases that begin with reading data from a source system and end with a derivative of the original data being stored into a target system. In between these deceptively simple steps sits the most important facet of ETL, the transformation from the source system's semantic and physical schema to the domain model expected by the target system. In this step, we are essentially integrating source and target systems that may represent data differently.

Much of the academic literature on ETL points to the expansion of data warehousing concepts in the 1970s as its origin. It was a time when businesses rapidly adopted databases and found themselves with multiple data repositories, often using incompatible formats. Sounds familiar? Fast forward to today, and not much has changed aside from the date. The ability to integrate data from siloed or incompatible systems continues to be...

Deciding whether to ETL or query in place

The distinction between ETL and querying in place is blurred when using a service such as Athena. In the preceding sections, we reviewed common ETL use cases. In this section, we'll unpack the details that should go into deciding when the downsides of querying in place tilt the scale in favor of ETL. You might be curious why we've deliberately framed the choice as defaulting to querying in place. The reason is simple and comes to us courtesy of John Gall, who in 1975 theorized, "A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." In many ways, querying the data in place can be viewed as the most straightforward starting point. Athena's scalability greatly reduces the need to tailor your data model to your access patterns. In Chapter...

Designing ETL queries for Athena

This section highlights workload traits and design considerations that Athena customers sometimes overlook when creating ETL pipelines. Many of the items we are about to discuss are not specific to Athena. We'll be sure to note the ones that do stem from idiosyncrasies in the way Athena works. Generally speaking, there are no differences between regular Athena queries and those intended for use in an ETL pipeline. All of the performance suggestions covered in Chapter 2, Introduction to Amazon Athena, apply, and all the same Athena features are applicable across ad hoc analytics, ETL, and other use cases.
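As a minimal sketch of what an ETL-style query might look like in Athena, the helper below assembles a CREATE TABLE AS SELECT (CTAS) statement that rewrites a source table as partitioned Parquet. This excerpt does not show the authors' actual queries, so the function name, table names, column names, and S3 location are all hypothetical placeholders. The one Athena-specific detail it relies on is that partition columns must come last in the SELECT list of a CTAS statement.

```python
def build_ctas_query(target_table, source_table, columns, partition_column, output_location):
    """Build an Athena CTAS statement that writes source_table as partitioned Parquet.

    All identifiers passed in are hypothetical examples; nothing here is
    taken from the book's own pipeline.
    """
    # Athena expects partition columns at the end of the SELECT list.
    select_list = ", ".join(list(columns) + [partition_column])
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{output_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        ")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )
```

Calling `build_ctas_query("etl_db.trades_parquet", "raw_db.trades_csv", ["symbol", "quantity", "price"], "trade_date", "s3://example-bucket/curated/trades/")` yields a statement that can be submitted like any other Athena query, which is the point made above: the ETL query itself is an ordinary query.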

Don't forget about performance

Since ETL is not expected to be an interactive process, it allows us to run more time-consuming operations than we might otherwise. Just because ETL is typically viewed as an offline or asynchronous process that doesn't have a human sitting at a screen waiting for a response doesn't mean you can ignore...

Using Lambda as an orchestrator

An AWS Lambda function is an ideal orchestrator for simple ETL processes that run for 15 minutes or less and can be triggered by an event stream. If the number of steps, dependencies, or runtime grows, you'll want to consider a more fully featured orchestrator, such as Amazon Managed Workflows for Apache Airflow (MWAA). Putting that aside, building your own simpler, serverless ETL pipeline with Lambda as an orchestrator is a great way to learn what to look for in a good orchestrator.

In this section, we'll do precisely that. Imagine we work for a fictitious hedge fund that is reeling from the great meme stock uprising of early 2021. Due to recent market volatility, the firm's risk management department is requiring trading desks across the company to report their recent trades on an hourly basis. Unfortunately, each trading desk uses different specialized trading software with no common interface for data extraction. Luckily, the trading...
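To make the idea of Lambda as an orchestrator concrete, here is a rough sketch of a handler that starts an Athena query and polls until it finishes. It is not the book's implementation. The constants `ETL_SQL`, `DATABASE`, and `OUTPUT_LOCATION` are hypothetical, as is the optional `athena` parameter, which exists only so the logic can be exercised with a stand-in client. The calls `start_query_execution` and `get_query_execution` are the boto3 Athena client methods; the default timeout is an assumption chosen to stay under Lambda's 15-minute limit.

```python
import time

# Hypothetical settings for illustration only.
ETL_SQL = "SELECT 1"
DATABASE = "trading"
OUTPUT_LOCATION = "s3://example-athena-results/etl/"

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=2.0, timeout_seconds=600.0):
    """Start a query, poll until it reaches a terminal state, return (id, state)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Athena query {query_id} is still {state}")
        time.sleep(poll_seconds)


def lambda_handler(event, context, athena=None):
    """Lambda entry point: run one ETL query and fail loudly if it does not succeed."""
    if athena is None:
        import boto3  # provided by the AWS Lambda Python runtime
        athena = boto3.client("athena")
    query_id, state = run_athena_query(athena, ETL_SQL, DATABASE, OUTPUT_LOCATION)
    if state != "SUCCEEDED":
        # Raising marks the invocation as failed so it can be retried.
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    return {"query_id": query_id, "state": state}
```

Polling inside the function keeps the sketch short, but it also means the function is billed while it waits. That trade-off is one reason longer or multi-step workflows tend to outgrow a single Lambda function.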

Triggering ETL queries with S3 notifications

Due to its low cost, high reliability, and seemingly infinite scalability, Amazon S3 is often at the center of many cloud architectures. In 2014, this led the S3 team to add the ability to trigger events for operations on your objects. These events can be filtered by bucket, prefix, and operation type, and possible destinations include Simple Queue Service (SQS), Simple Notification Service (SNS), and Lambda. You may also be interested to know that S3 does not charge for this feature. You'll only pay for the associated SQS, SNS, or Lambda usage for processing the events.
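For a sense of what a Lambda function receives from such a notification, the sketch below pulls the bucket and key out of each record in an S3 event. The function name `extract_new_objects` and the `required_prefix` parameter are hypothetical; the record layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`) follows the S3 event message structure. Object keys arrive URL-encoded in these events, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def extract_new_objects(event, required_prefix=""):
    """Return (bucket, key) pairs for newly created objects in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event (e.g. deletes).
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event payload; spaces arrive as '+'.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Defensive re-check of the prefix, in addition to any filter
        # configured on the bucket notification itself.
        if key.startswith(required_prefix):
            objects.append((bucket, key))
    return objects
```

Filtering in the notification configuration remains the cheaper option, since events that never match are never delivered; the in-code check here is only a safeguard.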

As we said earlier, we want our ETL process to react to the arrival of new data without the need to wait or poll. This reduces latency and increases data freshness for time-sensitive workloads such as our trade summary reports. The integration between S3 events and AWS Lambda also automatically handles re-driving failed events, simplifying our error handling...

Summary

In this chapter, you learned about common uses of the ETL pattern, including integration, aggregation, modularization, and performance. The integration patterns offer a lowest-common-denominator approach to connecting disparate systems, even if they have no native support for integrating with each other. ETL for aggregations helps produce a single source of truth (SSOT) for getting a view of data across your estate. This is a common pattern for creating data lakes that work with services such as Athena. Modularization is an approach for using ETL to break up monolithic processes that are difficult to maintain or operationally prone to failure. Lastly, ETL for performance is a technique that moves expensive or time-consuming processing out of the live query path by either creating materialized views or running other pre-computations of anticipated workloads.

Armed with this knowledge of ETL design patterns, you reviewed key criteria for designing ETL queries for use with...

Authors (3)

Anthony Virtuoso

Anthony Virtuoso works as a Principal Engineer at Amazon and holds multiple patents in distributed systems, software defined networks, and security. In his eight years at Amazon, he has helped launch several Amazon Web Services, the most recent of which was Amazon Managed Blockchain. As one of the original authors of Athena Query Federation, you'll often find him lurking on the Athena Federation GitHub repository answering questions and shipping bug fixes. When not at work, Anthony obsesses over a different set of customers, namely his wife and two little boys, aged 2 and 5. His kids enjoy doing science experiments with dad, like 3D printing toys, building with Lego, or searching the local pond for tardigrades.

Mert Turkay Hocanin

Mert Turkay Hocanin is a Principal Big Data Architect at Amazon Web Services within the AWS Glue and AWS Lake Formation services and has previously worked for several other services, including Amazon Athena, Amazon EMR, and Amazon Managed Blockchain. During his time at AWS, he worked with several Fortune 500 companies on some of the largest data lakes in the world and was involved with the launching of three Amazon Web Services. Prior to being a Big Data Architect, he was a Senior Software Developer within Amazon's retail systems organization, building one of the earliest data lakes in the company in 2013. When he is not helping customers build data lakes, he enjoys spending time with his wife, Subrina, and son, Tristan, and exploring New York City.

Aaron Wishnick

Aaron Wishnick works as a Senior Software Engineer at Amazon, where he has been for 7 years. During that time, he has worked on Amazon's payment systems and financial intelligence systems, as well as on Athena and AWS Proton at AWS. When not at work, Aaron and his fiancée, Alyssa, are on a quest to determine just how much dog fur is too much, with their husky and malamute, Mina and Wally.