You're reading from Serverless ETL and Analytics with AWS Glue

Product type: Book
Published in: Aug 2022
Reading level: Expert
Publisher: Packt
ISBN-13: 9781800564985
Edition: 1st Edition
Authors (6):

Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and his Master of Information Technology degree specializing in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and DataStage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers build large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies such as Apache Spark, Hadoop, and Hive.

Chapter 10: Data Pipeline Management

Our data comes in many forms, such as IoT device logs, user logs, web server logs, and business reports. This data is generally stored in multiple data sources, such as relational databases, NoSQL databases, data warehouses, and data lakes, depending on your applications, business needs, and rules. In this situation, you often need aggregated results from that data for user analysis, cost reports, and building machine learning models. To obtain these results, you may need to implement data processing flows that read data from multiple data sources by using a programming language, SQL, and so on. We usually call these flows data pipelines.

A typical pipeline flow consists of extracting data from data sources, transforming the data on computing engines, and loading the data into other data stores. This kind of pipeline is called an extract, transform, and load (ETL) pipeline, and it is used in many cases. Additionally...
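As a minimal illustration of such an ETL flow, the following sketch shows a Glue PySpark job that reads a table from the Glue Data Catalog, applies a simple transformation, and writes the result to S3. The database name, table name, column names, and S3 path are hypothetical placeholders, not values used in this chapter.

# A minimal AWS Glue ETL job sketch (PySpark). The database, table, columns,
# and S3 path are hypothetical placeholders; adjust them to your environment.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table registered in the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Transform: keep only the columns needed by the downstream report.
transformed = source.select_fields(["order_id", "customer_id", "amount"])

# Load: write the result to S3 as Parquet for analysis.
glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/curated/orders/"},
    format="parquet",
)

job.commit()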

Technical requirements

For this chapter, if you wish to follow some of the walkthroughs, you will require the following:

  • Internet access to GitHub, S3, and the AWS Management Console (specifically the consoles for AWS Glue, AWS Step Functions, Amazon Managed Workflows for Apache Airflow, AWS CloudFormation, and Amazon S3)
  • A computer with Chrome, Firefox, Safari, or Microsoft Edge installed, and the AWS Command Line Interface (AWS CLI)

Note

You can use either AWS CLI version 1 or AWS CLI version 2. In this chapter, we have used AWS CLI version 1. You can set up either version by going to https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

You will also need an AWS account and an accompanying IAM user (or IAM role) with sufficient privileges to complete this chapter's activities. We recommend using a minimally scoped IAM policy to avoid unnecessary usage and operational mistakes. You can find...
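If you want to confirm which account and IAM principal your configured credentials resolve to before running the walkthroughs, a quick check like the following can help. This is a small boto3 sketch, not part of the chapter's walkthroughs.

# Sanity check of the AWS credentials configured for the SDK/CLI.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()

print("Account:", identity["Account"])
print("IAM principal ARN:", identity["Arn"])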

What are data pipelines?

We generally use the word pipeline for a set of elements that are connected in a process, such as oil pipelines, gas pipelines, marketing pipelines, and so on. Whatever is put into a pipeline moves along defined routes and comes out the other end as output.

In computing, a data pipeline (or simply a pipeline) is a set of data processing elements that are connected in series. Through a data pipeline, data is moved and transformed from various sources into destinations, depending on your implementation. A data pipeline usually consists of multiple tasks, such as data extraction, processing, validation, ingestion, pre-processing for machine learning use, and so on. Typical inputs of a data pipeline are application logs, server logs, IoT device data, user data, and so on; typical outputs are analysis reports or a dataset for machine learning. The following diagram shows...
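To make the idea of connected tasks concrete, here is a purely illustrative Python sketch of a pipeline as a series of tasks whose output feeds the next task's input. The record format, validation rule, and file paths are hypothetical placeholders.

# A purely illustrative sketch of a data pipeline as connected tasks.
# The record format, validation rule, and file paths are hypothetical.
import csv
import json


def extract(path):
    # Extraction: read raw records from a CSV file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def validate(records):
    # Validation: drop records that are missing a user ID.
    return [r for r in records if r.get("user_id")]


def transform(records):
    # Processing: keep only the fields needed downstream.
    return [{"user_id": r["user_id"], "event": r["event"]} for r in records]


def load(records, path):
    # Ingestion: write the processed records to a JSON file.
    with open(path, "w") as f:
        json.dump(records, f)


def run_pipeline(source_path, destination_path):
    # Each task's output is the next task's input.
    load(transform(validate(extract(source_path))), destination_path)


if __name__ == "__main__":
    run_pipeline("raw_events.csv", "processed_events.json")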

Selecting the appropriate data processing services for your analysis

One of the most important steps in using data processing pipelines is selecting the data processing services that meet the requirements for your data. In particular, you need to pay attention to the following:

  • Whether your computing engine can process the data within the time you can allow
  • Whether your computing engine can process all your data without any errors
  • Whether you can easily implement data processing
  • Whether your computing engine's resources can easily be scaled as the amount of data increases (for example, whether you can scale it without making any changes to your code)

For example, if your data processing service has less memory capacity than your data requires, what happens to your job? Insufficient memory can cause out-of-memory (OOM) issues in your processing jobs and lead to job failures. Even if you can process the data with that small...
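As one hedged illustration of scaling compute without touching job code, the following boto3 sketch starts an existing Glue Spark job with a larger worker fleet by overriding the worker type and worker count at run time. The job name is a hypothetical placeholder.

# Scale a Glue Spark job's compute at run time without changing its code,
# by overriding the worker type and worker count. The job name is a
# hypothetical placeholder.
import boto3

glue = boto3.client("glue")

response = glue.start_job_run(
    JobName="extract-orders-job",
    WorkerType="G.2X",       # larger workers to reduce the risk of OOM
    NumberOfWorkers=20,      # more workers as the data volume grows
)

print("Started job run:", response["JobRunId"])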

Orchestrating your pipelines with workflow tools

After selecting the data processing services for your data, you must build data processing pipelines using these services. For example, you can build a pipeline similar to the one shown in the following diagram. In this pipeline, four Glue Spark jobs extract the data from four databases, and each job writes its data to S3. A final Glue Spark job then processes the four tables' data stored in S3 and generates an analytic report:

Figure 10.4 – A pipeline that extracts data from four databases, stores the data in S3, and generates an analytic report with the aggregation job

So, after building a pipeline, how do you run each job? You can manually run the multiple extraction jobs against the databases and, once they have finished, run the job that generates the report. However, this can cause problems. One such problem is not getting a correct result if you run the report-generating job before all the extraction jobs...
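One way to avoid starting the aggregation job too early is to let a workflow tool enforce the dependency. The following hedged boto3 sketch builds an AWS Glue workflow in which a conditional trigger starts the report job only after two extraction jobs (two rather than four, for brevity) have succeeded. All workflow, trigger, and job names are hypothetical placeholders, and the Glue jobs are assumed to already exist.

# Build a Glue workflow whose conditional trigger starts the aggregation job
# only after the extraction jobs succeed. Names are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="daily-report-workflow")

# An on-demand trigger that kicks off the extraction jobs.
glue.create_trigger(
    Name="start-extraction",
    Type="ON_DEMAND",
    WorkflowName="daily-report-workflow",
    Actions=[{"JobName": "extract-db1-job"}, {"JobName": "extract-db2-job"}],
)

# A conditional trigger that starts the report job only when
# all extraction jobs have succeeded.
glue.create_trigger(
    Name="start-report",
    Type="CONDITIONAL",
    WorkflowName="daily-report-workflow",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "extract-db1-job", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "JobName": "extract-db2-job", "State": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "generate-report-job"}],
)

# Run the whole workflow.
glue.start_workflow_run(Name="daily-report-workflow")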

Automating how you provision your pipelines with provisioning tools

In the previous section, Orchestrating your pipelines with workflow tools, you learned how to orchestrate multiple pipelines and automate how they run with one tool. Using workflow tools for multiple pipelines not only avoids human error but also helps you understand what your pipelines do.

Note that as your system grows, you will build a lot of pipelines, and then workflows to orchestrate them. Once you have many workflows, you need to consider how to manage them. If you build and deploy workflows manually, much as you might build and run pipelines manually, you may introduce bugs into some of them, for example, by specifying incorrect data sources, connecting the wrong pipeline jobs, and so on. As a result, your data and system can become corrupted, and pipeline jobs will fail because broken workflows have been deployed...
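As a hedged sketch of provisioning pipeline resources from code rather than by hand, the following example defines a minimal CloudFormation template for a single Glue job as a Python dictionary and deploys it with boto3. The stack name, job name, script location, and IAM role ARN are hypothetical placeholders, and this is only one small piece of what a real workflow template would contain.

# Provision a Glue job with CloudFormation instead of creating it by hand.
# Stack name, job name, script location, and role ARN are hypothetical.
import json

import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "ExtractOrdersJob": {
            "Type": "AWS::Glue::Job",
            "Properties": {
                "Name": "extract-orders-job",
                "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
                "GlueVersion": "3.0",
                "WorkerType": "G.1X",
                "NumberOfWorkers": 10,
                "Command": {
                    "Name": "glueetl",
                    "PythonVersion": "3",
                    "ScriptLocation": "s3://my-glue-scripts/extract_orders.py",
                },
            },
        }
    },
}

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(
    StackName="data-pipeline-stack",
    TemplateBody=json.dumps(template),
)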

Developing and maintaining your data pipelines

Finally, let’s learn how to grow and maintain data pipelines. Your requirements and demands for data are always changing based on your company’s growth, market behaviors, business matters, technological shifts, and more. To meet these requirements and demands, you need to develop and update your data pipelines in a short period. Additionally, you need to consider mechanisms for detecting problems in your data pipeline implementations, deploying pipelines safely so that you don't break them, and so on. To address these considerations, you can apply the following systems and concepts, which are based on DevOps practices, to your data pipeline development cycles:

  • Version control systems (VCSs): You can track changes, roll back code, trigger tests, and so on. Git is one of the most popular VCSs (more precisely, a distributed VCS).
  • Continuous integration (CI): This is one of the software practices for building...
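For instance, under a CI setup, every change to a pipeline's transformation code can be validated automatically by unit tests before it is deployed. The following is a small, purely illustrative sketch of the kind of test a CI service could run on each commit; the transform function and its rules are hypothetical and not taken from this chapter.

# An illustrative unit test that a CI service could run on every commit to
# catch broken transformation logic before the pipeline is deployed.
# The transform function and its rules are hypothetical.
import pytest


def transform_record(record):
    # Keep only valid records and normalize the amount to a float.
    if not record.get("user_id"):
        raise ValueError("user_id is required")
    return {"user_id": record["user_id"], "amount": float(record["amount"])}


def test_transform_record_normalizes_amount():
    assert transform_record({"user_id": "u1", "amount": "12.5"}) == {
        "user_id": "u1",
        "amount": 12.5,
    }


def test_transform_record_rejects_missing_user_id():
    with pytest.raises(ValueError):
        transform_record({"amount": "12.5"})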

Summary

In this chapter, you learned how to build, manage, and maintain data pipelines. As the first step of constructing data pipelines, you need to choose your data processing services based on your company, organization, or team; the software they support; cost; your data's schema, size, and volume; your data processing resource limits (memory and CPU); and so on.

After choosing the data processing services, you can run data pipeline flows using workflow tools. AWS Glue provides AWS Glue workflows as its workflow tool. Other tools you can use for this purpose include AWS Step Functions and Amazon Managed Workflows for Apache Airflow. We looked at each tool through examples.

Then, you learned how to automate the provisioning of workflows and data pipelines with provisioning tools such as CloudFormation and AWS Glue Blueprints.

Finally, you learned how to develop and maintain workflows and data pipelines based on CI and CD. To achieve this, AWS provides a variety of developer tools such...

Further reading

To learn more about what was covered in this chapter, take a look at the following resources:

  • Examples of provisioning Glue resources with AWS CloudFormation: https://docs.aws.amazon.com/glue/latest/dg/populate-with-cloudformation-templates.html
  • Build a serverless event-driven workflow with AWS Glue and Amazon EventBridge: https://aws.amazon.com/jp/blogs/big-data/build-a-serverless-event-driven-workflow-with-aws-glue-and-amazon-eventbridge/
  • An example of creating workflows using AWS Glue and MWAA: https://aws.amazon.com/blogs/big-data/building-complex-workflows-with-amazon-mwaa-aws-step-functions-aws-glue-and-amazon-emr/
