You're reading from Serverless ETL and Analytics with AWS Glue

Product type Book

Published in Aug 2022

Publisher Packt

ISBN-13 9781800564985

Pages 434 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Authors (6):

Vishal Pathak

Subramanya Vajiraya

Noritaka Sekiyama

Tomohiro Tanaka

Albert Quiroga

Ishan Gaur

Table of Contents (20) Chapters

Preface

Section 1 – Introduction, Concepts, and the Basics of AWS Glue

Chapter 1: Data Management – Introduction and Concepts

Chapter 2: Introduction to Important AWS Glue Features

Chapter 3: Data Ingestion

Section 2 – Data Preparation, Management, and Security

Chapter 4: Data Preparation

Chapter 5: Data Layouts

Chapter 6: Data Management

Chapter 7: Metadata Management

Chapter 8: Data Security

Chapter 9: Data Sharing

Chapter 10: Data Pipeline Management

Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases

Chapter 11: Monitoring

Chapter 12: Tuning, Debugging, and Troubleshooting

Chapter 13: Data Analysis

Chapter 14: Machine Learning Integration

Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases

Other Books You May Enjoy

Chapter 3: Data Ingestion

In the previous chapter, we discussed the fundamental concepts and inner workings of the various features/microservices that are available in AWS Glue, such as the Glue Data Catalog, connections, crawlers and classifiers, the schema registry, Glue ETL jobs, development endpoints, interactive sessions, and triggers. We also explored how AWS Glue crawlers aid in data discovery by crawling different types of data stores – Amazon S3, JDBC (Amazon RDS or on-premises databases), and DynamoDB/MongoDB/DocumentDB – to infer the schema and populate the AWS Glue Data Catalog. While discussing Glue ETL in the previous chapter, we introduced a few of the important extensions/features of Spark ETL, including GlueContext, DynamicFrame, JobBookmark, and GlueParquet. In this chapter, we will see them in action by looking at some examples.

In this chapter, we will be discussing some of the components of AWS Glue mentioned in the previous paragraph – specifically Glue...

Technical requirements

To get started with this chapter, you will need a workstation that’s running Linux, macOS, or Windows with at least 7 GB of storage and 4 GB of RAM. While the code snippets can be run directly on AWS Glue (an AWS account is required to access AWS Glue), you can still run most of the code snippets in this chapter on your workstation directly. The code snippets in this chapter are available in this book’s GitHub repository at https://github.com/PacktPublishing/Serverless-ETL-and-Analytics-with-AWS-Glue/tree/main/Chapter03.

There are several options available for setting up the Glue development environment on your workstation. Please refer to the AWS Glue documentation at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html for instructions regarding each of those options.

Now, let’s explore how we can ingest data from different types of data stores one by one.

Data ingestion from file/object stores

This is one of the most common use cases for Glue ETL, where the source data is already available in file storage or cloud-based object stores. Here, depending on the type of job being executed, the methods or libraries used to access the data store differ.

There are several file/object storage services available today – Amazon S3, HDFS, Azure Storage, Google Cloud Storage, IBM Cloud Object Storage, FTP, SFTP, and HTTP(s) to name a few. In this section, we will focus on two of the most popular file/object stores that are used with AWS Glue – Amazon S3 and HDFS.

Data ingestion from Amazon S3

Data ingestion from Amazon S3 is by far the most commonly used design pattern for ETL in AWS Glue. Most organizations already have some mechanism to move data to Amazon S3, typically by using the AWS CLI/SDKs directly, AWS Transfer Family (https://aws.amazon.com/aws-transfer-family/), or some other third-party tools.
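As a rough illustration of this pattern, the following sketch assembles the arguments for reading S3 objects into a DynamicFrame with `create_dynamic_frame.from_options`. The helper names (`build_s3_read_args`, `read_from_s3`), the default format, and the bucket path in the usage note are hypothetical and chosen only for this example. The `GlueContext` is passed in by the caller, so the argument-building part can be tried on a workstation without the Glue libraries.

```python
from typing import Any, Dict, List


def build_s3_read_args(paths: List[str],
                       file_format: str = "json",
                       recurse: bool = True,
                       transformation_ctx: str = "s3_source") -> Dict[str, Any]:
    """Assemble keyword arguments for create_dynamic_frame.from_options.

    The helper itself is hypothetical; the keys it emits ("paths",
    "recurse") are S3 connection options understood by AWS Glue.
    """
    if not paths:
        raise ValueError("at least one S3 path is required")
    for path in paths:
        if not path.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {path}")
    return {
        "connection_type": "s3",
        "connection_options": {"paths": list(paths), "recurse": recurse},
        "format": file_format,
        # A transformation_ctx lets job bookmarks track this source.
        "transformation_ctx": transformation_ctx,
    }


def read_from_s3(glue_context, paths: List[str], file_format: str = "json"):
    """Read the given S3 paths using a GlueContext supplied by the caller."""
    args = build_s3_read_args(paths, file_format)
    return glue_context.create_dynamic_frame.from_options(**args)
```

Inside a Glue job, `glue_context` would typically be created with `GlueContext(SparkContext.getOrCreate())`, and a call might look like `read_from_s3(glue_context, ["s3://example-bucket/raw/"], "csv")`, where the bucket name is a placeholder.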

If we are using...

Data ingestion from JDBC data stores

For many organizations, hydrating data lakes by ingesting data from OLTP data stores is the primary use case for ETL tools/frameworks. Typically, these ETL jobs are run periodically to keep the data lake up to date. As discussed in Chapter 1, Data Management – Introduction and Concepts, there are quite a few options available in AWS to achieve this outcome. The most popular ones are AWS DMS and AWS Glue.

Users can set up AWS DMS replication instances to capture ongoing changes from the source data store. At the time of writing, this feature supports Microsoft SQL Server, PostgreSQL, Oracle, and MySQL databases. Please refer to the AWS DMS documentation at https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html for more information on this feature.

Another option is to use AWS Glue Spark ETL to read JDBC data stores and move the data to Amazon S3 or other target data stores supported by Apache Spark. With this option...
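A minimal sketch of the Glue Spark ETL option might look like the following. It builds the arguments for a JDBC read through `create_dynamic_frame.from_options`. The helper name, the set of accepted engines, and the sample URL in the usage note are assumptions made for illustration. The optional `hashfield` and `hashpartitions` settings are Glue options for splitting a table read across parallel JDBC connections.

```python
from typing import Any, Dict, Optional

# Engines accepted by this sketch (an assumption for the example; Glue
# supports further JDBC connection types).
SUPPORTED_ENGINES = {"mysql", "postgresql", "oracle", "sqlserver"}


def build_jdbc_read_args(engine: str,
                         url: str,
                         table: str,
                         user: str,
                         password: str,
                         hashfield: Optional[str] = None,
                         hashpartitions: Optional[int] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for a JDBC read in Glue Spark ETL."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    if not url.startswith("jdbc:"):
        raise ValueError("url must be a JDBC URL")
    options: Dict[str, Any] = {
        "url": url,
        "dbtable": table,
        "user": user,
        # Pass credentials in at runtime; never hard-code them in a job script.
        "password": password,
    }
    if hashfield is not None:
        options["hashfield"] = hashfield  # column used to split the read
    if hashpartitions is not None:
        options["hashpartitions"] = str(hashpartitions)  # number of parallel reads
    return {"connection_type": engine, "connection_options": options}


def read_from_jdbc(glue_context, **kwargs):
    """Read a table using a GlueContext supplied by the caller."""
    return glue_context.create_dynamic_frame.from_options(
        **build_jdbc_read_args(**kwargs))
```

For example, `build_jdbc_read_args("mysql", "jdbc:mysql://db.example.internal:3306/sales", "orders", user, password, hashfield="order_id", hashpartitions=8)` would describe a read of a hypothetical `orders` table split into eight parallel queries.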

Data ingestion from streaming data sources

We explored fundamental concepts regarding data ingestion from streaming data sources in the previous chapter when we discussed AWS Glue Schema Registry (GSR). In this section, we will learn how to implement data ingestion from streaming data sources such as Amazon Kinesis and Apache Kafka using AWS Glue Spark ETL.

Stream processing can be defined as the act of continuously incorporating new data to compute a result wherein the input data is unbounded and has no predetermined beginning or end. Apache Spark has two components for stream processing: Spark Streaming and Structured Streaming.
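To make the definition concrete, here is a small plain-Python sketch (not Spark code) of a result that is updated as new data arrives. The function name `running_counts` and the micro-batch size are hypothetical. A finite list stands in for an unbounded stream so that the example terminates.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str],
                   batch_size: int = 2) -> Iterator[Dict[str, int]]:
    """Yield cumulative per-key counts after each micro-batch of events.

    The input is consumed lazily, so it may be an endless iterator; the
    result is never "final", only current as of the last batch seen.
    """
    totals: Counter = Counter()
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # only reached when the source is bounded
        totals.update(batch)
        yield dict(totals)


if __name__ == "__main__":
    for snapshot in running_counts(["a", "b", "a", "c", "a"]):
        print(snapshot)
```

Each yielded snapshot incorporates every event seen so far. Spark's streaming components apply the same idea across a cluster, with fault tolerance that this sketch does not attempt.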

According to the Apache Spark documentation (https://spark.apache.org/docs/3.1.1/streaming-programming-guide.html), “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.”

Spark Streaming introduces a high-level abstraction layer called...

Data ingestion from SaaS data stores

So far, we have explored ways to ingest data from file/object stores, JDBC, and streaming data sources using AWS Glue ETL. Apart from these methods, organizations can take advantage of Marketplace connectors or create their own connectors to ingest data from a data store that is not directly supported by AWS Glue ETL. This feature was added to AWS Glue as part of the Glue Studio release in December 2020.

For example, with this new capability, we can take advantage of connectors for Salesforce, SAP, and Snowflake. If a connector is not readily available in AWS Marketplace, we can build custom connectors so that we can integrate custom-built Spark connectors and Athena Federated Query connectors into our ETL jobs.
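As a hedged sketch of how such a connector might be referenced from an ETL script, the helper below builds the arguments for `create_dynamic_frame.from_options` using a connection type of the form `marketplace.<kind>` or `custom.<kind>`. The helper name, the connection name, and the table name in the usage note are hypothetical. The options a particular connector accepts beyond `connectionName` depend on that connector and should be taken from its documentation.

```python
from typing import Any, Dict, Optional

ORIGINS = {"marketplace", "custom"}
KINDS = {"jdbc", "spark", "athena"}


def build_connector_read_args(origin: str,
                              kind: str,
                              connection_name: str,
                              extra_options: Optional[Dict[str, Any]] = None,
                              transformation_ctx: str = "connector_source"
                              ) -> Dict[str, Any]:
    """Assemble read arguments for a Marketplace or custom connector."""
    if origin not in ORIGINS:
        raise ValueError(f"origin must be one of {sorted(ORIGINS)}")
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {sorted(KINDS)}")
    # connectionName points at the Glue connection created for the connector.
    options: Dict[str, Any] = {"connectionName": connection_name}
    # Connector-specific settings are passed through unchanged.
    options.update(extra_options or {})
    return {
        "connection_type": f"{origin}.{kind}",
        "connection_options": options,
        "transformation_ctx": transformation_ctx,
    }
```

For instance, `build_connector_read_args("custom", "jdbc", "my-custom-jdbc-connection", {"dbTable": "Account"})` produces arguments that could be unpacked into `glue_context.create_dynamic_frame.from_options(**args)` inside a Glue job.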

Connectors for popular data stores such as Snowflake, SAP, Salesforce, Apache Hudi, Google BigQuery, Delta Lake, Elasticsearch, and CloudWatch Logs are readily available on AWS Marketplace. Depending on the publisher of a given connector...

Summary

In this chapter, we discussed the methods and optimization features that can be used in AWS Glue ETL to ingest data from file/object stores, JDBC-compatible data stores, and streaming data sources. We also explored serialization and deserialization, which AWS GSR uses to handle evolving schemas. Then, we introduced Glue Studio Marketplace connectors, which we can use to ingest data from SaaS data stores. Finally, we briefly discussed how users can build custom JDBC/Spark/Athena Federated Query connectors to ingest data from data stores that are not directly supported by AWS Glue when no connector is readily available in AWS Marketplace.

In the next chapter, we will be discussing data preparation strategies. We'll explore different factors that can be considered while choosing the right service/tool. We will also discuss the different available options: visual data preparation versus source code-/SQL-based data preparation and the different transformation...