Serverless ETL and Analytics with AWS Glue

Product type Book
Published in Aug 2022
Publisher Packt
ISBN-13 9781800564985
Pages 434 pages
Edition 1st Edition
Authors (6): Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur

Table of Contents (20) Chapters

Preface
Section 1 – Introduction, Concepts, and the Basics of AWS Glue
Chapter 1: Data Management – Introduction and Concepts
Chapter 2: Introduction to Important AWS Glue Features
Chapter 3: Data Ingestion
Section 2 – Data Preparation, Management, and Security
Chapter 4: Data Preparation
Chapter 5: Data Layouts
Chapter 6: Data Management
Chapter 7: Metadata Management
Chapter 8: Data Security
Chapter 9: Data Sharing
Chapter 10: Data Pipeline Management
Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
Chapter 11: Monitoring
Chapter 12: Tuning, Debugging, and Troubleshooting
Chapter 13: Data Analysis
Chapter 14: Machine Learning Integration
Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
Other Books You May Enjoy

Chapter 3: Data Ingestion

In the previous chapter, we discussed the fundamental concepts and inner workings of the various features/microservices available in AWS Glue, such as the Glue Data Catalog, connections, crawlers and classifiers, the schema registry, Glue ETL jobs, development endpoints, interactive sessions, and triggers. We also explored how AWS Glue crawlers aid data discovery by crawling different types of data stores (Amazon S3, JDBC data stores such as Amazon RDS or on-premises databases, and DynamoDB/MongoDB/DocumentDB) to infer the schema and populate the AWS Glue Data Catalog. While discussing Glue ETL, we introduced a few important extensions/features of Spark ETL, including GlueContext, DynamicFrame, JobBookmark, and GlueParquet. In this chapter, we will see them in action through examples.

In this chapter, we will be discussing some of the components of AWS Glue mentioned in the previous paragraph – specifically Glue...

Technical requirements

To get started with this chapter, you will need a workstation that’s running Linux, macOS, or Windows with at least 7 GB of storage and 4 GB of RAM. While the code snippets can be run directly on AWS Glue (an AWS account is required to access AWS Glue), you can still run most of the code snippets in this chapter on your workstation directly. The code snippets in this chapter are available in this book’s GitHub repository at https://github.com/PacktPublishing/Serverless-ETL-and-Analytics-with-AWS-Glue/tree/main/Chapter03.

There are several options available for setting up the Glue development environment on your workstation. Please refer to the AWS Glue documentation at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html for instructions regarding each of those options.

Now, let’s explore how we can ingest data from different types of data stores one by one.

Data ingestion from file/object stores

This is one of the most common use cases for Glue ETL, where the source data is already available in file storage or cloud-based object stores. Here, depending on the type of job being executed, the methods or libraries used to access the data store differ.

There are several file/object storage services available today – Amazon S3, HDFS, Azure Storage, Google Cloud Storage, IBM Cloud Object Storage, FTP, SFTP, and HTTP(s) to name a few. In this section, we will focus on two of the most popular file/object stores that are used with AWS Glue – Amazon S3 and HDFS.

Data ingestion from Amazon S3

Data ingestion from Amazon S3 is by far the most commonly used design pattern for ETL in AWS Glue. Most organizations already have some mechanism to move data to Amazon S3, typically by using the AWS CLI/SDKs directly, AWS Transfer Family (https://aws.amazon.com/aws-transfer-family/), or some other third-party tools.
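
As a rough illustration of this pattern, the sketch below builds the arguments that a Glue Spark ETL script would typically pass to glueContext.create_dynamic_frame.from_options when reading from Amazon S3. The helper names (s3_read_args, read_from_s3) and the bucket path are hypothetical and not taken from the book; only the option dictionary is constructed at import time, so the snippet runs without the awsglue library installed.

```python
from typing import Any, Dict, List


def s3_read_args(paths: List[str], fmt: str = "json", recurse: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for create_dynamic_frame.from_options (S3 source)."""
    if not paths:
        raise ValueError("at least one S3 path is required")
    for path in paths:
        if not path.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {path}")
    return {
        "connection_type": "s3",
        # "paths" lists the S3 prefixes to read; "recurse" descends into sub-prefixes.
        "connection_options": {"paths": list(paths), "recurse": recurse},
        "format": fmt,
    }


def read_from_s3(glue_context, paths: List[str], fmt: str = "json"):
    """Return a DynamicFrame; only callable where a GlueContext is available."""
    return glue_context.create_dynamic_frame.from_options(**s3_read_args(paths, fmt))


# Hypothetical bucket and prefix, used purely for illustration.
args = s3_read_args(["s3://example-bucket/raw/orders/"])
print(args)
```

Inside a Glue job or a local Glue development environment, read_from_s3(glueContext, [...]) would perform the actual read; keeping the option building separate makes it easy to check without AWS access.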

If we are using...

Data ingestion from JDBC data stores

For many organizations, hydrating data lakes by ingesting data from OLTP data stores is the primary use case for ETL tools/frameworks. Typically, these ETL jobs are run periodically to keep the data lake up to date. As discussed in Chapter 1, Data Management – Introduction and Concepts, there are quite a few options available in AWS to achieve this outcome. The most popular ones are AWS DMS and AWS Glue.

Users can set up AWS DMS replication instances to capture ongoing changes from the source data store. At the time of writing, this feature supports Microsoft SQL Server, PostgreSQL, Oracle, and MySQL databases. Please refer to the AWS DMS documentation at https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html for more information on this feature.

Another option is to use AWS Glue Spark ETL to read JDBC data stores and move the data to Amazon S3 or other target data stores supported by Apache Spark. With this option...
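
A minimal sketch of the Glue Spark ETL option follows, assuming a MySQL source. It assembles the connection_options dictionary for a JDBC read, optionally adding the hashfield and hashpartitions options that Glue documents for splitting a JDBC read across parallel queries. The function names, endpoint, table, and credentials are placeholders rather than the book's own code, and in practice credentials would normally come from a Glue connection or a secrets store instead of literals.

```python
from typing import Dict, Optional


def jdbc_read_options(
    url: str,
    table: str,
    user: str,
    password: str,
    hash_field: Optional[str] = None,
    hash_partitions: Optional[int] = None,
) -> Dict[str, str]:
    """Build connection_options for a Glue JDBC read."""
    if not url.startswith("jdbc:"):
        raise ValueError(f"not a JDBC URL: {url}")
    options = {"url": url, "dbtable": table, "user": user, "password": password}
    if hash_field is not None:
        # Column used to split the table into parallel reads.
        options["hashfield"] = hash_field
    if hash_partitions is not None:
        if hash_partitions < 1:
            raise ValueError("hash_partitions must be at least 1")
        options["hashpartitions"] = str(hash_partitions)
    return options


def read_from_mysql(glue_context, options: Dict[str, str]):
    """Return a DynamicFrame; only callable where a GlueContext is available."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="mysql", connection_options=options
    )


# Placeholder endpoint, table, and credentials for illustration only.
opts = jdbc_read_options(
    "jdbc:mysql://db.example.internal:3306/sales",
    "orders",
    "etl_user",
    "<password>",
    hash_field="order_id",
    hash_partitions=8,
)
print(opts)
```

The resulting DynamicFrame could then be written to Amazon S3 with the matching write call on the same GlueContext.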

Data ingestion from streaming data sources

We explored fundamental concepts regarding data ingestion from streaming data sources in the previous chapter when we discussed AWS Glue Schema Registry (GSR). In this section, we will learn how to implement data ingestion from streaming data sources such as Amazon Kinesis and Apache Kafka using AWS Glue Spark ETL.
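
To give a feel for what such a job looks like, here is a hedged sketch for an Amazon Kinesis source. It builds the source options and the micro-batch options, then wires them into GlueContext's create_data_frame.from_options and forEachBatch calls. The stream ARN, checkpoint location, and helper names are assumptions made for illustration; the run_stream function is defined but not invoked here, since it needs a real GlueContext.

```python
from typing import Callable, Dict


def kinesis_source_options(stream_arn: str, starting_position: str = "TRIM_HORIZON") -> Dict[str, str]:
    """Build connection_options for a Kinesis streaming read."""
    if not stream_arn.startswith("arn:aws:kinesis:"):
        raise ValueError(f"not a Kinesis stream ARN: {stream_arn}")
    return {
        "streamARN": stream_arn,
        "startingPosition": starting_position,
        "inferSchema": "true",       # infer the schema from the incoming records
        "classification": "json",    # records are assumed to be JSON
    }


def batch_options(checkpoint_location: str, window_size: str = "100 seconds") -> Dict[str, str]:
    """Options for forEachBatch: micro-batch window and checkpoint path."""
    return {"windowSize": window_size, "checkpointLocation": checkpoint_location}


def run_stream(glue_context, stream_arn: str, checkpoint_location: str, process_batch: Callable) -> None:
    """Start the streaming read; process_batch receives (data_frame, batch_id)."""
    frame = glue_context.create_data_frame.from_options(
        connection_type="kinesis",
        connection_options=kinesis_source_options(stream_arn),
    )
    glue_context.forEachBatch(
        frame=frame,
        batch_function=process_batch,
        options=batch_options(checkpoint_location),
    )


# Hypothetical stream ARN and checkpoint path.
src = kinesis_source_options("arn:aws:kinesis:us-east-1:111122223333:stream/example-stream")
ckpt = batch_options("s3://example-bucket/checkpoints/orders/")
print(src, ckpt)
```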

Stream processing can be defined as the act of continuously incorporating new data to compute a result wherein the input data is unbounded and has no predetermined beginning or end. Apache Spark has two components for stream processing: Spark Streaming and Structured Streaming.
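
The definition above can be made concrete with a small, Spark-free sketch in plain Python. An unbounded generator stands in for the input stream, the input is cut into fixed-size micro-batches, and a running total is updated after each batch. The source (sensor_readings) and the batch size are invented for the illustration and do not correspond to any Spark API.

```python
import itertools
from typing import Iterable, Iterator, List


def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source: yields 1, 2, 3, ... with no end."""
    return itertools.count(1)


def micro_batches(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) source into lists of at most `size` items."""
    it = iter(source)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return  # only reached if the source is finite and exhausted
        yield batch


def running_totals(source: Iterable[int], size: int) -> Iterator[int]:
    """Continuously fold each new batch into the result computed so far."""
    total = 0
    for batch in micro_batches(source, size):
        total += sum(batch)
        yield total


# The stream never ends, so we only look at the first three results.
first_three = list(itertools.islice(running_totals(sensor_readings(), 3), 3))
print(first_three)  # [6, 21, 45]
```

The result is never final: each new batch refines it, which is the essential difference from a batch job over a bounded dataset.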

According to the Apache Spark documentation (https://spark.apache.org/docs/3.1.1/streaming-programming-guide.html), “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.”

Spark Streaming introduces a high-level abstraction layer called...

Data ingestion from SaaS data stores

So far, we have explored ways to ingest data from file/object stores, JDBC, and streaming data sources using AWS Glue ETL. Apart from these methods, organizations can take advantage of Marketplace connectors or create their own connectors to ingest data from a data store that is not directly supported by AWS Glue ETL. This feature was added to AWS Glue as part of the Glue Studio release in December 2020.

For example, with this new capability, we can take advantage of connectors for Salesforce, SAP, and Snowflake. If a connector is not readily available in AWS Marketplace, we can build custom connectors so that we can integrate custom-built Spark connectors and Athena Federated Query connectors into our ETL jobs.
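
As a sketch of how a connector is referenced from a Glue ETL script, the helper below composes the connection_type string (marketplace or custom, combined with spark, athena, or jdbc) and the connection_options that name the Glue connection and the connector class. The connection name and class name shown are hypothetical, and any connector-specific options would need to be taken from that connector's own documentation.

```python
from typing import Any, Dict

VALID_SOURCES = ("marketplace", "custom")
VALID_KINDS = ("spark", "athena", "jdbc")


def connector_read_args(source: str, kind: str, connection_name: str, class_name: str, **extra: str) -> Dict[str, Any]:
    """Build arguments for create_dynamic_frame.from_options with a connector."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}")
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}")
    options = {"connectionName": connection_name, "className": class_name}
    options.update(extra)  # connector-specific options, passed through unchanged
    return {"connection_type": f"{source}.{kind}", "connection_options": options}


def read_with_connector(glue_context, args: Dict[str, Any]):
    """Return a DynamicFrame; only callable where a GlueContext is available."""
    return glue_context.create_dynamic_frame.from_options(**args)


# Hypothetical connection and class names for a Marketplace Spark connector.
connector_args = connector_read_args(
    "marketplace", "spark", "example-saas-connection", "com.example.spark.DataSource"
)
print(connector_args)
```

The same shape applies to a custom connector by passing "custom" as the source.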

Connectors for popular data stores such as Snowflake, SAP, Salesforce, Apache Hudi, Google BigQuery, Delta Lake, Elasticsearch, and CloudWatch Logs are readily available on AWS Marketplace. Depending on the publisher of a given connector...

Summary

In this chapter, we discussed the methods and optimization features available in AWS Glue ETL for ingesting data from file/object stores, JDBC-compatible data stores, and streaming data stores. We also explored serialization and deserialization, which AWS GSR uses to handle evolving schemas. Then, we introduced Glue Studio Marketplace connectors, which let us ingest data from SaaS data stores. Finally, we briefly discussed how users can build custom JDBC/Spark/Athena Federated Query connectors to ingest data from data stores that AWS Glue does not support directly and for which no connector is readily available in AWS Marketplace.

In the next chapter, we will be discussing data preparation strategies. We'll explore different factors that can be considered while choosing the right service/tool. We will also discuss the different available options: visual data preparation versus source code-/SQL-based data preparation and the different transformation...
