You're reading from Serverless ETL and Analytics with AWS Glue (1st Edition, Packt, August 2022, ISBN-13: 9781800564985).
Authors (6):

Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney, specializing in AWS Glue. He obtained his Bachelor of Engineering degree in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and his Master of Information Technology degree, specializing in Internetworking, from the University of New South Wales, Sydney, Australia, in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes using AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and DataStage. He currently works at AWS as a senior big data cloud engineer and is an SME on AWS Glue. He is responsible for helping customers build large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies such as Apache Spark, Hadoop, and Hive.


Chapter 13: Data Analysis

In the previous chapter, we looked at the various categories of Glue job exception messages, why they occur, and how to handle them.

We learned about data skew, how it can adversely affect job execution, and the techniques you can use to fix it. Additionally, we looked at some of the common causes of Out-of-Memory (OOM) errors and the out-of-the-box mechanisms that AWS Glue provides to handle them. These tools and techniques help you use resources more effectively in a pay-as-you-go, cloud-native world; they not only make processing more efficient but also reduce processing time in a world that increasingly needs answers as quickly as possible.

But the question is, why put in all this effort? Why process data? This brings us to our current topic. One of the reasons for processing data is to analyze it. You might want to analyze the data to look at the larger picture or...

Creating Marketplace connections

We are going to create Marketplace connections for the Glue Hudi connector, the Glue Delta Lake connector, and the OpenSearch connector. We will be using these connectors in our code samples, and the names of these connections will be used as inputs to the CloudFormation stack.

Creating the Glue Hudi connection

Let’s begin by creating the Glue Hudi connection:

  1. Navigate to AWS Marketplace (https://aws.amazon.com/marketplace/), search for the Apache Hudi Connector for AWS Glue product, and click on Continue to Subscribe:

Figure 13.1 – Subscribe to Apache Hudi Connector for AWS Glue

  2. Click on Accept Terms:

Figure 13.2 – Accept the terms

  3. Once your request has been processed, the Continue to Configuration button will be enabled. Click on it:

Figure 13.3 – The Continue to Configuration button

  4. Select Glue 3.0 as the Fulfillment...

Creating the CloudFormation stack

First, let’s go through the prerequisites for this section.

Prerequisites for creating the CloudFormation stack

Make sure that the Amazon OpenSearch, Delta Lake, and Apache Hudi connections have been created. Also, make sure that you have an EC2 key pair; it will be used to connect to one of the EC2 instances created by the CloudFormation template.
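
If you prefer to create the key pair programmatically, here is a minimal boto3 sketch; the key name and output file name are placeholders, not values from the book's template:

    import boto3

    ec2 = boto3.client("ec2")
    # Create the key pair and save the private key locally; the name is a placeholder.
    key = ec2.create_key_pair(KeyName="chapter-data-analysis-key")
    with open("chapter-data-analysis-key.pem", "w") as pem_file:
        pem_file.write(key["KeyMaterial"])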

The CloudFormation template will also create IAM roles and policies, which are required for the jobs to function. Please review the definitions of these roles, policies, networks, and security groups, and ensure that they align with your organization’s standards. In the following sections, we will first create the stack and then create the dataset.

Creating the stack

The CloudFormation stack creates 61 resources. These resources can be found in the Resources tab of the CloudFormation stack.

Import the template in CloudFormation and enter the name of...
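
If you would rather launch the stack from code, the following is a minimal boto3 sketch. It assumes the template has been saved locally; the file name, stack name, and parameter key are placeholders and may differ from the book's template:

    import boto3

    cfn = boto3.client("cloudformation")

    # Read the downloaded template; the file name is a placeholder.
    with open("chapter13-data-analysis.yaml") as template_file:
        template_body = template_file.read()

    # The stack creates IAM roles and policies, so the IAM capability must be acknowledged.
    cfn.create_stack(
        StackName="chapter-data-analysis",
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],
        Parameters=[
            # Hypothetical parameter key; use the names defined in the book's template.
            {"ParameterKey": "KeyPairName", "ParameterValue": "chapter-data-analysis-key"},
        ],
    )

    # Wait until all of the stack's resources have been created.
    cfn.get_waiter("stack_create_complete").wait(StackName="chapter-data-analysis")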

The benefit of ad hoc analysis and how a data lake enables it

Before the advent of the data lake pattern, organizations used to offload their data into a data warehouse for analysis. This involved creating an Extract, Transform, and Load (ETL) pipeline. Creating ETL pipelines, moving the data into a warehouse, and building reports require a substantial investment of time and resources. By the time all of this is finished, the requirements have often changed because the business itself has changed over that period. Sometimes, business users discover that they did not get what they asked for and that there is a gap between the requirements and the implementation.

For example, a business user could request sales data, resulting in the IT team moving the sales data into the warehouse. However, the sales data in the warehouse might not be at the grain the business user needs, or it might not include the sales data from all the sources of sales information. All of this involves a massive...

Creating and updating Hudi tables using Glue

Apache Hudi is an open source data management tool that was initially developed by Uber. Its superpower is enabling incremental data processing in a data lake. The Apache Hudi format is supported by a wide range of tools on AWS such as AWS Glue, Amazon Redshift, Amazon Athena, and Amazon EMR.

The CloudFormation template for this chapter creates two Hudi batch jobs: 02 - Hudi Init load for Data Analysis Chapter and 03 - Hudi Incremental load for Data Analysis Chapter. Both of these jobs use the Hudi connection created in the Creating Marketplace connections section. Additionally, these jobs accept the target bucket as an input parameter, which is prepopulated by the CloudFormation template. Navigate to the job details page of the 02 - Hudi Init load for Data Analysis Chapter job (https://console.aws.amazon.com/gluestudio/home?#/editor/job/02%20-%20Hudi%20Init%20load%20for%20Data%20Analysis%20Chapter/details...
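
The job scripts themselves are provided by the CloudFormation template, but as a rough idea of what a Hudi initial load looks like in a Glue Spark job, here is a minimal PySpark sketch. The sample records, record key, precombine field, and target path are assumptions for illustration, not the job's actual values:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    spark = GlueContext(SparkContext.getOrCreate()).spark_session

    # A tiny illustrative dataset with the employee columns used in this chapter.
    employees_df = spark.createDataFrame(
        [(1, "John", "Engineering", "Sydney", 90000, "2022-01-01")],
        ["emp_no", "name", "department", "city", "salary", "last_updated"],
    )

    hudi_options = {
        "hoodie.table.name": "employees_hudi",
        "hoodie.datasource.write.recordkey.field": "emp_no",
        "hoodie.datasource.write.precombine.field": "last_updated",
        "hoodie.datasource.write.operation": "upsert",
        # Sync the table definition into the Glue Data Catalog (Hive sync).
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "chapter-data-analysis-glue-database",
        "hoodie.datasource.hive_sync.table": "employees_hudi",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
    }

    (employees_df.write.format("hudi")
        .options(**hudi_options)
        .mode("overwrite")          # the incremental job would use "append"
        .save("s3://<target_s3_bucket>/employees_hudi/"))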

Creating and updating Delta Lake tables using Glue

Delta Lake is an open source framework that was initially developed by Databricks. Similar to Hudi, Delta Lake is supported by Spark, Presto, and Hive, among many others.

We will now execute the 04 - DeltaLake Init load for Data Analysis Chapter job, which was created by the CloudFormation template executed earlier, to create a Delta Lake table:

  1. Run the Glue job: 04 - DeltaLake Init load for Data Analysis Chapter. Notice in the job script that we are using Spark SQL to create a table definition in the Glue Catalog for the Delta table. Here is the Spark SQL statement from the job's code (a minimal write-side sketch follows the snippet):
    spark.sql("CREATE TABLE `chapter-data-analysis-glue-database`.employees_deltalake (emp_no int, name string, department string, city string, salary int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe...

Inserting data into Lake Formation governed tables

Governed tables offer a number of features, such as ACID transactions, automatic data compaction for faster query response times, and time travel queries. We will now go through the process of creating Lake Formation governed tables using Glue jobs:

  1. Go to the Outputs tab of the CloudFormation stack and grab the S3 path for the LakeFormationLocationForRegistry key.
  2. Go to AWS Lake Formation (https://console.aws.amazon.com/lakeformation/home) and register the S3 location from step 1 with Lake Formation, as shown in the following screenshot (a boto3 equivalent is sketched at the end of this section):

Figure 13.24 – Registering the location

The format of this path is s3://<target_s3_bucket>/employees_governed_table/. Make sure that you register it in the same region where you created the CloudFormation stack.

Note that you should use the AWSServiceRoleForLakeFormationDataAccess role. This role has been granted access to the KMS key...
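
The registration in step 2 can also be done programmatically. The following minimal boto3 sketch assumes the placeholder bucket path shown above; UseServiceLinkedRole tells Lake Formation to use the AWSServiceRoleForLakeFormationDataAccess service-linked role mentioned previously:

    import boto3

    lakeformation = boto3.client("lakeformation")

    # Register the governed-table location with Lake Formation using the service-linked role.
    lakeformation.register_resource(
        ResourceArn="arn:aws:s3:::<target_s3_bucket>/employees_governed_table/",
        UseServiceLinkedRole=True,
    )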

Consuming streaming data using Glue

Now that we understand how Glue works in batch mode, let’s look at the process of consuming data that arrives through a stream.

The CloudFormation stack creates an Amazon Managed Streaming for Apache Kafka (MSK) cluster for this purpose. You will have to create a Glue connection for this MSK cluster, and it is important that you name this connection chapter-data-analysis-msk-connection. This connection is used in the jobs that follow, which get the Kafka broker details from it.

Creating chapter-data-analysis-msk-connection

We will execute Glue jobs to load data into an MSK topic and also consume data from that topic. Both of these jobs require broker information and other details about the MSK cluster, so we will now create an MSK connection in Glue. Please ensure that you name the connection chapter-data-analysis-msk-connection, because the Glue jobs have been preconfigured to use this name as the connection...
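
Connections are usually created in the Glue console, but as a rough sketch, the same connection could also be created with boto3 as shown below. The broker string, subnet, security group, and Availability Zone are placeholders that you would take from the MSK cluster and VPC created by the CloudFormation stack:

    import boto3

    glue = boto3.client("glue")

    glue.create_connection(
        ConnectionInput={
            "Name": "chapter-data-analysis-msk-connection",
            "ConnectionType": "KAFKA",
            "ConnectionProperties": {
                # Placeholder broker endpoints; copy them from the MSK cluster's client information.
                "KAFKA_BOOTSTRAP_SERVERS": "b-1.example.kafka.us-east-1.amazonaws.com:9092",
            },
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],
                "AvailabilityZone": "us-east-1a",
            },
        }
    )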

Glue’s integration with OpenSearch

Now, let’s focus on a search use case. Let’s say that you are interested in searching through log data; Amazon OpenSearch could be your answer. OpenSearch was originally forked from Elasticsearch and comes with a visualization technology called OpenSearch Dashboards, which was forked from Kibana. OpenSearch can work on petabytes of unstructured and semi-structured data. Additionally, it can auto-tune itself and use ML to detect anomalies in real time; Auto-Tune analyzes cluster performance over time and suggests optimizations based on your workload.

For the purpose of this chapter, we will use our employee data as the source and show how we can load the data into OpenSearch. Then, we will visualize the data in OpenSearch Dashboards.

The CloudFormation template creates a secret that stores the OpenSearch domain’s user ID and password. The Marketplace connection you created using the OpenSearch...
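
To give a rough idea of what the load step looks like, here is a minimal sketch of writing a DynamicFrame through a marketplace Spark connector. The connection name, index name, secret name, and the key names inside the secret are placeholders, and the es.* option names follow the open source elasticsearch-hadoop connector; treat all of them as assumptions rather than the book's exact configuration:

    import json
    import boto3
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Fetch the OpenSearch credentials from the secret created by the CloudFormation
    # template; the secret name and its key names are placeholders.
    secret = boto3.client("secretsmanager").get_secret_value(
        SecretId="chapter-data-analysis-opensearch-secret"
    )
    creds = json.loads(secret["SecretString"])

    # A tiny illustrative employee dataset, converted to a DynamicFrame for the write.
    employees_df = spark.createDataFrame(
        [(1, "John", "Engineering", "Sydney", 90000)],
        ["emp_no", "name", "department", "city", "salary"],
    )
    employees_dyf = DynamicFrame.fromDF(employees_df, glue_context, "employees_dyf")

    # Write through the marketplace connector; the connection name and index are placeholders.
    glue_context.write_dynamic_frame.from_options(
        frame=employees_dyf,
        connection_type="marketplace.spark",
        connection_options={
            "connectionName": "chapter-data-analysis-opensearch-connection",
            "es.resource": "employees",
            "es.nodes.wan.only": "true",
            "es.net.http.auth.user": creds["username"],
            "es.net.http.auth.pass": creds["password"],
        },
    )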

Cleaning up

Delete the CloudFormation stack and remove the registration of the S3 location in AWS Lake Formation, along with the Data locations permissions that were granted manually for the governed tables section.
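
As a convenience, the same cleanup can be scripted with boto3; the stack name and bucket path are the placeholders used earlier, and any manually granted Data locations permissions still need to be revoked separately:

    import boto3

    # Delete the chapter's CloudFormation stack (placeholder stack name).
    boto3.client("cloudformation").delete_stack(StackName="chapter-data-analysis")

    # Remove the Lake Formation registration of the governed-table location.
    boto3.client("lakeformation").deregister_resource(
        ResourceArn="arn:aws:s3:::<target_s3_bucket>/employees_governed_table/"
    )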

Summary

In this chapter, we learned how data in the data lake can be consumed through both Athena and Redshift. Then, we saw how to create transactional data lakes using technologies such as Hudi and Delta Lake. Next, we looked at various mechanisms for consuming streaming sources in Glue, using the forEachBatch method and Hudi DeltaStreamer. Finally, we saw how the Elasticsearch connector from the AWS Glue connector offerings can be used to push data into an OpenSearch domain and consume it through OpenSearch Dashboards. This chapter familiarized you with the most common patterns of data analysis and ETL using AWS Glue.

In the next chapter, we will learn about ML. We will find out more about the strengths and weaknesses of SparkML and SageMaker and when to use each of those tools.
