Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark

In Chapter 2, Exploring the Architecture and Deployment Options, you learned about different EMR use cases such as batch Extract, Transform, and Load (ETL), real-time streaming with EMR and Spark Streaming, data preparation for machine learning (ML) models, interactive analytics, and more.

In this chapter, we will dive deep into one of those use cases, batch ETL with Amazon EMR and Apache Spark, and walk through the implementation steps you can follow to replicate the setup in your AWS account.

We will cover the following topics, which will help you understand the use case, its application architecture, and how a transient EMR cluster with Spark can be integrated for distributed processing:

  • Use case and architecture overview
  • Implementation steps
  • Validating output through Athena
  • Spark ETL and Lambda function code walk-through

Batch ETL is a common use case across many...

Technical requirements

In this chapter, we will implement a batch ETL pipeline using AWS services, so before getting started, make sure you have the following in place:

  • An AWS account with access to create Amazon S3, AWS Lambda, Amazon EMR, Amazon Athena, and AWS Glue Data Catalog resources
  • An IAM user that has access to create IAM roles, which will be used to trigger or execute jobs
  • Access to the GitHub repository:

https://github.com/PacktPublishing/Simplify-Big-Data-Analytics-with-Amazon-EMR-/tree/main/chapter_09

Now let's dive deep into the use case and hands-on implementation steps.

Check out the following video to see the Code in Action at https://bit.ly/3LtLZGX

Use case and architecture overview

For this use case, let's assume you have a vendor who provides incremental sales data at the end of every day. The file arrives in S3 in CSV format, and it needs to be processed and made available to your data analysts for querying.

Your assignment is to build a data pipeline that automatically picks up the new sales file from the S3 input bucket, processes it with the required transformations, and makes it available in the target S3 bucket, which will be used for querying. To implement this pipeline, you plan to integrate a transient EMR cluster with Spark as the distributed processing engine. This EMR cluster is not kept running permanently; it is created just before the job executes and is terminated once the job completes.
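As a rough illustration of how the trigger part of such a pipeline can be wired up, the following is a minimal sketch (not the book's exact setup) that attaches an S3 event notification to the input bucket so that every new CSV object invokes the Lambda function; the bucket name and Lambda function ARN are placeholder assumptions:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket name and Lambda ARN below are placeholder assumptions.
# This invokes the Lambda function whenever a new .csv object lands in the input bucket.
s3.put_bucket_notification_configuration(
    Bucket="raw-input",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:trigger-emr-batch-etl",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    },
)

Note that S3 can only invoke the function if the Lambda function's resource policy allows it, which you can grant with the Lambda add_permission API or by creating the trigger from the Lambda console.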

Architecture overview

The following is the high-level architecture diagram of the data pipeline:

Figure 9.1 – Reference architecture diagram for a batch ETL pipeline

Here are the steps as shown...

Implementation steps

In this section, we will guide you through the implementation steps for the use case and architecture we explained in the previous section.

Important Note

Please note that the implementation steps use us-east-1 as the AWS Region. You can use the same Region or an alternative Region of your choice. Please check any resource or service limits that might apply to your Region before proceeding with the implementation.

Creating Amazon S3 buckets

Let's first create the Amazon S3 buckets and folders that will be used for both input and output. Please refer to the following steps to create them:

  1. Navigate to the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home?region=us-east-1#.
  2. From the buckets list, choose Create bucket, which opens a form where you can provide your bucket name and related configuration.

We have specified the input bucket name as raw-input and kept everything else...
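If you prefer to script this step rather than use the console, a minimal boto3 sketch along the following lines should work; the bucket names are assumptions (S3 bucket names must be globally unique), not values mandated by the book:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names are placeholders; suffix them (for example, with your account ID)
# to keep them globally unique.
s3.create_bucket(Bucket="raw-input-111122223333")           # input bucket for the vendor CSV files
s3.create_bucket(Bucket="transformed-output-111122223333")  # output bucket for the processed data

# Optionally create folder-style prefixes by writing zero-byte objects
s3.put_object(Bucket="raw-input-111122223333", Key="sales/")
s3.put_object(Bucket="transformed-output-111122223333", Key="sales/")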

Validating the output using Amazon Athena

The Parquet-format data is already available in Amazon S3, partitioned by year and month, but to make it more consumable for data analysts or data scientists, it would be great if we could enable querying the data through SQL by making it available as a database table.

To make that integration, we can follow a two-step approach:

  1. We can run the Glue crawler to create a Glue Data Catalog table on top of the S3 data.
  2. We can run a query in Athena to validate the output.

Let's see how you can integrate that.
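As a preview of step 2, the validation query can be run directly in the Athena console, or submitted programmatically as in the following minimal sketch; the database, table, and query-result location are assumptions that depend on how you configure the crawler in the next step:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database, table, and result-location names below are placeholder assumptions.
response = athena.start_query_execution(
    QueryString=(
        "SELECT year, month, COUNT(*) AS record_count "
        "FROM sales_db.sales GROUP BY year, month ORDER BY year, month"
    ),
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-111122223333/"},
)
print(response["QueryExecutionId"])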

Defining a virtual Glue Data Catalog table on top of Amazon S3 data

You can follow these steps to create and run the Glue crawler, which will create a Glue Data Catalog table:

  1. Navigate to the AWS Glue console's Crawlers page at https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers.
  2. Then click Add crawler, which will open up the form to configure the crawler.
  3. Configure the crawler...
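If you prefer to script the crawler instead of configuring it in the console, a minimal boto3 sketch might look like the following; the crawler name, IAM role, database name, and S3 path are assumptions:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawler name, IAM role, database name, and S3 path are placeholder assumptions.
glue.create_crawler(
    Name="sales-parquet-crawler",
    Role="GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://transformed-output/sales/"}]},
)

# Run the crawler once; it infers the schema and partitions and creates the catalog table
glue.start_crawler(Name="sales-parquet-crawler")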

Spark ETL and Lambda function code walk-through

You can download the complete code from our GitHub repository specified in the Technical requirements section of the chapter. In this section, we will highlight a few sections of the code to explain its purpose and usage.
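The Spark ETL script itself lives in the repository; purely to give a rough idea of its shape, a minimal sketch of reading the CSV input and writing year/month-partitioned Parquet might look like the following, where the column names and S3 paths are assumptions rather than the book's exact code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-batch-etl").getOrCreate()

# Read the incremental sales CSV from the input bucket (path and schema are assumptions)
sales_df = spark.read.option("header", "true").csv("s3://raw-input/sales/")

# Derive year and month partition columns from an assumed sale_date column
sales_df = (
    sales_df
    .withColumn("year", F.year(F.to_date(F.col("sale_date"))))
    .withColumn("month", F.month(F.to_date(F.col("sale_date"))))
)

# Write the output as Parquet, partitioned by year and month
sales_df.write.mode("append").partitionBy("year", "month").parquet(
    "s3://transformed-output/sales/"
)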

Understanding the AWS Lambda function code

The Lambda function's primary objective is to launch an EMR cluster and then submit a Spark step to it.

The following part of the code creates a boto3 client for the EMR service and invokes its run_job_flow method, which takes all the required inputs for the cluster:

# Create an EMR client in the target Region, then launch the cluster
conn = boto3.client("emr", region_name=AWS_REGION)
cluster_id = conn.run_job_flow(…)

The following parameters are passed to the run_job_flow method to specify the EMR cluster configuration:

Instances={
    "Ec2KeyName": "<key-name>",
    "...

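For context, a minimal sketch of what a complete run_job_flow call for a transient cluster might look like follows, reusing the conn client created above; the release label, instance types, bucket names, and script path are illustrative assumptions rather than the book's exact configuration:

# All names, paths, and sizes below are placeholder assumptions.
cluster_id = conn.run_job_flow(
    Name="transient-batch-etl-cluster",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://emr-logs-111122223333/",
    Instances={
        "Ec2KeyName": "<key-name>",
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step, making the cluster transient
        "TerminationProtected": False,
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://scripts-111122223333/spark_etl.py",
                    "s3://raw-input/sales/",
                    "s3://transformed-output/sales/",
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)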
Summary

Over the course of this chapter, we have dived deep into a batch ETL use case, where we integrated the data pipeline with Amazon S3, AWS Lambda, Amazon EMR, AWS Glue, and Amazon Athena.

We have covered detailed implementation steps, which you can follow to replicate the pipeline or customize it as per your use case.

At the end of the chapter, we provided an overview of a few important parts of the AWS Lambda function and EMR PySpark script, which can provide you with a starting point for your projects.

That concludes this chapter! Hopefully, it has given you an idea of how batch ETL pipelines can be built, and in the next chapter, we will cover another use case: real-time streaming with Amazon EMR.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you have integrated the complete ETL pipeline but when your input file gets pushed to the input S3 bucket, the Lambda function does not launch the EMR cluster. When you plan to debug the Lambda function execution, you don't find any logs for the Lambda function in CloudWatch log groups. What might be the problem that stops the Lambda function from writing logs in CloudWatch and how would you resolve it?
  2. Assume you have multiple data sources that are sending input files for processing. Instead of triggering an EMR cluster launch on an S3 file arrival event, you would like to schedule a PySpark job to run at a particular time of the day, so that it picks up all the input files available at that point of time for processing. How would you schedule the cluster creation and job execution?
  3. You have integrated Amazon EMR for your batch analytics workload...

Further reading

The following are a few resources you can refer to for further reading:
