Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming

In Chapter 3, Common Use Cases and Architecture Patterns, we discussed different use cases and architecture patterns that you can follow using Amazon EMR, while in Chapter 9, Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark, you learned how you can implement a batch Extract, Transform, and Load (ETL) pipeline using Amazon EMR and a PySpark script.

In this chapter, we will dive deep into another use case – real-time streaming with Amazon EMR and Spark Streaming, where we will look at the implementation steps that you can follow to replicate the setup in your AWS account.

Real-time streaming use cases are becoming more popular as distributed processing engines such as Spark can ingest and transform data in real time and help drive business decisions through real-time business intelligence (BI) reporting. Working through this sample use case implementation will provide you with a starting point...

Technical requirements

In this chapter, we will implement a real-time streaming pipeline using AWS analytics services. So, before getting started, you need to make sure that you have the following requirements ready:

  • An AWS account with access to create Amazon S3, Amazon EMR, Amazon Athena, Amazon Cognito, and AWS Glue Catalog resources.
  • An IAM user who has access to create IAM roles, which will be used to trigger an AWS CloudFormation stack or execute jobs.

Refer to the following link for access to the book's GitHub repository: https://github.com/PacktPublishing/Simplify-Big-Data-Analytics-with-Amazon-EMR-/tree/main/chapter_10.

Now, let's dive deep into the use case and the hands-on implementation steps.

Check out the following video to see the Code in Action at https://bit.ly/3oIz89Q

Use case and architecture overview

For this use case, let's assume you have a consumer-facing website where users are interacting with your web pages by clicking different buttons or links, which are specific to different page navigations, signing up, signing in, or buying products. You have started a promotional sale on your website for a limited duration and you would like to track how users are reacting to it in real time.

To track user activity, your frontend application has integrated click events, which publish a JSON event string for every mouse click to a message bus such as Amazon Kinesis Data Streams or Apache Kafka. From the message bus, a Spark Streaming-based consumer application will read the JSON messages in micro-batches and write them to the Amazon S3 data lake for real-time analysis.
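
As a concrete illustration, a single click event payload might look like the following JSON. The field names here are assumptions for illustration only and are not necessarily the exact schema used in the book's sample template:

{
  "session_id": "a1b2c3d4",
  "user_id": "user-1001",
  "event_type": "click",
  "page": "/products/promo-sale",
  "element": "buy-now-button",
  "event_timestamp": "2022-03-01T10:15:30Z"
}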

To replicate the streaming of click events, we will integrate the Kinesis Data Generator web UI tool, where you can configure a sample JSON event and schedule it to publish a fixed number...

Implementation steps

In this section, we will guide you through the implementation steps for the use case and architecture we explained in the previous section.

Important Note

While explaining the implementation steps, we have used us-east-1 as the AWS region. You can use the same or an alternate region as per your choice. Please check any resource or service limits that might apply to your AWS region before proceeding with the implementation.

Creating Amazon S3 buckets

Let's first create the Amazon S3 buckets, which will be used by the EMR Spark job to write the streaming data. Please refer to the following steps to create them:

  1. Navigate to the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home?region=us-east-1#.
  2. From the buckets list, choose the Create bucket option, which will open a form on the web interface to provide your bucket name and related configurations.

We have specified the bucket name as clickstream-events and kept everything...
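
If you prefer to script the bucket creation instead of using the console, a minimal boto3 sketch along the following lines would create an equivalent bucket. Note that bucket names are globally unique, so you would replace clickstream-events with your own unique name:

import boto3

# Hypothetical alternative to the console steps shown above.
# No LocationConstraint is needed when creating a bucket in us-east-1.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="clickstream-events")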

Validating output using Amazon Athena

The Parquet-format data is already available in Amazon S3, organized by partition columns, but to make it more consumable for data analysts or data scientists, it would be great if we could enable querying the data through SQL by making it available as a database table.

To make that integration, we will follow a two-step approach:

  1. First, we will run Glue Crawler to create a Glue Catalog table on top of the S3 data.
  2. Then, we will run a query in Athena to validate the output.

Let's see how you can integrate that.

Defining a virtual Glue Catalog table on top of Amazon S3 data

You can follow these steps to create and run Glue Crawler, which will create a Glue Data Catalog table:

  1. Navigate to AWS Glue Crawler at https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers.
  2. Then, click Add crawler, which will open a form to configure the crawler.
  3. Configure the crawler, where the data source should...
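
Once the crawler has run and the table is available in the Glue Data Catalog, a validation query in the Athena query editor might look like the following sketch. The database name (clickstream_db), table name (clickstream_events), and column names are illustrative assumptions that depend on how you configure the crawler and your event schema:

-- Illustrative names only; adjust to your crawler's database, table, and columns
SELECT page, element, COUNT(*) AS clicks
FROM clickstream_db.clickstream_events
WHERE year = '2022' AND month = '03'
GROUP BY page, element
ORDER BY clicks DESC
LIMIT 10;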

Spark Streaming code walk-through

You can download the complete PySpark script from our GitHub repository. The following is a walk-through of the primary functions of the script.

The following getSparkSessionInstance() function is a user-defined helper that returns an existing SparkSession instead of creating a duplicate instance within other user-defined functions:

from pyspark.sql import SparkSession

# Get the existing SparkSession (create it only on the first call)
def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = \
            SparkSession.builder.config(conf=sparkConf).getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

The following processRecords() function is a user-defined function that is invoked for each RDD (micro-batch) of the Kinesis stream to parse the records of the RDD and write them to Amazon S3 in Parquet format with year, month, date, and hour partition columns:

# Process...
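
The full function is available in the book's GitHub repository and is not reproduced here. Purely as a rough sketch of the same idea (parse each micro-batch and append it to S3 with time-based partition columns), such a function might look like the following; the output path s3://clickstream-events/streaming-output/ and the derived column names are assumptions for illustration:

from pyspark.sql import functions as F

# Rough sketch only -- not the actual function from the book's repository
def processRecords(rdd):
    if rdd.isEmpty():
        return  # nothing arrived in this micro-batch
    spark = getSparkSessionInstance(rdd.context.getConf())
    # Each Kinesis record is a JSON string; let Spark infer the schema
    df = spark.read.json(rdd)
    # Derive partition columns from the processing time of the micro-batch
    df = (df.withColumn("event_ts", F.current_timestamp())
            .withColumn("year", F.year("event_ts"))
            .withColumn("month", F.month("event_ts"))
            .withColumn("date", F.dayofmonth("event_ts"))
            .withColumn("hour", F.hour("event_ts")))
    # Append the micro-batch to the S3 data lake in Parquet format
    (df.write.mode("append")
        .partitionBy("year", "month", "date", "hour")
        .parquet("s3://clickstream-events/streaming-output/"))

In the actual script, a function like this would typically be registered on the Kinesis DStream with a call such as kinesisStream.foreachRDD(processRecords), where the variable name is again an assumption.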

Summary

Over the course of this chapter, we have dived deep into a real-time streaming use case, where we have integrated the data pipeline with Amazon S3, Amazon EMR, AWS Glue, and Amazon Athena.

We have covered detailed implementation steps, which you can follow to replicate the same setup or customize it as per your use case. For our implementation, we have leveraged the Kinesis Data Generator UI tool to replicate clickstream data generation and push it to Kinesis Data Streams. In your production implementation, your web application should push data to Kinesis Data Streams in real time.

At the end, we provided an overview of a few important parts of the EMR PySpark script, which can provide you with a starting point.

That concludes this chapter! Hopefully, this helped you get an idea of how you can integrate real-time streaming pipelines, and, in the next chapter, we will integrate another use case that implements UPSERT or MERGE in a data lake using the Apache Hudi framework...

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume that the volume of data you receive in every micro-batch of the stream is very small (in KB) and, in your data lake, you plan to maintain a minimum file size of 64-128 MB for better read performance. How should you design the pipeline and what trade-offs should you consider?
  2. Assume, owing to infrastructure failures, that your EMR cluster got terminated but your source application is still continuously sending events to Kinesis Data Streams. When you restart your EMR cluster to resume the flow, how would you make sure that you do not lose any messages while processing the data using Spark?

Further reading

The following are a few resources you can refer to for further reading:
