Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming

In Chapter 3, Common Use Cases and Architecture Patterns, we discussed different use cases and architecture patterns that you can follow using Amazon EMR, while in Chapter 9, Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark, you learned how you can implement a batch Extract, Transform, and Load (ETL) pipeline using Amazon EMR and a PySpark script.

In this chapter, we will dive deep into another use case – real-time streaming with Amazon EMR and Spark Streaming, where we will look at the implementation steps that you can follow to replicate the setup in your AWS account.

Real-time streaming use cases are becoming more popular as distributed processing engines such as Spark can ingest and transform data in real time and help drive business decisions through real-time business intelligence (BI) reporting. Working through this sample use case implementation will provide you with a starting point...

Technical requirements

In this chapter, we will implement a real-time streaming pipeline using AWS analytics services. So, before getting started, you need to make sure that you have the following requirements ready:

  • An AWS account with access to create Amazon S3, Amazon EMR, Amazon Athena, Amazon Cognito, and AWS Glue Catalog resources.
  • An IAM user who has access to create IAM roles, which will be used to trigger an AWS CloudFormation stack or execute jobs.

Refer to the following link for access to the book's GitHub repository: https://github.com/PacktPublishing/Simplify-Big-Data-Analytics-with-Amazon-EMR-/tree/main/chapter_10.

Now, let's dive deep into the use case and the hands-on implementation steps.

Check out the following video to see the Code in Action at https://bit.ly/3oIz89Q

Use case and architecture overview

For this use case, let's assume you have a consumer-facing website where users are interacting with your web pages by clicking different buttons or links, which are specific to different page navigations, signing up, signing in, or buying products. You have started a promotional sale on your website for a limited duration and you would like to track how users are reacting to it in real time.

To track user activity, your frontend application has integrated click events, which publish a JSON event string for every mouse click to a message bus such as Amazon Kinesis Data Streams or Apache Kafka. From the message bus, a Spark Streaming-based consumer application will read the JSON messages in micro-batches and write them to the Amazon S3 data lake for real-time analysis.
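
As a concrete illustration, a single click event payload might look like the following JSON. The field names here are assumptions for illustration only and are not necessarily the exact schema used in the book's sample template:

{
  "session_id": "a1b2c3d4",
  "user_id": "user-1001",
  "event_type": "click",
  "page": "/products/promo-sale",
  "element": "buy-now-button",
  "event_timestamp": "2022-03-01T10:15:30Z"
}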

To replicate the streaming of click events, we will integrate the Kinesis Data Generator web UI tool, where you can configure a sample JSON event and schedule it to publish a fixed number...

Implementation steps

In this section, we will guide you through the implementation steps for the use case and architecture we explained in the previous section.

Important Note

While explaining the implementation steps, we have used us-east-1 as the AWS region. You can use the same or an alternate region as per your choice. Please check any resource or service limits that might apply to your AWS region before proceeding with the implementation.

Creating Amazon S3 buckets

Let's first create the Amazon S3 buckets, which will be used by the EMR Spark job to write the streaming data. Please refer to the following steps to create them:

  1. Navigate to the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home?region=us-east-1#.
  2. From the buckets list, choose the Create bucket option, which will open a form on the web interface to provide your bucket name and related configurations.

We have specified the bucket name as clickstream-events and kept everything...
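
If you prefer to script the bucket creation instead of using the console, a minimal boto3 sketch along the following lines would create an equivalent bucket. Note that bucket names are globally unique, so you would replace clickstream-events with your own unique name:

import boto3

# Hypothetical alternative to the console steps shown above.
# No LocationConstraint is needed when creating a bucket in us-east-1.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="clickstream-events")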

Validating output using Amazon Athena

The Parquet-format data is already available in Amazon S3, organized by partition columns, but to make it more consumable for data analysts or data scientists, it would be great if we could enable querying the data through SQL by making it available as a database table.

To make that integration, we will follow a two-step approach:

  1. First, we will run Glue Crawler to create a Glue Catalog table on top of the S3 data.
  2. Then, we will run a query in Athena to validate the output.

Let's see how you can integrate that.

Defining a virtual Glue Catalog table on top of Amazon S3 data

You can follow these steps to create and run Glue Crawler, which will create a Glue Data Catalog table:

  1. Navigate to AWS Glue Crawler at https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers.
  2. Then, click Add crawler, which will open a form to configure the crawler.
  3. Configure the crawler, where the data source should...
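
Once the crawler has run and the table is available in the Glue Data Catalog, a validation query in the Athena query editor might look like the following sketch. The database name (clickstream_db), table name (clickstream_events), and column names are illustrative assumptions that depend on how you configure the crawler and your event schema:

-- Illustrative names only; adjust to your crawler's database, table, and columns
SELECT page, element, COUNT(*) AS clicks
FROM clickstream_db.clickstream_events
WHERE year = '2022' AND month = '03'
GROUP BY page, element
ORDER BY clicks DESC
LIMIT 10;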

Spark Streaming code walk-through

You can download the complete PySpark script from our GitHub repository. The following is a walk-through of the primary functions of the script.

The following getSparkSessionInstance() function is a user-defined helper that returns an existing SparkSession instead of creating a duplicate instance within other user-defined functions:

from pyspark.sql import SparkSession

# Get the existing SparkSession (create it only on the first call)
def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = \
            SparkSession.builder.config(conf=sparkConf).getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

The following processRecords() function is a user-defined function that is invoked for each RDD (micro-batch) of the Kinesis stream to parse the records of the RDD and write them to Amazon S3 in Parquet format with year, month, date, and hour partition columns:

# Process...
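
The full function is available in the book's GitHub repository and is not reproduced here. Purely as a rough sketch of the same idea (parse each micro-batch and append it to S3 with time-based partition columns), such a function might look like the following; the output path s3://clickstream-events/streaming-output/ and the derived column names are assumptions for illustration:

from pyspark.sql import functions as F

# Rough sketch only -- not the actual function from the book's repository
def processRecords(rdd):
    if rdd.isEmpty():
        return  # nothing arrived in this micro-batch
    spark = getSparkSessionInstance(rdd.context.getConf())
    # Each Kinesis record is a JSON string; let Spark infer the schema
    df = spark.read.json(rdd)
    # Derive partition columns from the processing time of the micro-batch
    df = (df.withColumn("event_ts", F.current_timestamp())
            .withColumn("year", F.year("event_ts"))
            .withColumn("month", F.month("event_ts"))
            .withColumn("date", F.dayofmonth("event_ts"))
            .withColumn("hour", F.hour("event_ts")))
    # Append the micro-batch to the S3 data lake in Parquet format
    (df.write.mode("append")
        .partitionBy("year", "month", "date", "hour")
        .parquet("s3://clickstream-events/streaming-output/"))

In the actual script, a function like this would typically be registered on the Kinesis DStream with a call such as kinesisStream.foreachRDD(processRecords), where the variable name is again an assumption.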

Summary

Over the course of this chapter, we have dived deep into a real-time streaming use case, where we have integrated the data pipeline with Amazon S3, Amazon EMR, AWS Glue, and Amazon Athena.

We have covered detailed implementation steps, which you can follow to replicate the same setup or customize it as per your use case. For our implementation, we have leveraged the Kinesis Data Generator UI tool to replicate clickstream data generation and push it to Kinesis Data Streams. In your production implementation, your web application should push data to Kinesis Data Streams in real time.

At the end, we provided an overview of a few important parts of the EMR PySpark script, which can provide you with a starting point.

That concludes this chapter! Hopefully, this helped you get an idea of how you can integrate real-time streaming pipelines, and, in the next chapter, we will integrate another use case that implements UPSERT or MERGE in a data lake using the Apache Hudi framework...

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume that the volume of data you receive in every micro-batch of the stream is very small (in KB) and, in your data lake, you plan to maintain a minimum file size of 64-128 MB for better read performance. How should you design the pipeline and what trade-offs should you consider?
  2. Assume, owing to infrastructure failures, that your EMR cluster got terminated but your source application is still continuously sending events to Kinesis Data Streams. When you restart your EMR cluster to resume the flow, how would you make sure that you do not lose any messages while processing the data using Spark?

Further reading

The following are a few resources you can refer to for further reading:
