You're reading from  Simplify Big Data Analytics with Amazon EMR

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781801071079
Edition: 1st Edition
Author (1)
Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor’s degree in engineering and a master’s degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also the author of multiple technology blogs, workshops, and white papers, and a public speaker who represents AWS in various domains and events.

Chapter 3: Common Use Cases and Architecture Patterns

This chapter provides an overview of common use cases and architecture patterns you will see with Amazon Elastic MapReduce (EMR) and how EMR integrates with different AWS services to solve specific use cases. The use cases include batch Extract, Transform, and Load (ETL), real-time streaming, clickstream analytics, interactive analytics with machine learning (ML), genomics data analysis, and log analytics.

This should give you a starting point for understanding which problems you can solve with Amazon EMR and how to apply it to your real-world big data use cases.

We will dive deep into the following topics in this chapter:

  • Reference architecture for batch ETL workloads
  • Reference architecture for clickstream analytics
  • Reference architecture for interactive analytics and ML
  • Reference architecture for real-time streaming analytics
  • Reference architecture for genomics data analytics
  • Reference architecture...

Reference architecture for batch ETL workloads

When data analysts receive data from different data sources, the first thing they do is transform it into a format that can be used for analysis or reporting. This transformation process might involve several steps to bring the data to the desired state. After the data is ready, you need to load it into a data warehousing system or data lake, where data analysts or data scientists can consume it.

To make the data available for consumption, you need to extract it from the source, transform it through different steps, and then load it into the target storage layer, hence the term ETL. In other use cases, when the raw data is already in a structured format, you can load it into a relational database or data warehouse first and then transform it with SQL. This pattern is called Extract, Load, and Transform (ELT).
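The three ETL stages can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample CSV data, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target storage layer. It is not the EMR-based architecture this section describes.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a source system export.
RAW_CSV = """order_id,customer,amount,currency
1001,alice,25.50,usd
1002,bob,,usd
1003,carol,14.00,USD
"""


def extract(raw_text):
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, normalize currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"],
                float(row["amount"]),
                row["currency"].upper(),
            )
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_etl(raw_text=RAW_CSV):
    """Run extract, transform, and load in order; return the connection."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(raw_text)))
    return conn


if __name__ == "__main__":
    connection = run_etl()
    print(connection.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

In an ELT variant of the same sketch, the raw rows would be loaded into a staging table first, and the cleansing would then be expressed as SQL against that table.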

What we understand from all this is that transformation is the primary piece that makes raw data ready for consumption. What...

Reference architecture for clickstream analytics

In consumer-facing applications, such as web applications or mobile applications, business owners are interested in identifying metrics from users' access patterns to derive insights into which products, services, or features users like most. This enables business leaders to make more accurate decisions. Often, it becomes necessary to capture user actions or clicks in real time to feed a real-time dashboard that shows how successful your campaign is or how users are responding to your new product launch.

To make business decisions in real time, you need to have the data flow in near real time too. This means as soon as the user clicks anywhere within the application, you need to capture an event immediately and push it through your backend system for processing. As multiple users access your application through different channels, it generates a stream of events and you need a scalable architecture that can support...
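To make the idea of a stream of click events more concrete, the following sketch groups events into fixed one-minute windows and counts clicks per page, which is the kind of aggregate a real-time dashboard might display. The event fields (`user`, `page`, `ts`), the window size, and the function names are assumptions for illustration. In a real pipeline, a streaming engine would perform this aggregation at scale rather than a single Python process.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed window size for the illustration


def window_start(ts, window=WINDOW_SECONDS):
    """Return the start of the fixed window that contains timestamp ts."""
    return ts - (ts % window)


def count_clicks_per_window(events, window=WINDOW_SECONDS):
    """Count clicks per page within each fixed (tumbling) time window."""
    counts = defaultdict(Counter)
    for event in events:
        counts[window_start(event["ts"], window)][event["page"]] += 1
    # Return plain dictionaries ordered by window start time.
    return {start: dict(pages) for start, pages in sorted(counts.items())}


# Hypothetical click events; ts is seconds since an arbitrary start.
SAMPLE_EVENTS = [
    {"user": "u1", "page": "/home", "ts": 0},
    {"user": "u2", "page": "/home", "ts": 15},
    {"user": "u1", "page": "/product/42", "ts": 59},
    {"user": "u3", "page": "/home", "ts": 61},
]

if __name__ == "__main__":
    for start, pages in count_clicks_per_window(SAMPLE_EVENTS).items():
        print(start, pages)
```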

Reference architecture for interactive analytics and ML

In the previous sections of this chapter, you saw Amazon EMR used as a transient cluster that gets created on file arrival or by a scheduled event, processes the file with Hive or Spark steps, and then gets terminated. Transient clusters are great for decoupling storage and compute and for saving costs by reducing cluster idle time.

However, there are a few use cases where you might need a persistent EMR cluster that stays active 24x7 with minimal node capacity and relies on the EMR autoscaling feature to scale up and down as needed. These persistent clusters generally serve multiple workloads, including ETL transformations with Hive/Spark, data analysis through SQL-based query engines such as Hive and Presto, or interactive ML model development through notebooks. In a few cases, you can implement a multi-tenant EMR cluster that serves multiple teams with an access policy and data isolation.
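The difference between the two cluster styles shows up mainly in a handful of launch settings. The sketch below builds two request dictionaries shaped like the parameters of boto3's EMR `run_job_flow` call: one for a transient cluster that runs a single Spark step and terminates, and one for a persistent cluster with scaling limits. The cluster names, instance types and counts, release label, and script location are placeholder assumptions. A managed scaling policy is used here as one way to express the scaling limits, so the autoscaling setup in your environment may differ.

```python
def base_request(name, applications):
    """Settings shared by both cluster styles (all values are placeholders)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.5.0",  # assumed release label
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def transient_cluster_request(name, script_s3_uri):
    """Cluster that runs one Spark step and terminates when no steps remain."""
    request = base_request(name, ["Spark"])
    request["Instances"]["KeepJobFlowAliveWhenNoSteps"] = False
    request["Steps"] = [{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }]
    return request


def persistent_cluster_request(name, min_units, max_units):
    """Long-running cluster that stays alive and scales between two limits."""
    request = base_request(name, ["Spark", "Hive", "Presto"])
    request["Instances"]["KeepJobFlowAliveWhenNoSteps"] = True
    request["ManagedScalingPolicy"] = {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }
    return request


if __name__ == "__main__":
    # Hypothetical bucket and script path, for illustration only.
    print(transient_cluster_request("nightly-etl", "s3://example-bucket/etl.py"))
    print(persistent_cluster_request("shared-analytics", 2, 10))
```

With valid AWS credentials and roles in place, a dictionary like these could be passed as keyword arguments, for example `boto3.client("emr").run_job_flow(**request)`. The values above would need to be adapted to your account before doing so.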

...

Reference architecture for real-time streaming analytics

Earlier in this chapter, you learned about clickstream analytics, which integrated Amazon Kinesis Data Streams (KDS) and EMR with Spark Streaming to process clickstream events in real time. This section covers another streaming use case: how you can stream Internet of Things (IoT) device events in real time to your data lake and data warehouse for real-time dashboards.

To give an overview, IoT is a network of physical objects, called things, that use sensors and related software technologies to connect and exchange information with other devices or systems over the internet. These devices can be any household or industrial equipment that has a sensor and the required software embedded in it to communicate with other devices or send messages to a central unit that monitors requests or signals.

The adoption of IoT around the world is increasing as analytics on device data can provide a lot of insights to optimize...

Reference architecture for genomics data analytics

Before going into the technical implementation details of genomics data analytics, let's understand what genomics means. It is a field of biology that focuses on the evolution, mapping, structure, and functions of genomes. A genome is the complete set of DNA of a living being, which includes all of its genes.

In recent times, there have been significant investments in genomics and clinical data to explore more about living beings' genes and their characteristics, which can help diagnose any disease beforehand or predict new features. Technology continues to play a vital role in genomics studies: as the data volume grows, you can use big data technologies for distributed processing.

Genomics datasets are available in complex data formats, such as VCF and gVCF, and to parse them, there are several popular frameworks available, such as Glow and Hail.
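To give a feel for why VCF needs dedicated parsing, the following sketch reads the eight fixed, tab-separated columns of a VCF data line in plain Python. It is a simplified illustration, not how Glow or Hail work internally: it ignores per-sample genotype columns, and the sample records are made-up values used only to exercise the parser.

```python
# The eight fixed columns that start every VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info_field):
    """Parse the INFO column: ';'-separated key=value pairs or bare flags."""
    info = {}
    if info_field == ".":  # "." marks a missing value
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # bare keys are boolean flags
    return info


def parse_vcf(text):
    """Return one dictionary per variant, skipping meta and header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # "##" meta-information lines and the "#CHROM" header
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")  # several alternates allowed
        record["INFO"] = parse_info(record["INFO"])
        variants.append(record)
    return variants


# Illustrative sample content, not real study data.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB",
    "20\t1110696\t.\tA\tG,T\t67\tPASS\tNS=2;DP=10",
])

if __name__ == "__main__":
    for variant in parse_vcf(SAMPLE_VCF):
        print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
```

Even this small example shows nested structure inside single columns (multiple alternate alleles, key-value annotations), which is one reason distributed frameworks with built-in VCF readers are attractive at scale.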

Use case overview

Let's assume your organization...

Reference architecture for log analytics

Log analytics is a common requirement in most enterprises. As you grow with multiple applications, jobs, or servers that produce enormous volumes of logs every day, it becomes essential to aggregate them for analysis.

Log analytics presents several challenges: you need to define log collection mechanisms, process the logs to apply common cleansing and standardization, and make them available for consumption. Each server or application produces logs in its own format, so your job is to bring them into a common format you can work with and to choose technologies that can handle the heavy volume of log streams.
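The standardization step can be sketched as a function that maps lines from different sources onto one common record shape. The example below handles a web server access line in the widely used common log format and an application log line in a hypothetical `timestamp LEVEL message` layout. The target schema, the rule that maps 5xx responses to `ERROR`, and the assumption that application timestamps are in UTC are all illustrative choices.

```python
import re
from datetime import datetime, timezone

# Web server access line in common log format.
ACCESS_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical application log layout: "YYYY-MM-DD HH:MM:SS LEVEL message".
APP_LOG = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
)


def normalize(line):
    """Map a raw log line to a common record, or None if unrecognized."""
    match = ACCESS_LOG.match(line)
    if match:
        ts = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        return {
            "source": "access",
            "timestamp": ts.astimezone(timezone.utc).isoformat(),
            # Assumed rule: treat 5xx responses as errors.
            "level": "ERROR" if match["status"].startswith("5") else "INFO",
            "message": f'{match["method"]} {match["path"]} {match["status"]}',
        }
    match = APP_LOG.match(line)
    if match:
        # Assumption: application timestamps are already in UTC.
        ts = datetime.strptime(match["time"], "%Y-%m-%d %H:%M:%S")
        return {
            "source": "application",
            "timestamp": ts.replace(tzinfo=timezone.utc).isoformat(),
            "level": match["level"],
            "message": match["message"],
        }
    return None


if __name__ == "__main__":
    samples = [
        '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        "2022-03-01 08:15:00 ERROR Connection refused",
        "unparseable line",
    ]
    for sample in samples:
        print(normalize(sample))
```

Returning `None` for unrecognized lines keeps the sketch simple. A production pipeline would more likely route such lines to a separate location for inspection rather than drop them.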

Use case overview

Let's assume your organization is on AWS and you have multiple applications deployed on Amazon EC2 instances. These applications are written in Java and a few other languages and hosted through Apache or NGINX servers. You have the following three log streams that are generating logs continuously, which you plan to collect and make available...

Summary

Over the course of this chapter, we have dived deep into a few common use cases where Amazon EMR can be integrated for big data processing. We discussed how you can integrate Amazon EMR as a persistent or transient cluster and how you can use it for batch ETL, real-time streaming, interactive analytics and ML, and log analytics use cases. For each use case, we walked through a reference architecture and a few recommendations for its implementation.

That concludes this chapter! Hopefully, you now have a good overview of the different architecture patterns around Amazon EMR and are ready to dive deep into different Hadoop interfaces and EMR Studio in the next chapter.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you are receiving data from multiple data sources and, after ETL transformation, storing the historical data in a data lake built on top of Amazon S3 and the aggregated data in the Redshift data warehouse. You have a requirement to provide unified query engine access, where your users can join both data lake and data warehouse data for analytics. How will you design the architecture, and which query engine will you recommend to your analysts?
  2. Your organization has multiple teams and departments that have different big data and ML workloads. They plan to share a common EMR cluster for their analytics and ML model development. Your data scientists are new to Amazon EMR and would like to understand how they can take advantage of this EMR cluster to do ML model development. What will your guidance be?
  3. You have a customer use case where you...

Further reading

The following are a few resources you can refer to for further reading:
