You're reading from Simplify Big Data Analytics with Amazon EMR

Product type: Book

Published in: Mar 2022

Publisher: Packt

ISBN-13: 9781801071079

Edition: 1st Edition

Tools: AWS

Concepts: Big Data

Author (1): Sakti Mishra

Chapter 3: Common Use Cases and Architecture Patterns

This chapter provides an overview of common use cases and architecture patterns you will see with Amazon Elastic MapReduce (EMR) and how EMR integrates with different AWS services to solve specific use cases. The use cases include batch Extract, Transform, and Load (ETL), real-time streaming, clickstream analytics, interactive analytics with machine learning (ML), genomics data analysis, and log analytics.

This should give you a starting point for understanding which problems you can solve with Amazon EMR, so that you can apply it to your own real-world big data use cases.

We will dive deep into the following topics in this chapter:

Reference architecture for batch ETL workloads
Reference architecture for clickstream analytics
Reference architecture for interactive analytics and ML
Reference architecture for real-time streaming analytics
Reference architecture for genomics data analytics
Reference architecture...

Reference architecture for batch ETL workloads

When data analysts receive data from different data sources, the first thing they do is transform it into a format that can be used for analysis or reporting. This transformation might involve several steps to bring the data to the desired state. Once the data is ready, you load it into a data warehousing system or data lake, where data analysts or data scientists can consume it.

To make the data available for consumption, you need to extract it from the source, transform it in one or more steps, and then load it into the target storage layer, hence the term ETL. In some other use cases, when the raw data is already in a structured format, you can load it directly into a relational database or data warehouse and then transform it with SQL, in which case the process becomes Extract, Load, and Transform (ELT).
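The difference between the two orderings can be illustrated with a minimal sketch. It is not taken from the book: the sample CSV, the table names, and the use of an in-memory SQLite database as a stand-in for the target data warehouse are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: three orders, one with a malformed amount.
RAW_CSV = "order_id,amount\n1,10.50\n2,4.25\n3,not_a_number\n"


def extract(raw_text):
    """Extract: read raw CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: cast types and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), float(row["amount"])))
        except ValueError:
            continue  # skip malformed records
    return clean


def load(conn, table, records):
    """Load: write two-column records into the target table."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (order_id, amount)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", records)


# ETL: transform in application code, then load only the clean records.
etl_conn = sqlite3.connect(":memory:")
load(etl_conn, "orders", transform(extract(RAW_CSV)))
etl_total = etl_conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# ELT: load the raw strings first, then transform inside the database with SQL.
elt_conn = sqlite3.connect(":memory:")
raw_records = [(row["order_id"], row["amount"]) for row in extract(RAW_CSV)]
load(elt_conn, "raw_orders", raw_records)
elt_conn.execute(
    "CREATE TABLE orders AS "
    "SELECT CAST(order_id AS INTEGER) AS order_id, CAST(amount AS REAL) AS amount "
    "FROM raw_orders WHERE amount GLOB '[0-9]*'"
)
elt_total = elt_conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(etl_total, elt_total)
```

Both paths end with the same clean `orders` table; what changes is where the transformation runs, in the processing layer for ETL or in the SQL engine for ELT.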

What we understand from all this is that transformation is the primary piece that makes raw data ready for consumption. What...

Reference architecture for clickstream analytics

In consumer-facing applications, such as web applications or mobile applications, business owners are interested in deriving metrics from users' access patterns to learn which products, services, or features users prefer. This enables business leaders to make more accurate decisions. Often, it becomes necessary to capture user actions or clicks in real time so that a real-time dashboard can show how successful your campaign is or how users are responding to your new product launch.

To make business decisions in real time, you need to have the data flow in near real time too. This means as soon as the user clicks anywhere within the application, you need to capture an event immediately and push it through your backend system for processing. As multiple users access your application through different channels, it generates a stream of events and you need a scalable architecture that can support...

Reference architecture for interactive analytics and ML

In the previous sections of this chapter, you might have seen Amazon EMR used as a transient cluster that is created by a file arrival or a scheduled event, processes the file with Hive or Spark steps, and is then terminated. Transient clusters are a good way to decouple storage and compute and to save costs by reducing cluster idle time.
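As a rough sketch of what "transient" means in practice, the following builds the keyword arguments for boto3's EMR `run_job_flow` call, with `KeepJobFlowAliveWhenNoSteps` set to `False` so that the cluster shuts down once its steps finish. The cluster name, instance types and counts, release label, S3 URIs, and the default role names are illustrative assumptions, not values from the book. The request is only constructed here, not sent.

```python
def build_transient_cluster_request(script_s3_uri, log_s3_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All names, sizes, and URIs are illustrative placeholders.
    """
    return {
        "Name": "transient-etl-cluster",
        "ReleaseLabel": "emr-6.5.0",
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2,
                },
            ],
            # False makes the cluster transient: it terminates after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "spark-etl-step",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    "s3://example-bucket/scripts/etl_job.py",  # hypothetical script location
    "s3://example-bucket/emr-logs/",           # hypothetical log location
)
# To launch for real (requires AWS credentials and the roles above):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

In an event-driven setup, a function like this would typically be invoked by whatever reacts to the file arrival or schedule, so that compute exists only for the duration of the job.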

But there are a few use cases where you might need a persistent EMR cluster that stays active 24x7 with minimal node capacity and uses the EMR autoscaling feature to scale up and down as needed. These persistent clusters generally serve multiple workloads, including ETL transformations with Hive/Spark, data analysis through SQL-based query engines such as Hive and Presto, or interactive ML model development through notebooks. In a few cases, you can implement a multi-tenant EMR cluster that serves multiple teams with access policies and data isolation.
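One way to express "minimal capacity that scales as needed" is an EMR managed scaling policy, sketched below. The excerpt does not say which scaling mechanism the author has in mind, so using managed scaling here is an assumption, as are the capacity numbers. The dictionary follows the `ManagedScalingPolicy` shape accepted by boto3's `put_managed_scaling_policy`.

```python
def build_managed_scaling_policy(min_units, max_units):
    """Return a ManagedScalingPolicy dict bounded by instance counts.

    min_units is the always-on baseline; max_units caps scale-out.
    """
    if min_units < 1 or max_units < min_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Illustrative limits: a small baseline that can grow to ten instances.
policy = build_managed_scaling_policy(2, 10)
# To attach it to a running cluster (requires AWS credentials; the cluster ID
# below is a placeholder):
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=policy
#   )
```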

...

Reference architecture for real-time streaming analytics

Earlier in the chapter, you learned about clickstream analytics that integrated Amazon KDS and EMR with Spark Streaming to stream clickstream events in real time. This section covers another streaming use case: how you can stream Internet of Things (IoT) device events in real time to your data lake and data warehouse for real-time dashboards.

To give an overview, IoT is a network of physical objects, called things, that use sensors and related software technologies to connect and exchange information with other devices or systems over the internet. These devices can be any household or industrial equipment that has a sensor and the required software embedded in it to communicate with other devices or send messages to a central unit that monitors requests or signals.

The adoption of IoT around the world is increasing as analytics on device data can provide a lot of insights to optimize...

Reference architecture for genomics data analytics

Before going into the technical implementation details of genomics data analytics, let's understand what genomics means. It is a field of biology that focuses on the evolution, mapping, structure, and functions of genomes. A genome is the complete set of DNA of a living being, which includes all of its genes.

In recent times, there have been significant investments in genomics and clinical data to learn more about living beings' genes and their characteristics, which can help diagnose diseases early or predict new features. Technology continues to play a vital role in genomics studies: as the data volume grows, you can use big data technologies for distributed processing.

Genomics datasets are available in complex data formats, such as VCF and gVCF, and to parse them, there are several popular frameworks available, such as Glow and Hail.
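To give a feel for what these frameworks have to parse, here is a minimal sketch that reads one VCF data line using only the Python standard library. It is meant to illustrate the format, not to stand in for Glow or Hail. It handles only the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), and the sample record is made up for illustration.

```python
def parse_vcf_record(line):
    """Parse the eight fixed columns of a single VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of key=value pairs and bare flags.
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": filt,
        "info": info_dict,
    }


# Illustrative record: a G-to-A variant on chromosome 20.
sample_line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
record = parse_vcf_record(sample_line)
print(record["chrom"], record["pos"], record["alt"], record["info"]["DP"])
```

Real files also carry header lines, a FORMAT column, and per-sample genotype columns, which is where the dedicated frameworks and distributed processing become useful.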

Use case overview

Let's assume your organization...

Reference architecture for log analytics

Log analytics is a common requirement in most enterprises. As your organization grows to run multiple applications, jobs, or servers that produce enormous volumes of logs every day, it becomes essential to aggregate them for analysis.

Log analytics poses several challenges: you need to define log collection mechanisms, process the logs to apply common cleansing and standardization, and make them available for consumption. Each server or application produces logs in its own format, and your job is to bring them into a common format that you can use, with technologies that can handle the heavy volume of log streams.
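The standardization step can be sketched as follows: a single web server access log line is parsed into a record with a common schema. The sketch assumes the line begins with the widely used Common Log Format fields; the output field names and the `source` label are hypothetical choices, and real deployments would need one such parser per log format.

```python
import re
from datetime import datetime

# Matches the leading Common Log Format fields of an access log line.
COMMON_LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def normalize_access_log(line):
    """Convert one access log line into a standardized dict, or None if it does not match."""
    match = COMMON_LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "source": "web_access",  # hypothetical label for this log stream
        "timestamp": timestamp.isoformat(),
        "host": match["host"],
        "method": match["method"],
        "path": match["path"],
        "status": int(match["status"]),
        "bytes": 0 if match["size"] == "-" else int(match["size"]),
    }


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = normalize_access_log(sample)
print(event)
```

Returning `None` for unparseable lines lets the caller route them to a separate location for inspection instead of failing the whole batch.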

Use case overview

Let's assume your organization is on AWS and you have multiple applications deployed on Amazon EC2 instances. These applications are written in Java and a few other languages and hosted through Apache or NGINX servers. You have the following three log streams that are generating logs continuously, which you plan to collect and make available...

Summary

Over the course of this chapter, we have dived deep into a few common use cases where Amazon EMR can be integrated for big data processing. We discussed how you can integrate Amazon EMR as a persistent or transient cluster and how you can use it for batch ETL, real-time streaming, interactive analytics and ML, and log analytics use cases. For each use case, we explained a reference architecture and a few recommendations around its implementation.

That concludes this chapter! Hopefully, you now have a good overview of different architecture patterns around Amazon EMR and are ready to dive deep into different Hadoop interfaces and EMR Studio in the next chapter.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

Assume you are receiving data from multiple data sources and, after ETL transformation, storing the historical data in a data lake built on top of Amazon S3 and the aggregated data in the Redshift data warehouse. You have a requirement to provide unified query engine access, where your users can join both data lake and data warehouse data for analytics. How will you design the architecture, and which query engine will you recommend to your analysts?
Your organization has multiple teams and departments that have different big data and ML workloads. They plan to use a common EMR cluster that they can use for their analytics and ML model development. Your data scientists are new to Amazon EMR and would like to understand how they can take advantage of this EMR cluster to do ML model development. What will your guidance be?
You have a customer use case where you...

Further reading

Different EMR case studies (search for EMR): https://aws.amazon.com/solutions/case-studies/
Redshift distribution style: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
Read more about AWS DMS: https://aws.amazon.com/dms/
Read more about AWS IoT: https://aws.amazon.com/iot/


Author (1)

Sakti Mishra

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor’s degree in engineering and a master’s degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also the author of multiple technology blogs, workshops, and white papers, and is a public speaker who represents AWS across various domains and events.