Home Data Simplify Big Data Analytics with Amazon EMR

Simplify Big Data Analytics with Amazon EMR

By Sakti Mishra
books-svg-icon Book
Subscription FREE
eBook + Subscription €11.99
eBook €28.99
Print + eBook €37.99
READ FOR FREE Free Trial for 7 days. €11.99 p/m after trial. Cancel Anytime! BUY NOW BUY NOW BUY NOW
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
READ FOR FREE Free Trial for 7 days. €11.99 p/m after trial. Cancel Anytime! BUY NOW BUY NOW BUY NOW
Subscription FREE
eBook + Subscription €11.99
eBook €28.99
Print + eBook €37.99
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
  1. Free Chapter
    Chapter 1: An Overview of Amazon EMR
About this book
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in S3 Data Lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize on-premises Hadoop cluster migration to EMR. In addition to this, you'll explore best practices and cost optimization techniques while implementing your data analytics pipeline in EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
Publication date:
March 2022
Publisher
Packt
Pages
430
ISBN
9781801071079

 

Chapter 1: An Overview of Amazon EMR

This chapter will provide an overview of Amazon Elastic MapReduce (EMR), its benefits related to big data processing, and how its cluster is designed compared to on-premises Hadoop clusters. It will then explain how Amazon EMR integrates with other Amazon Web Services (AWS) services and how you can build a Lake House architecture in AWS.

You will then learn the difference between the Amazon EMR, AWS Glue, and AWS Glue DataBrew services. Understanding the difference will make you aware of the options available when deploying Hadoop or Spark workloads in AWS.

Before going into this chapter, it is assumed that you are familiar with Hadoop-based big data processing workloads, have had exposure to AWS basis concepts, and are looking to get an overview of the Amazon EMR service so that you can use it for your big data processing workloads.

The following topics will be covered in this chapter:

  • What is Amazon EMR?
  • Overview of Amazon EMR
  • Decoupling compute and storage
  • Integration with other AWS services
  • EMR release history
  • Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
 

What is Amazon EMR?

Amazon EMR is an AWS service that provides a distributed cluster for big data processing. Now, before diving deep into EMR, let's first understand what big data represents, for which EMR is a solution or tool.

What is big data?

The beginnings of enormous volumes of datasets date back to the 1970s, when the world of data was just getting started with data centers and the development of relational databases, despite the fact that the concept of big data was still relatively new. These technology revolutions led to personal desktop computers, followed by laptops, and then mobile computers over the next several decades. As people got access to devices, the data being generated started growing exponentially.

Around the year 2005, people started to realize that users generate huge amounts of data. Social platforms, such as Facebook, Twitter, and YouTube generate data faster than ever, as users get access to smart products or internet-related services.

Put simply, big data refers to large, complex datasets, particularly those derived from new data sources. These datasets are large enough that traditional data processing software can't handle its storage and processing efficiently. But these massive volumes of data are of great use when we need to derive insights by analyzing them and then address business problems with it, which we were not able to do before. For example, an organization can analyze their users' or customers' interactions with their social pages or website to identify their sentiment against their business and products.

Often, big data is described by the five Vs. It started with three Vs, which includes data volume, velocity, and variety, but as it evolved, the accuracy and value of data also became major aspects of big data, which is when veracity and value got added to represent it as five Vs. These five Vs are explained as follows:

  • Volume: This represents the amount of data you have for analysis and it really varies from organization to organization. It can range from terabytes to petabytes in scale.
  • Velocity: This represents the speed at which data is being collected or processed for analysis. This can be a daily data feed you receive from your vendor or a real-time streaming use case, where you receive data every second to every minute.
  • Variety: When we talk about variety, it means what the different forms or types of data you receive are for processing or analysis. In general, they are broadly categorized into the following three:
    • Structured: Organized data format with a fixed schema. It can be from relational databases or CSVs or delimited files.
    • Semi-structured: Partially organized data that does not have a fixed schema, for example, XML or JSON files.
    • Unstructured: These datasets are more represented through media files, where they don't have a schema to follow, for example, audio or video files.
  • Veracity: This represents how reliable or truthful your data is. When you plan to analyze big data and derive insights out of it, the accuracy or quality of the data matters.
  • Value: This is often referred to as the worth of the data you have collected as it is meant to give insights that can help the business drive growth.

With the evolution of big data, the primary challenge became how to process such huge volumes of data, because the typical single system processing frameworks were not enough to handle them. It needed a distributed processing computing framework that can do parallel processing.

After understanding what big data represents, let's look at how the Hadoop processing framework helped to solve this big data processing problem statement and why it became so popular.

Hadoop – processing framework to handle big data

Though there were different technologies or frameworks that came to handle big data, the framework that got the most traction is Hadoop, which is an open source framework designed specifically for storing and analyzing big datasets. It allows combining multiple computers to form a cluster that can do parallel distributed processing to handle gigabyte- to petabyte-scale data.

The following is a data flow model that explains how the input data is collected, stored into Hadoop Distributed File System (HDFS), then processed with Hive, Pig, or Spark big data processing frameworks and the transformed output becomes available for consumption or is transferred to downstream systems or external vendors. It represents a high-level data flow, where input data is collected and stored as raw data. It then gets processed as needed for analysis and then made available for consumption:

Figure 1.1 – Data flow in a Hadoop cluster

Figure 1.1 – Data flow in a Hadoop cluster

The following are the main basic components of Hadoop:

  • HDFS: A distributed filesystem that runs on commodity hardware and provides improved data throughput as compared to traditional filesystems and higher reliability with an in-built fault tolerance mechanism.
  • Yet Another Resource Negotiator (YARN): When multiple compute nodes are involved with parallel processing capability, YARN helps to manage and monitor compute CPU and memory resources and also helps in scheduling jobs and tasks.
  • MapReduce: This is a distributed framework that has two basic modules, that is, map and reduce. The map task reads the data from HDFS or a distributed storage layer and converts it into key-value pairs, which then becomes input to the reduce tasks, which ideally aggregates the map output to provide the result.
  • Hadoop Common: These include common Java libraries that can be used across all modules of the Hadoop framework.

In recent years, the Hadoop framework became popular because of its massively parallel processing (MPP) capability on top of commodity hardware and its fault-tolerant nature, which made it more reliable. It was extended with additional tools and applications to form an ecosystem that can help to collect, store, process, analyze, and manage big data. Some of the most popular applications are as follows:

  • Spark: An open source distributed processing system that uses in-memory caching and optimized execution for fast performance. Similar to MapReduce, it provides batch processing capability as well as real-time streaming, machine learning, and graph processing capabilities.
  • Hive: Allows users to use distributed processing engines such as MapReduce, Tez, or Spark to query data from the distributed filesystem through the SQL interface.
  • Presto: Similar to Hive, Presto is also an open source distributed SQL query engine that is optimized for low-latency data access from the distributed filesystem. It's used for complex queries, aggregations, joins, and window functions. The Presto engine is available as two separate components in EMR, that is, PrestoDB and PrestoSQL or Trino.
  • HBase: An open source non-relational or NoSQL database that runs on top of the distributed filesystem that provides fast lookup for tables with billions of rows and millions of columns grouped as column families.
  • Oozie: Enables workflow orchestration with Oozie scheduler and coordinator components.
  • ZooKeeper: Helps in managing and coordinating Hadoop component resources with inter-component-based communication, grouping, and maintenance.
  • Zeppelin: An interactive notebook that enables interactive data exploration using Python and PySpark kind of frameworks.

Hadoop provides a great solution to big data processing needs and it has become popular with data engineers, data analysts, and data scientists for different analytical workloads. With its growing usage, Hadoop clusters have brought in high maintenance overhead, which includes keeping the cluster up to date with the latest software releases and adding or removing nodes to meet the variable workload needs.

Now let's understand the challenges on-premises Hadoop clusters face and how Amazon EMR comes as a solution to them.

Challenges with on-premises Hadoop clusters

Before Amazon EMR, customers used to have on-premises Hadoop clusters and faced the following issues:

  • Tightly coupled compute and storage architecture: Clusters used to use HDFS as their storage layer, where the data node's disk storage contributes to HDFS. In the case of node failures or replacements, there used to be data movement to have another replica of data created.
  • Overutilized during peak hours and underutilized at other times: As the autoscaling capabilities were not there, customers used to do capacity planning beforehand and add nodes to the cluster before usage. This way, clusters used to have a constant number of nodes; during peak usage hours, cluster resources were overutilized and during off-hours, they were underutilized.
  • Centralized resource with the thrashing of resources: As resources get overutilized during peak hours, this leads to the thrashing of resources and affects the performance or collapse of hardware resources.
  • Difficulty in upgrading the entire stack: Setting up and configuring services was a tedious task as you needed to install specific versions of Hadoop applications and when you planned to upgrade, there were no options to roll back or downgrade.
  • Difficulty in managing many different deployments (dev/test): As the cluster setup and configuration was a tedious task, developers didn't have the option to quickly build applications in new versions to prove feasibility. Also, spinning up different development and test environments was a time-consuming process.

To overcome the preceding challenges, AWS came up with Amazon EMR, which is a managed Hadoop cluster that can scale up and down as workload resource needs change.

Overview of Amazon EMR – managed and scalable Hadoop cluster in AWS

To give an overview, Amazon EMR is an AWS tool for big data processing that provides a managed, scalable Hadoop cluster with multiple deployment options that includes EMR on Amazon Elastic Compute Cloud (EC2), EMR on Amazon Elastic Kubernetes Service (EKS), and EMR on AWS Outposts.

Amazon EMR makes it simple to set up, run, and scale your big data environments by automating time-consuming tasks such as provisioning instances, configuring them with Hadoop services, and tuning the cluster parameters for better performance.

Amazon EMR is used in a variety of applications, including Extract, Transform, and Load (ETL), clickstream analysis, real-time streaming, interactive analytics, machine learning, scientific simulation, and bioinformatics. You can run petabyte-scale analytics workloads on EMR for less than half the cost of traditional on-premises solutions and more than three times faster than open source Apache Spark. Every year, customers launch millions of EMR clusters for their batch or streaming use cases.

Before diving into the benefits of EMR compared to an on-premises Hadoop cluster, let's look at a brief history of Hadoop and EMR releases.

A brief history of the major big data releases

Before we go further, the following diagram shows the release period of some of the major databases:

Figure 1.2 – Diagram explaining the history of major big data releases

Figure 1.2 – Diagram explaining the history of major big data releases

As you can see in the preceding diagram, Hadoop was created in 2006 based on Google's MapReduce whitepaper and then AWS launched Amazon EMR in 2009. Since then, EMR has added a lot of features and its recent launch of Amazon EMR on EKS provides the great capability to run Spark workloads in Kubernetes clusters.

Now is a good time to understand the benefits of Amazon EMR and how its cluster is configured to decouple compute and storage.

 

Benefits of Amazon EMR

There are numerous advantages of using Amazon EMR, and this section provides an overview of these advantages. This will in turn help you when looking for solutions based on Hadoop or Spark workloads:

  • Easy to use: You can set up an Amazon EMR cluster in minutes without having to worry about provisioning cluster instances, setting up Hadoop configurations, or tuning the cluster.

You get the ability to create an EMR cluster through the AWS console's user interface (UI), where you have both quick and advanced options to specify your cluster configurations, or you can use AWS command-line interface (CLI) commands or AWS SDK APIs to automate the creation process.

  • Low cost: Amazon EMR pricing is based on the infrastructure on top of which it is deployed. You can choose from the different deployment options EMR provides, but the most popular usage pattern is with Amazon EC2 instances.

When we configure or deploy a cluster on top of Amazon EC2 instances, the pricing depends on the type of EC2 instance and the Region you have selected to launch your cluster. With EC2, you can choose on-demand instances or you can reduce the cost by purchasing reserved instances with a commitment of usage. You can lower the cost even further by using a combination of spot instances, specifically while scaling the cluster with task nodes.

  • Scalability: One of the biggest advantages of EMR compared to on-premises Hadoop clusters is its elastic nature, using which you can increase or decrease the number of instances of your cluster. You can create your cluster with a minimal number of instances and then can scale your cluster as the job demands. EMR provides two scalability options, autoscaling and managed scaling, which scales the cluster based on resource utilization.
  • Flexibility: Though EMR provides a quick cluster creation option, you have full control over your cluster and jobs, where you can make customizations in terms of setup or configurations. While launching the cluster, you can select the default Linux Amazon Machine Images (AMIs) for your instances or integrate custom AMIs and then install additional third-party libraries or configure startup scripts/jobs for the cluster.

You can also use EMR to reconfigure apps on clusters that are already running, without relaunching the clusters.

  • Reliability: Reliability is something that is built into EMR's core implementation. The health of cluster instances is constantly monitored by EMR and it automatically replaces failed or poorly performing instances. Then new tasks get instantiated in newly added instances.

EMR also provides multi-master configuration (up to three master nodes), which makes the master node fault-tolerant. EMR also keeps the service up to date by including stable releases of the open source Hadoop and related application software at regular intervals, which reduces the maintenance effort of the environment.

  • Security: EMR automatically configures a few default settings to make the environment secure, including launching the cluster in Amazon Virtual Private Cloud (VPC) with required network access controls and configuring security groups for EC2 instances.

It also provides additional security configurations that you can utilize to improve the security of the environment, which includes enabling encryption through AWS KMS keys or your own managed keys, configuring strong authentication with Kerberos, and securing the in-transit data through SSL.

You can also use AWS Lake Formation or Apache Ranger to configure fine-grained access control on the cluster databases, tables, or columns. We will dive deep into each of these concepts in later chapters of the book.

  • Ease of integration: When you build a data analytics pipeline, apart from EMR's big data processing capability, you might also need integration with other services to build the production-scale implementation.

EMR has native integration with a lot of additional services and some of the major ones include orchestrating the pipeline with AWS Step Functions or Amazon Managed Workflows for Apache Airflow (MWAA), close integration with AWS IAM to integrate tighter security control, fine-grained access control with AWS Lake Formation, or developing, visualizing, and debugging data engineering and data science applications built in R, Python, Scala, and PySpark using the EMR Studio integrated development environment (IDE).

  • Monitoring: EMR provides in-depth monitoring and audit capability on the cluster using AWS services such as CloudWatch and CloudTrail.

CloudWatch provides a centralized logging platform to track the performance of your jobs and cluster and define alarms based on specific thresholds of specific metrics. CloudTrail provides audit capability on cluster actions. Amazon EMR also has the ability to archive log files in Amazon Simple Storage Service (S3), so you can refer to them for debugging even after your cluster is terminated.

Apart from CloudWatch and CloudTrail, you can also use the Ganglia monitoring tool to monitor cluster instance health, which is available as an optional software configuration when you launch your cluster.

             
About the Author
  • Sakti Mishra

    Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor’s degree in engineering and a master’s degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He is also an author of multiple technology blogs, workshops, white papers and is a public speaker who represents AWS in various domains and events.

    Browse publications by this author
Latest Reviews (1 reviews total)
Dobra książka acz trochę nudna, głównie z uwagi na układ treści. Przez pierwsze rozdziały jest głównie teoria. Prawdziwie ćwiczenia są na końcu. W ćwiczeniach brakuje pisania dokładnych kroków jak sprawić by ćwiczenie się prawidło. Autor zakłada już jakiś ogarnięcie i umiejętności w konfiguracji AWS, IAM, VPC itp. Same ćwiczenia z EMR są niestety też dość kosztowne, ale na to już raczej nic się nie da zrobić.
Simplify Big Data Analytics with Amazon EMR
Unlock this book and the full library FREE for 7 days
Start now