You're reading from Simplify Big Data Analytics with Amazon EMR

Product type Book

Published in Mar 2022

Publisher Packt

ISBN-13 9781801071079

Pages 430 pages

Edition 1st Edition

Concepts

Big Data

Author (1):

Sakti Mishra

Table of Contents (19) Chapters

Preface

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

Chapter 1: An Overview of Amazon EMR

Chapter 2: Exploring the Architecture and Deployment Options

Chapter 3: Common Use Cases and Architecture Patterns

Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

Section 2: Configuration, Scaling, Data Security, and Governance

Chapter 5: Setting Up and Configuring EMR Clusters

Chapter 6: Monitoring, Scaling, and High Availability

Chapter 7: Understanding Security in Amazon EMR

Chapter 8: Understanding Data Governance in Amazon EMR

Section 3: Implementing Common Use Cases and Best Practices

Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark

Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming

Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA

Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR

Chapter 14: Best Practices and Cost-Optimization Techniques

Other Books You May Enjoy

Chapter 6: Monitoring, Scaling, and High Availability

In the previous chapter, you learned how to set up your EMR cluster and configure it with advanced settings related to hardware, software, and security, and how to troubleshoot failures or slow-running clusters. In this chapter, we will dive deeper into cluster monitoring, scaling, and high-availability features.

Scaling cluster resources automatically is important because it spares you from resizing the cluster manually and lets the cluster size follow the needs of specific workloads. In this chapter, you will learn about the autoscaling and managed scaling capabilities of EMR and how Amazon CloudWatch monitoring plays a role in them.

The following are the high-level topics that we will cover in this chapter:

Monitoring your EMR cluster
Scaling cluster resources
Comparing managed scaling with autoscaling
Cluster cloning and high availability with multiple master nodes

Technical requirements

In this chapter, we will dive deep into EMR cluster monitoring, scaling, and high-availability aspects. To test out the features and configurations, you will need the following resources before you get started:

An AWS account
An IAM user that has permission to create an EMR cluster, EC2 instances, and dependent IAM roles and has access to CloudWatch, CloudTrail logs, and more

Now, let's dive deep into the EMR cluster's monitoring aspects, which include the web interfaces available for your cluster's big data applications, Amazon CloudWatch, and CloudTrail logs.

Monitoring your EMR cluster

When you think about monitoring your Amazon EMR cluster, you can consider the following options:

Using the EMR console to get the overall cluster status, the health of nodes, and the high-level status of YARN or Hadoop Spark applications
Analyzing logs generated by EMR and your big data applications, which might be stored on the master node or on the core and task nodes
Accessing the web interfaces of different Hadoop applications to analyze job status or task execution, or using Ganglia to monitor the overall performance of your cluster
Using Amazon CloudWatch for logging, monitoring, and integrating rule-based notifications
Using AWS CloudTrail to audit the access logs for your EMR cluster APIs
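As an illustration of the CloudWatch option in the list above, the following is a minimal sketch of how you might query a cluster's IsIdle metric with boto3. The helper names (build_idle_metric_request, fetch_idle_datapoints) and the cluster ID are hypothetical and only for illustration; the sketch assumes boto3 is installed and AWS credentials are configured before the fetch function is called.

```python
from datetime import datetime, timedelta, timezone


def build_idle_metric_request(cluster_id, hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    The request asks for the EMR IsIdle metric of one cluster over the
    last `hours` hours. `now` can be passed in to make the window fixed.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # EMR publishes its metrics at five-minute intervals
        "Statistics": ["Average"],
    }


def fetch_idle_datapoints(cluster_id, hours=1):
    """Call CloudWatch and return the datapoints sorted by time.

    Requires boto3 and AWS credentials, so it is defined but not run here.
    """
    import boto3  # imported lazily so the request builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **build_idle_metric_request(cluster_id, hours)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Keeping the request builder separate from the API call makes it possible to inspect the parameters without touching AWS. A call such as fetch_idle_datapoints("j-EXAMPLE") would then return the averaged IsIdle values for that (hypothetical) cluster ID.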

We covered the first two options in the previous chapter, where we explained how you can use the EMR console to monitor cluster status and how you can access the logs on the master node and the log archives in Amazon S3.
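For the CloudTrail option in the list above, the following is a rough sketch of how EMR API activity could be pulled from CloudTrail's event history and summarized per user. The function names are hypothetical, and the sketch assumes that boto3 is installed, that credentials are configured, and that the event history lookup (which only covers recent management events) is sufficient for your audit needs.

```python
from collections import Counter


def build_emr_lookup_request():
    """Build keyword arguments that restrict a CloudTrail lookup to EMR API calls."""
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "elasticmapreduce.amazonaws.com",
            }
        ]
    }


def summarize_events(events):
    """Count events per (username, event name) pair.

    `events` is a list of dicts shaped like the entries that CloudTrail
    returns; events without a Username are grouped under "unknown".
    """
    return Counter(
        (event.get("Username", "unknown"), event["EventName"]) for event in events
    )


def fetch_emr_events():
    """Page through CloudTrail's event history for EMR calls.

    Requires boto3 and AWS credentials, so it is defined but not run here.
    """
    import boto3  # imported lazily so the summary helper works without boto3

    cloudtrail = boto3.client("cloudtrail")
    paginator = cloudtrail.get_paginator("lookup_events")
    events = []
    for page in paginator.paginate(**build_emr_lookup_request()):
        events.extend(page["Events"])
    return events
```

Calling summarize_events(fetch_emr_events()) would give a simple count of which user invoked which EMR API, which is one possible starting point for an access report.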

Now, let's...

Scaling cluster resources

When you launch an Amazon EMR cluster for big data processing, the computing capacity you need usually differs from job to job. The resources your cluster needs depend on the data volume or file size, the kind of processing logic you have, and whether the cluster's resources are shared with other jobs.

In a few cases, you have a well-defined data volume and can do capacity planning to launch a fixed-size cluster that does not need any scaling. But in most cases, you will have a variable workload, or a cluster shared by multiple workloads, that needs to react to on-demand capacity needs, so you will need to scale your cluster capacity dynamically.

Amazon EMR gives you flexibility in how you scale cluster resources by providing two scaling features: EMR-managed scaling and autoscaling with a custom scaling policy. When considering automatic scaling of your cluster...
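To make the EMR-managed scaling feature more concrete, here is a minimal sketch of how a managed scaling policy could be attached to a running cluster with boto3. The helper names and the capacity numbers are assumptions chosen for illustration, not recommended values; the sketch also assumes boto3 and AWS credentials are available when the apply function is called.

```python
def build_managed_scaling_policy(
    min_units, max_units, max_on_demand=None, max_core=None, unit_type="Instances"
):
    """Build the ManagedScalingPolicy structure for put_managed_scaling_policy.

    unit_type can be "Instances", "InstanceFleetUnits", or "VCPU".
    The optional limits are only included when a value is given.
    """
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    limits = {
        "UnitType": unit_type,
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    if max_on_demand is not None:
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand
    if max_core is not None:
        limits["MaximumCoreCapacityUnits"] = max_core
    return {"ComputeLimits": limits}


def apply_managed_scaling(cluster_id, policy):
    """Attach the policy to a cluster.

    Requires boto3 and AWS credentials, so it is defined but not run here.
    """
    import boto3  # imported lazily so the policy builder works without boto3

    emr = boto3.client("emr")
    emr.put_managed_scaling_policy(ClusterId=cluster_id, ManagedScalingPolicy=policy)
```

For example, build_managed_scaling_policy(3, 10, max_core=5) describes a cluster that may grow from 3 to 10 instances while keeping at most 5 of them as core nodes.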

Cluster cloning and high availability with multiple master nodes

You have learned about different cluster configurations, such as cluster scaling, debugging, and monitoring. Next, we will look at how to configure your EMR cluster to be highly available with multiple master nodes and how to clone an existing cluster that might be active or terminated.

High availability with multiple master nodes

Starting from EMR 5.23.0, you can launch an EMR cluster with multiple master nodes, which provides high availability for cluster applications such as YARN, HDFS NameNode, Spark, Hive, and Ganglia. You can use the EMR console or the AWS CLI to launch a cluster that has either one or three master nodes. If your cluster's primary master node fails, or your NameNode or ResourceManager crashes, EMR automatically fails over to a standby master node, which makes the cluster fault-tolerant.

EMR automatically replaces the failed node with a new master node that has the same...
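Besides the console and the AWS CLI, a cluster with three master nodes can also be requested through the API. The following is a hedged sketch using boto3's run_job_flow; the helper names, the m5.xlarge instance type, the core node count, and the default role names are assumptions for illustration, and subnet_id is a placeholder you would supply yourself.

```python
def build_instance_groups(master_count=3, core_count=3, instance_type="m5.xlarge"):
    """Build the InstanceGroups list for an EMR cluster.

    EMR accepts either one or three master nodes, so other counts are rejected.
    """
    if master_count not in (1, 3):
        raise ValueError("EMR supports either one or three master nodes")
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": master_count,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]


def launch_ha_cluster(name, subnet_id, release_label="emr-5.23.0"):
    """Launch a cluster with three master nodes and return its cluster ID.

    Requires boto3 and AWS credentials, so it is defined but not run here.
    The role names below are the EMR default roles and are assumed to exist.
    """
    import boto3  # imported lazily so the group builder works without boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name=name,
        ReleaseLabel=release_label,
        Instances={
            "InstanceGroups": build_instance_groups(master_count=3),
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

The only difference from a single-master launch in this sketch is the InstanceCount of the MASTER instance group; the release label must be 5.23.0 or later for multiple master nodes to be accepted.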

Summary

Over the course of this chapter, we got an overview of how to monitor cluster and job activities using a cluster's application interfaces, cluster metrics, and the CloudWatch console. We also saw how to enable auditing on cluster API activities using AWS CloudTrail.

Then, we dived deep into EMR cluster scaling capabilities, which include EMR-managed scaling and autoscaling with custom policies. We also learned how they compare to each other.

Finally, we covered how to make our cluster highly available with multiple master nodes and which applications are supported. We also learned how we can clone an existing cluster to replicate its configurations and steps.

That concludes this chapter! Hopefully, you got a good overview of the monitoring, scaling, and high-availability aspects of the cluster. In the next chapter, we will dive deep into the security aspects of EMR.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

Assume you have a long-running EMR cluster that is being used by multiple teams for ETL jobs and data analysis. Because of its multi-tenant nature, your organization asks that you provide a report of who is accessing the cluster and for which activities. How would you prepare such a report and from where will you collect this information?
Assume you have a long-running EMR cluster that integrates instance fleets into its configurations. Your cluster has one master and three core nodes to start with and you are planning to benefit from EMR scaling capabilities so that when you have more workload, your cluster will scale up, and when the jobs are finished, it will scale down. Out of EMR-managed scaling and autoscaling with custom policies, which one will you choose?
You have a long-running EMR cluster that is being used by multiple teams of your organization. You...