Simplify Big Data Analytics with Amazon EMR


Product type Book
Published in Mar 2022
Publisher Packt
ISBN-13 9781801071079
Pages 430 pages
Edition 1st Edition
Author Sakti Mishra

Table of Contents (19 chapters)

Preface

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

  • Chapter 1: An Overview of Amazon EMR
  • Chapter 2: Exploring the Architecture and Deployment Options
  • Chapter 3: Common Use Cases and Architecture Patterns
  • Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

Section 2: Configuration, Scaling, Data Security, and Governance

  • Chapter 5: Setting Up and Configuring EMR Clusters
  • Chapter 6: Monitoring, Scaling, and High Availability
  • Chapter 7: Understanding Security in Amazon EMR
  • Chapter 8: Understanding Data Governance in Amazon EMR

Section 3: Implementing Common Use Cases and Best Practices

  • Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
  • Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
  • Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
  • Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
  • Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
  • Chapter 14: Best Practices and Cost-Optimization Techniques

Other Books You May Enjoy

Chapter 6: Monitoring, Scaling, and High Availability

In the previous chapter, you learned how to set up your EMR cluster and configure it with advanced settings related to hardware, software, and security and how to troubleshoot failures or slow-running clusters. In this chapter, we will dive deeper into cluster monitoring, scaling, and high-availability features.

Scaling cluster resources automatically is important because it saves you from resizing the cluster manually and lets the cluster size follow the needs of specific workloads. In this chapter, you will learn about the autoscaling and managed scaling capabilities of EMR and how Amazon CloudWatch monitoring plays a role in them.

The following are the high-level topics that we will cover in this chapter:

  • Monitoring your EMR cluster
  • Scaling cluster resources
  • Comparing managed scaling with autoscaling
  • Cluster cloning and high availability with multiple master nodes

Technical requirements

In this chapter, we will dive deep into EMR cluster monitoring, scaling, and high-availability aspects. To test out the features and configurations, you will need the following resources before you get started:

  • An AWS account
  • An IAM user that has permission to create an EMR cluster, EC2 instances, and dependent IAM roles and has access to CloudWatch, CloudTrail logs, and more

Now, let's dive deep into the EMR cluster's monitoring aspects, which include the web interfaces available for your cluster's big data applications, Amazon CloudWatch, and CloudTrail logs.

Monitoring your EMR cluster

When you think about monitoring your Amazon EMR cluster, you can consider the following options:

  • Using the EMR console to get the overall cluster status, the health of nodes, and the high-level status of YARN, Hadoop, or Spark applications
  • Analyzing logs generated by EMR and your big data applications, which might be stored on the master node or on the core and task nodes
  • Accessing the web interfaces of different Hadoop applications to analyze job status or task execution, or using Ganglia to monitor the overall performance of your cluster
  • Using Amazon CloudWatch for logging, monitoring, and integrating rule-based notifications
  • Using AWS CloudTrail to audit the access logs for your EMR cluster APIs
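
As a rough illustration of the CloudTrail option, the following Python sketch builds a LookupEvents request limited to EMR API calls and counts calls per user and action. The helper names (`build_lookup_request`, `summarize_events`, `fetch_emr_events`) and the seven-day default window are assumptions made for this example, not something from the chapter. The `fetch_emr_events` function assumes boto3 is installed and AWS credentials are configured, and it is not called here.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Event source that CloudTrail records for Amazon EMR API calls.
EMR_EVENT_SOURCE = "elasticmapreduce.amazonaws.com"


def build_lookup_request(days_back=7, now=None):
    """Build keyword arguments for CloudTrail LookupEvents, limited to EMR events."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": EMR_EVENT_SOURCE}
        ],
        "StartTime": end - timedelta(days=days_back),
        "EndTime": end,
    }


def summarize_events(events):
    """Count API calls per (user, action) pair from a list of event records."""
    counts = Counter()
    for event in events:
        user = event.get("Username", "unknown")
        action = event.get("EventName", "unknown")
        counts[(user, action)] += 1
    return counts


def fetch_emr_events(days_back=7):
    """Fetch EMR events from CloudTrail (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("cloudtrail")
    paginator = client.get_paginator("lookup_events")
    events = []
    for page in paginator.paginate(**build_lookup_request(days_back)):
        events.extend(page["Events"])
    return events
```

Passing the result of `fetch_emr_events()` to `summarize_events()` gives a simple per-user activity count that could serve as the starting point for an access report.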

We covered the first two options in the previous chapter, where we explained how you can use the EMR console to monitor cluster status and how you can access logs available in the master node with the log archive to Amazon S3.

Now, let's...

Scaling cluster resources

When you launch an Amazon EMR cluster for big data processing, the computing capacity you need usually varies from job to job. The resources your cluster needs depend on the data volume or file size, the kind of processing logic you have, and whether your cluster resources are being shared with other jobs.

There are a few cases where the data volume is well defined and you can do capacity planning up front to launch a fixed-size cluster that does not need any scaling. But in most cases, you will have a variable workload, or a cluster shared by multiple workloads, that has to respond to on-demand capacity needs, so you will need to scale your cluster capacity dynamically.

Amazon EMR gives you flexibility in how cluster resources scale by providing two scaling features: EMR-managed scaling and autoscaling with a custom scaling policy. When considering automatic scaling of your cluster...
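
To make EMR-managed scaling more concrete, here is a minimal Python sketch that builds the compute limits for a managed scaling policy in the shape expected by the boto3 EMR client's `put_managed_scaling_policy` call. The helper names and the sanity checks are assumptions added for this example and are not a statement of the service's own validation rules. The `apply_policy` function assumes boto3 and AWS credentials are available and is not called here.

```python
ALLOWED_UNIT_TYPES = {"Instances", "InstanceFleetUnits", "VCPU"}


def build_managed_scaling_policy(min_units, max_units, unit_type="Instances",
                                 max_on_demand=None, max_core=None):
    """Build a ManagedScalingPolicy dict with basic sanity checks (illustrative)."""
    if unit_type not in ALLOWED_UNIT_TYPES:
        raise ValueError(f"unsupported unit type: {unit_type}")
    if min_units < 1 or min_units > max_units:
        raise ValueError("expected 1 <= min_units <= max_units")

    limits = {
        "UnitType": unit_type,
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    # Optional caps on the On-Demand and core portions of the cluster.
    if max_on_demand is not None:
        if max_on_demand > max_units:
            raise ValueError("max_on_demand cannot exceed max_units")
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand
    if max_core is not None:
        if max_core > max_units:
            raise ValueError("max_core cannot exceed max_units")
        limits["MaximumCoreCapacityUnits"] = max_core
    return {"ComputeLimits": limits}


def apply_policy(cluster_id, policy):
    """Attach the policy to a running cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without it

    client = boto3.client("emr")
    client.put_managed_scaling_policy(ClusterId=cluster_id,
                                      ManagedScalingPolicy=policy)
```

For a cluster that uses instance fleets, you would pass `unit_type="InstanceFleetUnits"` so that the limits are expressed in fleet capacity units rather than instance counts.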

Cluster cloning and high availability with multiple master nodes

You have learned about different cluster configurations, such as cluster scaling, debugging, and monitoring. Next, we will look at how to configure your EMR cluster to be highly available with multiple master nodes and how to clone an existing cluster that might be active or terminated.

High availability with multiple master nodes

Starting from EMR 5.23.0, you can launch an EMR cluster with multiple master nodes, which provides high availability for cluster applications such as YARN, HDFS NameNode, Spark, Hive, and Ganglia. You can use the EMR console or the AWS CLI to launch a cluster that has either one or three master nodes. If your cluster's primary master node fails, or your NameNode or ResourceManager crashes, then EMR will automatically fail over to a standby master node, which makes the cluster fault-tolerant.

EMR automatically replaces the failed node with a new master node that has the same...
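
The constraints described above (one or three master nodes, EMR 5.23.0 or later for multiple masters) can be sketched as a small Python helper that builds the instance group list in the shape used by the boto3 EMR client's `run_job_flow` call. The function names, the default instance types, and the core node count are illustrative assumptions, not values from the chapter.

```python
MIN_MULTI_MASTER_RELEASE = (5, 23, 0)


def parse_release(label):
    """Turn a release label such as 'emr-5.23.0' into a tuple like (5, 23, 0)."""
    version = label[4:] if label.startswith("emr-") else label
    return tuple(int(part) for part in version.split("."))


def build_instance_groups(master_count, release_label,
                          master_type="m5.xlarge",  # assumed instance type
                          core_type="m5.xlarge",    # assumed instance type
                          core_count=3):            # assumed core node count
    """Build master and core instance groups for a single- or multi-master cluster."""
    if master_count not in (1, 3):
        raise ValueError("an EMR cluster has either one or three master nodes")
    if master_count == 3 and parse_release(release_label) < MIN_MULTI_MASTER_RELEASE:
        raise ValueError("multiple master nodes require EMR 5.23.0 or later")

    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": master_type, "InstanceCount": master_count},
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": core_type, "InstanceCount": core_count},
    ]
```

The returned list would go under `Instances={"InstanceGroups": ...}` in a `run_job_flow` request, alongside the release label and the other cluster settings.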

Summary

Over the course of this chapter, we got an overview of how to monitor cluster and job activities using a cluster's application interfaces, cluster metrics, and the CloudWatch console. We also saw how to enable auditing on cluster API activities using AWS CloudTrail.

Then, we dived deep into EMR cluster scaling capabilities, which include EMR-managed scaling and autoscaling with custom policies. We also learned how they compare to each other.

Finally, we covered how to make our cluster highly available with multiple master nodes and what the supported applications are. We also learned how we can clone an existing cluster to replicate its configurations and steps.

That concludes this chapter! Hopefully, you got a good overview of monitoring, scaling, and high-availability aspects of the cluster, and in the next chapter, we can dive deep into security aspects of EMR.

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

  1. Assume you have a long-running EMR cluster that is being used by multiple teams for ETL jobs and data analysis. Because of its multi-tenant nature, your organization asks that you provide a report of who is accessing the cluster and for which activities. How would you prepare such a report and from where will you collect this information?
  2. Assume you have a long-running EMR cluster that integrates instance fleets into its configurations. Your cluster has one master and three core nodes to start with and you are planning to benefit from EMR scaling capabilities so that when you have more workload, your cluster will scale up, and when the jobs are finished, it will scale down. Out of EMR-managed scaling and autoscaling with custom policies, which one will you choose?
  3. You have a long-running EMR cluster that is being used by multiple teams of your organization. You...

Further reading

The following are a few resources you can refer to for further reading:
