
You're reading from  Hadoop 2.x Administration Cookbook

Product type: Book
Published in: May 2017
Publisher: Packt
ISBN-13: 9781787126732
Edition: 1st
Author: Aman Singh

Gurmukh Singh is a seasoned technology professional with over 14 years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last 5 years, providing consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo, and is the author of Monitoring Hadoop, published by Packt Publishing.

Chapter 3. Maintaining Hadoop Cluster – YARN and MapReduce

In this chapter, we will cover the following recipes:

  • Running a simple MapReduce program

  • Hadoop streaming

  • Configuring YARN history server

  • Job history web interface and metrics

  • Configuring ResourceManager components

  • YARN containers and resource allocations

  • ResourceManager Web UI and JMX metrics

  • Preserving ResourceManager states

Introduction


In the previous chapters, we learned about HDFS, the storage layer: how to configure it and what its different components are. We mainly talked about the Namenode, the Datanode, and their related concepts.

In this chapter, we will take a look at the processing layer, MapReduce, and the resource management framework, YARN. Prior to Hadoop 2.x, MapReduce was the only processing layer for Hadoop; the introduction of the YARN framework provided a pluggable processing layer, which could be MapReduce, Spark, and so on.

Note

While the recipes in this chapter will give you an overview of a typical configuration, we encourage you to adapt this proposal according to your needs. The deployment directory structure varies according to IT policies within an organization.

Running a simple MapReduce program


In this recipe, we will look at how to make sense of the data stored on HDFS and extract useful information from the files, such as the number of occurrences of a string or a pattern, estimations, and various benchmarks. For this purpose, we can use MapReduce, a computation framework that helps us answer many questions we might have about the data.

With Hadoop, we can process huge amounts of data. However, to get an understanding of how it works, we'll start with a simple program, such as a pi estimation or a word count example.
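The pi example bundled with Hadoop estimates pi by sampling points and counting how many land inside a quarter circle. As a rough local sketch of that idea (the real job distributes samples across mappers and uses a quasi-random sequence, so this is an illustration of the method, not the Hadoop code itself):

```python
import random

def estimate_pi(samples, seed=42):
    """Estimate pi by sampling points in the unit square and
    counting how many fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi/4
    return 4.0 * inside / samples

if __name__ == "__main__":
    print(estimate_pi(100_000))
```

On a cluster, the equivalent run uses the examples JAR shipped with Hadoop, along the lines of `hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100` (the exact path and JAR name depend on your install).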

The ResourceManager is the master for Yet Another Resource Negotiator (YARN). The Namenode stores the file metadata, and the actual blocks/data reside on the slave nodes, called Datanodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers.

When a job is submitted to ResourceManager (RM), it will check for the job queue it is submitted to and whether the user has permissions...

Hadoop streaming


In this recipe, we will look at how we can execute jobs on a Hadoop cluster using scripts written in Bash or Python. It is not mandatory to use only Java for writing MapReduce code; any language can be used by invoking the Hadoop streaming utility. Do not confuse this with real-time streaming, which is different from what we discuss here.

Getting ready

To step through the recipes in this chapter, make sure you have a running cluster with HDFS and YARN set up correctly, as discussed in the previous chapters. This can be a single-node or a multinode cluster, as long as it is configured correctly.

It is not necessary to know Java to run MapReduce programs on Hadoop. Users can carry forward their existing scripting knowledge and use Bash or Python to run the job on Hadoop.

How to do it...

  1. Connect to an edge node in the cluster and switch to user hadoop.

  2. The streaming JAR ships with Hadoop under /opt/cluster/hadoop/share/hadoop/tools/lib/hadoop...
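A streaming job pipes input records through the mapper's stdin/stdout, sorts by key, then pipes the sorted stream through the reducer. A minimal word count written as a single Python script (both stages in one file for brevity; this is a sketch, and the script name `wc.py` is our choice, not anything Hadoop mandates):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum the counts for each word; the input must be sorted by key,
    which Hadoop's shuffle phase guarantees."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if stage == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

It would be submitted roughly as `hadoop jar <streaming-jar> -input <in> -output <out> -mapper "python wc.py map" -reducer "python wc.py reduce" -file wc.py`, where the streaming JAR is the one found in the tools/lib directory mentioned in step 2 (the exact JAR name varies by Hadoop version). You can also dry-run it locally with `cat input | python wc.py map | sort | python wc.py reduce`.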

Configuring YARN history server


Whenever a MapReduce job runs, it launches containers on multiple nodes, and the logs for each container are written only on that particular node. If users need the details of a job, they have to go to all the nodes to fetch the logs, which can be very tedious in large clusters.

A better approach is to aggregate the logs at a common location once the job finishes, from where they can be accessed using a web server or other means. To address this, the History Server was introduced in Hadoop to aggregate logs and provide a Web UI where users can see the logs for all the containers of a job in one place.

Getting ready

You need to have a running cluster with YARN set up and should have completed the previous recipe to make sure the cluster is working fine in terms of HDFS and YARN.

The following steps will guide you through the process of setting up the Job History Server.

How to do it...

  1. Connect to the ResourceManager node, which is the YARN master, and switch to the user hadoop.

  2. Navigate...
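The configuration this recipe ends up touching typically looks like the following. The hostname matches this cluster's naming; the ports shown are the stock defaults, and the values are illustrative rather than recommendations:

```xml
<!-- mapred-site.xml: where the Job History Server listens -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>master1.cyrus.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>master1.cyrus.com:19888</value>
</property>

<!-- yarn-site.xml: aggregate container logs to HDFS once a job finishes -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```

After the configuration is in place, the daemon is started on the designated node with `mr-jobhistory-daemon.sh start historyserver` (from the Hadoop sbin directory in 2.x deployments).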

Job history web interface and metrics


In the previous recipe, we enabled the History Server; now we will use the Web UI to explore YARN metrics and job history.

Getting ready

Make sure you have completed the previous recipe and have a History Server running as a daemon, as shown here in the list of processes:

How to do it...

  1. Using a web browser, connect to the JobHistoryServer Web UI port, which in this case is port 19888 on the host master1.cyrus.com.

  2. Once connected to the Web UI, the user can see JobHistory and other details as shown here:

  3. Under the Tools section on the left-hand side, the user can see links to view the YARN parameters currently in effect, using the configuration link as shown here:

  4. Another section is metrics, which gives information about JvmMetrics, stats, and so on. The output format is JSON:

  5. The preceding output can also be viewed from the command line, as shown in the following screenshot:
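The metrics endpoint returns JSON shaped as `{"beans": [...]}`, one entry per JMX bean. A small sketch of pulling a single attribute out of such a payload; the sample below is a trimmed, illustrative stand-in for real output, which lists many more beans and attributes:

```python
import json

# Trimmed, illustrative sample of what a Hadoop daemon's /jmx endpoint
# returns; real output lists many beans (JvmMetrics, UgiMetrics, and so on).
SAMPLE = '''
{"beans": [
  {"name": "Hadoop:service=JobHistoryServer,name=JvmMetrics",
   "MemHeapUsedM": 24.5, "GcCount": 7}
]}
'''

def bean_attr(payload, bean_suffix, attr):
    """Return one attribute from the first bean whose name ends
    with bean_suffix, or None if no such bean/attribute exists."""
    for bean in json.loads(payload)["beans"]:
        if bean.get("name", "").endswith(bean_suffix):
            return bean.get(attr)
    return None

if __name__ == "__main__":
    print(bean_attr(SAMPLE, "JvmMetrics", "MemHeapUsedM"))
```

On a live cluster, the same payload comes from the daemon's web port, for example `curl http://master1.cyrus.com:19888/jmx` for the History Server configured in this chapter.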

How it works...

In Hadoop, each daemon has a built-in web server, which is a Jetty server...

Configuring ResourceManager components


In YARN, the ResourceManager is modular in nature: it is primarily limited to scheduling and is not concerned with application state management, which is delegated to the ApplicationMasters. Although the ResourceManager (RM) has many components, the core ones are the ApplicationsManager (AsM), the ApplicationMaster Launcher, and the scheduler. The AsM keeps track of which AM is assigned to which job and requests the launch of an AM through the AM Launcher. These components are all part of the ResourceManager and are depicted in the following diagram. They can be segregated for better control and management of resources:

In this recipe, we will see how the different components can be separated out, although this is not necessary, and controlled independently.

Getting ready

Before starting with this recipe, it is good to read about the components of the ResourceManager and what each component does; there are a lot of good resources available online for this. Also, make sure that there is a running cluster...

YARN containers and resource allocations


In YARN, there are many configuration parameters that control the memory available to the AM and containers, the total memory that can be allocated for MapReduce, the JVM heap size, and the number of CPU cores to be used for a job. This is covered in more detail in Chapter 8, Performance Tuning, but as a rough guide: allocate one core per container, give each Mapper container about 1 GB of memory, give the Reducer twice the memory of the Mapper, and keep about 20% of each node spare for the operating system and Hadoop daemons.
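That rule of thumb reduces to quick arithmetic. A hedged sketch, where the 20% reserve and the 1 GB/2 GB container sizes are the rough guidance from this section rather than universal constants:

```python
def container_plan(node_mem_gb, node_cores, os_reserve=0.20,
                   map_mem_gb=1, reduce_mem_gb=2):
    """Rough container sizing for one node: reserve a slice for the
    OS and Hadoop daemons, then see how many 1 GB mappers or 2 GB
    reducers fit, capped at roughly one core per container."""
    usable_gb = node_mem_gb * (1 - os_reserve)
    return {
        "usable_gb": usable_gb,
        "max_maps": min(int(usable_gb // map_mem_gb), node_cores),
        "max_reduces": min(int(usable_gb // reduce_mem_gb), node_cores),
    }

if __name__ == "__main__":
    # A 64 GB / 16-core node: about 51 GB usable, so cores, not
    # memory, become the limit for 1 GB mappers.
    print(container_plan(64, 16))
```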

Getting ready

For this recipe, you will again need a running cluster and should have completed the previous recipes to make sure the cluster is working fine in terms of HDFS and YARN.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to user hadoop.

  2. Navigate to the directory /opt/cluster/hadoop/etc/hadoop.

  3. Edit the configuration file yarn-site.xml, to make...
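The properties this step edits typically include the following. The values are illustrative, not recommendations; size them against your nodes using the rule of thumb above:

```xml
<!-- yarn-site.xml: memory and cores a NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>

<!-- smallest and largest container the scheduler will grant -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```

The per-task sizes live in mapred-site.xml, in `mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb`, with the corresponding JVM heaps in `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`.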

ResourceManager Web UI and JMX metrics


In the previous recipe, we saw how to configure parameters for YARN and MapReduce. As stated earlier, each daemon runs a Jetty web server, which can be accessed using a web browser.

Note that the RPC ports are different from the HTTP ports and must not be confused with the options we used in the previous recipe. There are default web ports, such as 50070 for the Namenode, 8088 for the ResourceManager, and 50075 for the Datanode. All of these can be changed to custom ports, if needed.

Getting ready

Make sure that the user has a running cluster with YARN and HDFS configured. The user must be able to run MapReduce jobs on it.

How to do it...

  1. Point your web browser to http://master1.cyrus.com:8088 to access the ResourceManager Web UI:

  2. The Web UI gives information on the running applications and the resources they use, as shown in the following screenshot:

  3. The web interface also shows the scheduler in use, which by default is the Capacity Scheduler:

  4. The ResourceManager...

Preserving ResourceManager states


It is important to preserve the state of the ResourceManager across RM restarts, so as to keep applications running with minimal interruption. The idea is that the RM persists the application state to a store and reloads it on restart. ApplicationMasters (AMs) and NodeManagers continuously poll the RM for its status and re-register with it when it becomes available, resuming containers from the saved state.

Getting ready

For this recipe, you will again need a running cluster and have completed the previous recipes to make sure the cluster is working fine in terms of HDFS and YARN.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to user hadoop.

  2. Navigate to the directory /opt/cluster/hadoop/etc/hadoop.

  3. Edit the yarn-site.xml configuration file to make the necessary changes as shown in the following steps.

  4. Enable RM recovery by making changes as shown in the following screenshot:

  5. Specify the state-store to be used for this, as shown in the following...
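The recovery and state-store settings in yarn-site.xml typically look like the following. The HDFS URI is an assumption matching this cluster's naming; the store class shown keeps state in HDFS, and `ZKRMStateStore` is the ZooKeeper-backed alternative (required when RM high availability is enabled):

```xml
<!-- yarn-site.xml: turn on RM recovery -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>

<!-- keep application state in HDFS -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://master1.cyrus.com:9000/rmstore</value>
</property>
```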

