Mastering Hadoop 3

You're reading from Mastering Hadoop 3

Product type: Book
Published in: Feb 2019
Publisher: Packt
ISBN-13: 9781788620444
Pages: 544
Edition: 1st
Authors (2): Chanchal Singh, Manish Kumar

Table of Contents (23 chapters)

Title Page
Dedication
About Packt
Foreword
Contributors
Preface
Journey to Hadoop 3
Deep Dive into the Hadoop Distributed File System
YARN Resource Management in Hadoop
Internals of MapReduce
SQL on Hadoop
Real-Time Processing Engines
Widely Used Hadoop Ecosystem Components
Designing Applications in Hadoop
Real-Time Stream Processing in Hadoop
Machine Learning in Hadoop
Hadoop in the Cloud
Hadoop Cluster Profiling
Who Can Do What in Hadoop
Network and Data Security
Monitoring Hadoop
Other Books You May Enjoy
Index

Chapter 3. YARN Resource Management in Hadoop

From the very beginning of Hadoop's existence, it has consisted of two major parts: the storage part, known as the Hadoop Distributed File System (HDFS), and the processing part, known as MapReduce. In the previous chapter, we discussed the Hadoop Distributed File System, its architecture, and its internals. In Hadoop version 1, the only type of job that could be submitted to and executed on Hadoop was a MapReduce job. In the present era of data processing, real-time and near real-time processing are favored over batch processing. Thus, there is a need for a generic application executor and resource manager that can schedule and execute all types of applications, including MapReduce, in real time or near real time. In this chapter, we will learn about YARN and cover the following topics:

  • YARN architecture 
  • YARN job scheduling and different types of scheduling 
  • Resource Manager high availability
  • What node labels are and their advantages
  • Improvements...

Architecture


YARN stands for Yet Another Resource Negotiator, and was introduced with Apache Hadoop 2.0 to address the scalability and manageability issues that existed in previous versions. In Hadoop 1.0, there were two major components for job execution: the JobTracker and the TaskTrackers. The JobTracker was responsible for managing resources and scheduling jobs, as well as for tracking the status of each job and restarting it on failure. The TaskTrackers were responsible for running tasks and sending progress reports to the JobTracker, which also rescheduled failed tasks on different TaskTrackers. Because the JobTracker could easily become overloaded, Hadoop 2.0 made several architectural changes to eliminate the following limitations of Hadoop 1.0:

  • Scalability: In Hadoop 1.0, the JobTracker is responsible for scheduling jobs, monitoring each job, and restarting them on failure. This means the JobTracker spends the majority of its time managing the application's life cycle...

Introduction to YARN job scheduling


In the previous sections, we talked about the YARN architecture and its components. The Resource Manager has two major components, namely the application manager and the scheduler. The Resource Manager's scheduler is responsible for allocating the required resources to an application based on scheduling policies. Before YARN, Hadoop allocated fixed slots for map and reduce tasks from the available memory, which meant that slots reserved for map tasks could not be used for reduce tasks, and vice versa. YARN does not predefine map and reduce slots; it launches containers for tasks on request, so any free container can be used for either a map or a reduce task. As previously discussed in this chapter, the scheduler does not perform monitoring or status tracking for any application. The scheduler receives requests from per-application application masters with details of their resource requirements and executes its scheduling function...

FIFO scheduler


The FIFO scheduler uses the simple strategy of first come, first served. Memory is allocated to applications in order of request time, which means that the first application in the queue is allocated the required memory, then the second, and so on. If sufficient memory is not available, applications have to wait for enough memory to become available before they can launch their jobs. When the FIFO scheduler is configured, YARN queues incoming requests, adds applications to the queue, and launches them one by one.
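The active scheduler is selected in yarn-site.xml. As a minimal sketch, assuming a stock Apache Hadoop distribution (verify the class name against your version), configuring the FIFO scheduler looks like this:

<!-- yarn-site.xml: select the FIFO scheduler explicitly -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>

The same property switches the cluster to the capacity or fair scheduler by pointing it at the corresponding scheduler class.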

 

Capacity scheduler


The capacity scheduler makes sure that users get a guaranteed minimum amount of configured resources in the YARN cluster. Use of a Hadoop cluster grows with the number of use cases in an organization, and it is very unlikely that an organization will create a separate Hadoop cluster for each use case, as this would increase maintenance effort. A common requirement is that different users within the same organization want a certain amount of resources reserved for when they execute their tasks. The capacity scheduler helps share the cluster resources cost-effectively across different users in the same organization, meeting SLAs by ensuring that no user consumes the resources configured for another user. In short, the cluster resources are shared across multiple user groups. The capacity scheduler works on the concept of queues: the cluster is divided into partitions known as queues, and each queue is assigned a certain percentage of the resources. The...
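As a sketch, queues and their shares are defined in capacity-scheduler.xml; the queue names and percentages below are purely illustrative:

<!-- capacity-scheduler.xml: two hypothetical queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,etl</value>
</property>
<property>
  <!-- guaranteed 60% of the cluster's resources -->
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- may borrow idle capacity up to 80% -->
  <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>40</value>
</property>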

Fair scheduler


In fair scheduling, all applications get an almost equal share of the available resources. With the fair scheduler, when the first application is submitted to YARN, it is assigned all the available resources. If a new application is then submitted, the scheduler starts allocating resources to the new application until both applications have a roughly equal amount of resources for their execution. Unlike the two schedulers discussed before, the fair scheduler protects applications from resource starvation and ensures that the applications in a queue get the memory required for execution. The minimum and maximum resource shares of each scheduling queue are calculated from the configuration provided to the fair scheduler. An application gets the amount of resources configured for the queue it is submitted to, and if a new application is submitted to the same queue, the total...
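As a sketch, per-queue shares for the fair scheduler are typically declared in an allocation file (referenced by the yarn.scheduler.fair.allocation.file property); the queue names, weights, and resource figures here are illustrative:

<?xml version="1.0"?>
<!-- fair-scheduler.xml: illustrative queue definitions -->
<allocations>
  <queue name="analytics">
    <minResources>10240 mb,10 vcores</minResources>
    <maxResources>61440 mb,60 vcores</maxResources>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="etl">
    <weight>1.0</weight>
  </queue>
</allocations>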

Resource Manager high availability


The Resource Manager (RM) is a single point of failure in a YARN cluster, as every request from a client goes through it. The Resource Manager also acts as the central system for allocating resources to various tasks. Failure of the Resource Manager means failure of YARN: clients can neither obtain information about the YARN cluster nor submit applications for execution. It is therefore important to make the Resource Manager highly available to prevent cluster failure. The following are a few important considerations for high availability (a minimal configuration sketch follows the list):

  • Resource Manager state: It is very important to persist the Resource Manager's state; if it is kept only in memory, it is lost when the Resource Manager fails. If the state survives a failure, the Resource Manager can be restarted from the last failure point based on that state. 
  • Running application state: The Resource Manager persistent state store allows...
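A minimal RM HA sketch for yarn-site.xml, assuming two Resource Managers (rm1 and rm2) and a ZooKeeper ensemble for state storage; all hostnames are illustrative:

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <!-- persist RM state so a restarted RM can resume from the failure point -->
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>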

Node labels


An organization's use of Hadoop grows over time as more use cases are onboarded to the platform, and a data pipeline in an organization consists of multiple jobs. A Spark job may need machines with more RAM and powerful processing capabilities, whereas MapReduce can run on less powerful machines. Therefore, a cluster may well consist of different types of machines in order to save infrastructure costs. A YARN node label is nothing but a marker for each machine, so that machines with the same label name can be used for specific jobs. Nodes with more powerful processing capabilities can be labelled with the same name, and jobs that require more powerful machines can then specify that node label during submission. Each node can have only one label assigned to it, which means the cluster is partitioned into disjoint sets of nodes based on node labels. The YARN...
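As a sketch, node labels are first enabled in yarn-site.xml (yarn.node-labels.enabled set to true, plus a label store location in yarn.node-labels.fs-store.root-dir) and then managed with yarn rmadmin; the label and host names here are illustrative:

# Register a label with the cluster
yarn rmadmin -addToClusterNodeLabels "highmem"

# Attach the label to a specific node
yarn rmadmin -replaceLabelsOnNode "node1.example.com=highmem"

# Verify the labels known to the cluster
yarn cluster --list-node-labels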

YARN Timeline server in Hadoop 3.x


The job history server in MapReduce provides information about all current and historical MapReduce jobs. However, it was only able to capture information about MapReduce jobs; it could not capture YARN-level events and metrics. As we know, YARN can run applications other than MapReduce, and thus there was a need for a YARN-specific service that captures information about all applications. The YARN Timeline server is responsible for retrieving current as well as historical information about applications. The metrics and information collected by the YARN Timeline server are generic in nature and hence have a common structure, which helps in debugging the logs and capturing other metrics for any specific use. The Timeline server captures two types of information, which are as follows (a configuration sketch follows the list):

  • Application information: The application is submitted to the queue by the user and each application can...
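As a minimal sketch, assuming a stock Hadoop 3 distribution, the Timeline server is enabled in yarn-site.xml and run as its own daemon (hostname illustrative):

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>timeline.example.com</value>
</property>

The daemon can then be started with yarn --daemon start timelineserver, after which its web UI and REST endpoints serve application history.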

Opportunistic containers in Hadoop 3.x


Containers are allocated to a node by the scheduler only when there are sufficient unallocated resources on that node. YARN guarantees that once the application master dispatches a container to a node, execution will start immediately. A container runs to completion as long as there is no violation of fairness or capacity; that is, unless some other container requests preemption of the node's resources, the container is guaranteed to run to completion.

The current container execution design allows for efficient task execution, but it has two primary limitations, which are as follows:

  • Heartbeat delay: The Node Manager sends heartbeats to its Resource Manager at regular intervals, and each heartbeat request also contains the Node Manager's resource metrics. If any container running on a Node Manager finishes its execution, that information is sent as part of the next heartbeat request, which means that the Resource Manager knows that...
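Opportunistic containers address these limitations. As a sketch, Hadoop 3 gates them behind yarn-site.xml switches such as the following (values illustrative; verify the property names and defaults against your Hadoop version):

<property>
  <!-- allow the RM to allocate opportunistic containers -->
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- how many opportunistic containers may queue at each Node Manager -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>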

Docker containers in YARN


Docker is widely used as a lightweight container technology for various applications. YARN, now widely used as a resource manager for diverse applications, uses Linux facilities to launch containers, and it has added support for Docker containerization: a Docker image can be specified for running a YARN container, and that image can bundle the custom libraries the application needs.
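As a hedged sketch of how this is wired up, the Node Manager is pointed at the LinuxContainerExecutor and the Docker runtime is whitelisted in yarn-site.xml; an application then opts in through environment variables (the image name is illustrative):

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>

A job can then request the Docker runtime by setting, for example, YARN_CONTAINER_RUNTIME_TYPE=docker and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/openjdk:8 in its container environment.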

 

 

The Docker environment is completely isolated from that of the Node Manager. The user does not need to worry about the additional software or modules required to run the application and can focus on running and fine-tuning it. Different versions of the same application can run in parallel, completely isolated from one another.

The ContainerExecutor abstraction has four implementations, which are responsible for providing the resources required to run the application, setting up the environment, and managing the life cycle of containers. They are as follows...

YARN REST APIs


YARN also introduced REST APIs for accessing information about the cluster, its nodes, applications, and so on. You can also build your own application that interacts with YARN services by using these REST APIs. The important REST APIs provided by the YARN service are explained in the following sections.

 

Resource Manager API

The Resource Manager is the primary point of contact for any application, and therefore it serves around 80% of the information that can be accessed via YARN's REST API. The YARN REST API offers many retrieval endpoints, which are explained as follows:

  • Retrieving cluster information: The basic API is used to access cluster information, such as the cluster ID, when the cluster started, the state of the cluster, and the versions of Hadoop and the Resource Manager. The curl request to the REST API looks as follows:
curl -X GET http://localhost:8088/ws/v1/cluster/info

The default response would be in JSON, but you can also specify...
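For instance (a sketch; these endpoint paths follow the standard YARN REST API, so verify them against your Hadoop version), applications can be filtered by state, and XML can be requested instead of the default JSON via an Accept header:

# List applications currently in the RUNNING state
curl -X GET "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING"

# Request cluster metrics as XML instead of JSON
curl -H "Accept: application/xml" http://localhost:8088/ws/v1/cluster/metrics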

YARN command reference


Similar to HDFS, YARN has its own commands for managing the overall YARN cluster. YARN provides two command-line interfaces: one for users who want to run services on a YARN cluster, and the other for administrators who manage the overall YARN cluster.

User command

The user in a Hadoop cluster is the one who submits applications to it. Applications may fail, or sometimes they do not perform well. In such scenarios, logs are the first step in debugging your application, and YARN stores logs for applications and containers that can be accessed via the command-line interface. 
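For example (a sketch with a placeholder application ID), aggregated application logs can be fetched like this:

# Fetch all aggregated logs for an application
yarn logs -applicationId application_1550000000000_0001

# Narrow the output to a single container
yarn logs -applicationId application_1550000000000_0001 -containerId container_1550000000000_0001_01_000002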

 Application commands

The application commands are used to perform operations on applications submitted to the YARN cluster. The operations include listing all the applications in a specific state, killing an application, fetching application logs for debugging, and so on. Let's look into a few commands and how to use them (sample invocations follow the list):

  • -appStates: This command is used along with -list...
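A few illustrative invocations (the application ID is a placeholder):

# List applications currently in the RUNNING state
yarn application -list -appStates RUNNING

# Print the status of a specific application
yarn application -status application_1550000000000_0001

# Kill an application
yarn application -kill application_1550000000000_0001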

Summary


This chapter focused on YARN and its components. We covered various concepts, such as the YARN architecture and Resource Manager high availability, in detail. We then went through the YARN schedulers and how they work, covered some basic scheduler configurations, and explained when to use which. The primary focus was to introduce the new features that have been added in Hadoop version 3; thus, we covered opportunistic containers and the Timeline server. In the last part, we went through the Docker container executor and a few commonly used YARN commands. In the next chapter, we will cover MapReduce processing in detail: we will take a deep dive into the MapReduce flow, look at each step in detail, go through different examples, and look into a few techniques to improve application performance. 
