Monitoring Hadoop

Product type: Book
Published in: Apr 2015
ISBN-13: 9781783281558
Pages: 100
Edition: 1st Edition
Author: Aman Singh

Chapter 2. Hadoop Daemons and Services

In this chapter, we'll look at the Hadoop services and the ports on which they communicate. The aim is not to configure a Hadoop cluster, but to understand it from a monitoring perspective. Hadoop is a distributed platform with various services running across the cluster, and the way these services coordinate and communicate plays a very important role in how the cluster works. This communication takes place over plain TCP/IP, over RPC on top of TCP, or simply over HTTP.

The following topics will be covered in this chapter:

  • Important services and ports used by Hadoop, and how they communicate

  • Common issues faced by various daemons

  • Host level checks

Hadoop is highly configurable and can be tuned to work optimally. Each of the Hadoop components has configuration files with which we can control service ports, data directories...
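Since ports and directories come from the configuration files, a monitoring script can read them instead of hard-coding them. The following is a minimal sketch, assuming the usual `<configuration>/<property>/<name>/<value>` layout of Hadoop's `*-site.xml` files. The sample content, the path `/data/namenode`, and the helper name `read_hadoop_properties` are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return a dict of name -> value from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Hypothetical hdfs-site.xml content, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/namenode</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, val in read_hadoop_properties(SAMPLE).items():
        print(key, "=", val)
```

On a real node, the same function could be fed the text of the actual configuration file, whose location depends on how Hadoop was installed.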

Hadoop daemons


Hadoop is a distributed framework with two important components: HDFS and MapReduce. HDFS is the File System, or storage layer, and MapReduce is the programming model. Hadoop has two main versions: Hadoop 1.0 and Hadoop 2.0. The original Hadoop 1.0 has the NameNode, DataNode, JobTracker, and TaskTracker daemons. Hadoop 2.0 introduced the YARN framework, which replaces JobTracker and TaskTracker with ResourceManager and NodeManager, respectively.

Each layer has a master and a slave to handle the communication and coordination between them. In order to set up monitoring, it is important to take into account the services and ports used by each node.
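The simplest service-level check is to test whether a daemon's port accepts TCP connections. The following is a minimal sketch of such a check. The port numbers are assumptions based on commonly cited stock defaults for the web UIs; any of them may be overridden in a given cluster's configuration, so they should be verified before use. The names `ASSUMED_DEFAULT_PORTS`, `port_is_open`, and `check_daemons` are hypothetical.

```python
import socket

# Assumed stock default ports (web UIs); a real cluster may override
# any of them in its configuration files.
ASSUMED_DEFAULT_PORTS = {
    "NameNode web UI": 50070,
    "DataNode web UI": 50075,
    "JobTracker web UI": 50030,
    "TaskTracker web UI": 50060,
    "ResourceManager web UI": 8088,
    "NodeManager web UI": 8042,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


def check_daemons(host, ports=ASSUMED_DEFAULT_PORTS):
    """Map each daemon name to True/False depending on port reachability."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_daemons("localhost")` on a node returns one boolean per daemon. A port that accepts connections only shows that something is listening there; it does not prove that the daemon is healthy.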

NameNode

NameNode is the master node that takes care of the HDFS File System. From a monitoring perspective, several of its services and ports need attention. The following table lists the parameters that need to be monitored:

YARN framework


YARN (Yet Another Resource Negotiator) is the new resource management framework on which MapReduce runs in Hadoop 2.0. It is designed to scale to large clusters and performs much better than the old framework. It introduces a new set of daemons, and it is good to understand how they communicate with each other. The following diagram explains the daemons and the ports on which they talk:
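Besides checking ports, the ResourceManager can also be queried over HTTP. The following is a minimal sketch, assuming the ResourceManager web services are reachable on port 8088 and expose cluster metrics at `/ws/v1/cluster/metrics`. The port, the path, and the field names `activeNodes`, `lostNodes`, and `unhealthyNodes` are given as recalled from the YARN REST documentation and should be verified against the Hadoop version in use. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_host, rm_port=8088, timeout=5):
    """Fetch the clusterMetrics object from the ResourceManager REST API.

    The port and URL path are assumptions; adjust them to the cluster.
    """
    url = f"http://{rm_host}:{rm_port}/ws/v1/cluster/metrics"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["clusterMetrics"]


def node_health_summary(metrics):
    """Pick out the node counts that usually deserve an alert."""
    keys = ("activeNodes", "lostNodes", "unhealthyNodes")
    # Missing fields default to 0 so the summary always has the same shape.
    return {key: metrics.get(key, 0) for key in keys}
```

A check could then raise an alert whenever `lostNodes` or `unhealthyNodes` in `node_health_summary(fetch_cluster_metrics("rm-host"))` is above zero, where `rm-host` is a placeholder for the ResourceManager host.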

Common issues faced on Hadoop cluster

With a distributed framework at the scale of Hadoop, many things can go wrong. It is not possible to capture every issue that could occur, but from a monitoring perspective, we can list those that are common and easy to monitor. The following table tries to capture the common issues faced in Hadoop:

Parameter: dfs.name.dir / dfs.namenode.name.dir

Description: This is the parameter in hdfs-site...
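The directories named by this parameter hold the NameNode metadata, so running out of space there is worth alerting on. The following is a minimal sketch of a free-space check, assuming the value is a comma-separated list of local paths, optionally written as `file://` URIs. The 10% threshold and the helper names `split_dirs` and `low_space_dirs` are hypothetical choices for illustration.

```python
import shutil


def split_dirs(value):
    """Split a comma-separated dfs.namenode.name.dir value into paths."""
    paths = []
    for item in value.split(","):
        item = item.strip()
        if item.startswith("file://"):
            item = item[len("file://"):]
        if item:
            paths.append(item)
    return paths


def low_space_dirs(value, min_free_fraction=0.10):
    """Return the directories whose free-space fraction is below the threshold."""
    low = []
    for path in split_dirs(value):
        usage = shutil.disk_usage(path)
        # Treat a filesystem that reports zero total size as low on space.
        if usage.total == 0 or usage.free / usage.total < min_free_fraction:
            low.append(path)
    return low
```

An empty list from `low_space_dirs(...)` means that every configured directory is above the threshold.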

Summary


In this chapter, we discussed important Hadoop daemons and the ports on which they listen. Each daemon listens on a specific port and communicates with the respective daemons using a specific protocol. We looked at the ports for NameNode, DataNodes, and JobTracker and how they talk to each other.

Then we set up monitoring for each of the Hadoop nodes to enable host level checks such as disk quota, CPU usage, memory usage, and so on. In the upcoming chapters, we will talk about configuring checks for Hadoop services. In the next chapter, we will deal with Hadoop logging.


Issue: High CPU utilization

Description and steps that could help: This could be due to a high query rate or a faulty job. Use the top command to find the offending processes. On NameNode, it could be due to a large number of handlers or DataNodes sending block reports at...
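Before reaching for top, a script can flag a host that looks overloaded. The following is a minimal sketch that uses the 1-minute load average per CPU as a rough proxy for CPU pressure. This is an assumption made for simplicity: load average is not the same as CPU utilization, and on Linux it also counts processes waiting on I/O. The threshold of 1.0 and the function names are hypothetical.

```python
import os


def cpu_load_ratio(load_1min, cpu_count):
    """Return the 1-minute load average divided by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


def is_cpu_overloaded(threshold=1.0):
    """Compare the current per-CPU load against a threshold (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return cpu_load_ratio(load_1min, os.cpu_count() or 1) > threshold
```

When `is_cpu_overloaded()` returns True, the next step is the one described above: run top on the host to find the offending processes.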