You're reading from Hadoop 2.x Administration Cookbook

Product type: Book
Published in: May 2017
Publisher: Packt
ISBN-13: 9781787126732
Edition: 1st Edition

Author: Aman Singh

Gurmukh Singh is a seasoned technology professional with more than 14 years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last five years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo, and is the author of Monitoring Hadoop, published by Packt Publishing.

Chapter 11. Troubleshooting, Diagnostics, and Best Practices

In this chapter, we will cover the following recipes:

  • Namenode troubleshooting

  • Datanode troubleshooting

  • Resourcemanager troubleshooting

  • Diagnose communication issues

  • Parse logs for errors

  • Hive troubleshooting

  • HBase troubleshooting

  • Hadoop best practices

Introduction


In this chapter, we will look at best practices and troubleshooting techniques for the various components of Hadoop. The same techniques can be used to troubleshoot any other service or application.

With distributed systems and the scale at which Hadoop operates, troubleshooting can become cumbersome. In production, most teams use log management and parsing tools such as Splunk, together with Ganglia, Nagios, or other tools for monitoring and alerting.

In this chapter, we will build basic troubleshooting skills and learn how to quickly look for keywords that point to common errors in a Hadoop cluster. Users are encouraged to read this chapter after Chapter 8, Performance Tuning, to better relate to and understand the recipes here.

Namenode troubleshooting


In this recipe, we will see how to find issues with the Namenode and resolve them. As this is a recipe book, we will keep the theory to a minimum, but users must understand the motive behind the commands and how the mentioned tools work.

Getting ready

To step through the recipes in this chapter, make sure you have gone through the steps to install a Hadoop cluster with HDFS and YARN enabled. Use a multi-node Hadoop cluster for better understanding and troubleshooting practice.

It is assumed that the user has basic knowledge of networking fundamentals, Linux commands, and filesystems.

How to do it...

Scenario 1: Namenode not starting due to permission issues on the Namenode directory.

  1. Connect to the master1.cyrus.com master node in the cluster and change to user hadoop.

  2. Try to write a test file to the Namenode directory using the following command. If it succeeds, then the permissions are fine:

    $ touch /data/namenode1/test
    
  3. Otherwise, make sure the permission of the directory...
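The permission check in steps 2-3 can be sketched as follows. The path /data/namenode1 and the hadoop owner are this lab's values; the fallbacks are only so the commands can be tried anywhere, so adjust them to match your dfs.namenode.name.dir.

```shell
# A sketch of steps 2-3, parameterized so it can be pointed at the real
# storage directory (lab value: /data/namenode1, owned by user hadoop).
NN_DIR="${NN_DIR:-$(mktemp -d)}"       # lab value: /data/namenode1
OWNER="${OWNER:-$(id -un)}"            # lab value: hadoop
chown -R "$OWNER" "$NN_DIR" 2>/dev/null || true
chmod 700 "$NN_DIR"                    # Namenode storage dirs expect drwx------
touch "$NN_DIR/test" && rm "$NN_DIR/test" && echo "writable: $NN_DIR"
```

If the `touch` fails even after the ownership fix, check the mount options and free space on the underlying disk as well.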

Datanode troubleshooting


In this recipe, we will look at some of the common issues with Datanode and how to resolve them.

Getting ready

The user is expected to complete the previous recipe and must have completed the Setting up multi-node HBase cluster recipe in Chapter 9, HBase Administration. In this recipe, we will be using the already configured Hadoop cluster.

How to do it...

Scenario 1: Datanode not starting due to permission issues on the Datanode directory specified by dfs.datanode.data.dir:

  1. Connect to the dn1.cyrus.com Datanode in the cluster and change to user hadoop.

  2. Try to write a test file to the location using the following command:

    $ touch /space/dn1/test
    

    If it succeeds, then the permissions are fine.

  3. Otherwise, make sure the directories pointed to by dfs.datanode.data.dir are owned by the correct user. This is shown in the following screenshot:

  4. The user could be hadoop or hdfs. Also, the directory permission is 755 for the top directory, as shown in the following...
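The steps above can be sketched for every directory listed in dfs.datanode.data.dir. The lab value is /space/dn1; the scratch-dir fallback below is only so the loop runs anywhere.

```shell
# A sketch of steps 2-4 for each Datanode data directory
# (lab value: /space/dn1, owned by user hadoop or hdfs).
DN_DIRS="${DN_DIRS:-$(mktemp -d) $(mktemp -d)}"
for d in $DN_DIRS; do
  chown -R "$(id -un)" "$d" 2>/dev/null || true
  chmod 755 "$d"                 # 755 on the top-level directory, per step 4
  touch "$d/test" && rm "$d/test" && echo "writable: $d"
done
```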

Resourcemanager troubleshooting


In this recipe, we will look at common Resourcemanager issues and how these can be addressed.

Getting ready

To step through the recipe in this section, make sure the users have completed the Setting up multi-node HBase cluster recipe in Chapter 9, HBase Administration.

How to do it…

Scenario 1: Resourcemanager daemon not starting.

  1. The Resourcemanager, by default, binds to ports 8030 to 8033 and 8088. These ports can be configured in the yarn-site.xml file; make sure they are unique and not used by any other service. In our labs, we used the ports shown in the following screenshot:

  2. The listening ports can be seen by using the following command:

    $ netstat -tlpn
    
  3. Look in the logs for any bind errors and make sure the hostname is resolvable. Check both forward and reverse lookup:

    $ nslookup <resource_manager_host>
    
  4. On the Node Manager, the important ports are 8040, 8041, and 8042. These are used for scheduling, localization, and so on. So,...
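The port check in steps 1-2 can be scripted. The port list assumes the YARN defaults (8030-8033 and 8088); compare it against the yarn.resourcemanager.*.address values in your yarn-site.xml.

```shell
# A sketch: confirm each ResourceManager port is actually bound.
listening=$(netstat -tln 2>/dev/null || ss -tln 2>/dev/null || true)
for p in 8030 8031 8032 8033 8088; do
  echo "$listening" | grep -q ":$p " \
    && echo "port $p: listening" \
    || echo "port $p: NOT listening"
done
```

A port reported as NOT listening, combined with a bind error in the Resourcemanager log, usually means another process grabbed the port or the daemon is bound to the wrong interface.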

Diagnose communication issues


In this recipe, we will look at how to troubleshoot communication issues between nodes and how we can quickly find common errors.

Getting ready

To step through the recipe, the user must have completed the Setting up multi-node HBase cluster recipe in Chapter 9, HBase Administration and have gone through the previous recipes in this chapter. It is good to have a basic knowledge of the DNS and TCP communication.

How to do it...

  1. Connect to the master1.cyrus.com master node in the cluster and switch to user hadoop.

  2. The first thing is to check which connections are already established to the nodes. This can be seen with the following command, as shown here:

  3. Check the reachability of nodes in the cluster using the following commands and also ensure reverse lookup for each host in the cluster:

    $ ping master1.cyrus.com
    $ ping dn1.cyrus.com
    $ nslookup "IP of Namenode, RM and Datanodes"
    
  4. If there is a reachability issue, check for firewall rules on any intermediate network devices...
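Steps 2-3 can be sketched as follows; the hostnames are from this lab's cluster, so substitute your own node names.

```shell
# List established TCP connections, then check that forward and reverse
# lookup agree for each cluster host.
( netstat -tn 2>/dev/null || ss -tn 2>/dev/null || true ) | head -15
for h in master1.cyrus.com dn1.cyrus.com; do
  ip=$(getent hosts "$h" | awk '{print $1}' | head -1)
  if [ -n "$ip" ]; then
    echo "$h -> $ip"
    getent hosts "$ip" || true   # reverse mapping should name the same host
  else
    echo "$h does not resolve"
  fi
done
```

If forward and reverse lookups disagree, many Hadoop daemons will refuse connections or log authentication errors, so fix DNS (or /etc/hosts) before chasing firewall rules.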

Parse logs for errors


In this recipe, we will look at how to parse logs and quickly find errors. There are job logs, which are aggregated on HDFS, as well as daemon logs, system logs, and so on.

We will look at some keywords and commands to find the errors in logs.

Getting ready

To complete the recipe, the user must have a running Hadoop cluster, must have completed the Setting up multi-node HBase cluster recipe in Chapter 9, HBase Administration, and know Bash or Perl/Python scripting basics.

How to do it...

  1. Connect to the edge1.cyrus.com node in the cluster and switch to user hadoop. We can, however, connect to any node in the cluster that has access to the logs.

  2. The YARN logs on the cluster are exported over NFS and mounted at /logs/hadoop on the Edge node. Refer to the HDFS as NFS export recipe.

  3. All the other logs, such as system and daemon logs, from the cluster are exported to the location /logs/system.

  4. If the user is not from a Linux system background...
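The keyword-driven parsing described above can be sketched as follows. In the lab, LOG_DIR would be the /logs/hadoop NFS mount from step 2; the sample file below is only so the commands can be tried anywhere, and the keyword list is a common starting set, not exhaustive.

```shell
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
[ -s "$LOG_DIR/sample.log" ] || printf 'INFO started\nERROR disk failure\nFATAL shutting down\n' > "$LOG_DIR/sample.log"
# Pull the interesting lines:
grep -hEi 'error|fatal|exception' "$LOG_DIR"/*.log | tail -50
# Count hits per file to see which daemon log is noisiest:
grep -cEi 'error|fatal|exception' "$LOG_DIR"/*.log | sort -t: -k2 -rn | head
```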

Hive troubleshooting


In this recipe, we will look at Hive troubleshooting steps and important keywords in the logs, which can help us to identify issues.

Getting ready

For this recipe, the user must have completed the Operating Hive with ZooKeeper recipe in Chapter 7, Data Ingestion and Workflow and have a basic understanding of database connectivity.

How to do it...

  1. Connect to the edge1.cyrus.com Edge node and switch to user hadoop.

  2. The Hive query log location is defined by hive.querylog.location and the HiveServer2 operation log location is defined by hive.server2.logging.operation.log.location.

  3. As an example, if we try to query a table that does not exist, we can see the errors in the Hive log, as shown in the following screenshot:

  4. Make it a habit to read logs while troubleshooting, as they will give hints about the errors.

  5. Make sure Hive is able to connect to the Hive metastore. To verify this, first connect manually, as shown here:

    $ mysql -u hadoop -h master1.cyrus.com -p
    
  6. Make sure the user used in Hive Hadoop...
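Before connecting manually as in step 5, it helps to confirm which metastore URL Hive is actually configured with. The sketch below greps it out of hive-site.xml; CONF falls back to a sample file here, but in the lab it would be $HIVE_HOME/conf/hive-site.xml, and the URL shown is illustrative.

```shell
CONF="${CONF:-$(mktemp)}"
[ -s "$CONF" ] || cat > "$CONF" <<'EOF'
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://master1.cyrus.com:3306/metastore</value>
</property>
EOF
# Extract the metastore JDBC URL:
grep -A2 'javax.jdo.option.ConnectionURL' "$CONF" | grep -o 'jdbc:[^<]*'
# Then verify the database answers with the same credentials:
#   mysql -u hadoop -h master1.cyrus.com -p -e 'SHOW DATABASES;'
```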

HBase troubleshooting


In this recipe, we will look at HBase troubleshooting and how to identify some of the common issues in the HBase cluster.

Getting ready

Make sure that the user has completed the Setting up multi-node HBase cluster recipe in Chapter 9, HBase Administration for this section, and the assumption is that HDFS and YARN are working fine. Refer to previous recipes to troubleshoot any issues with the Hadoop cluster, before starting troubleshooting of HBase.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to user hadoop.

  2. Firstly, make sure ZooKeeper is up and the ensemble is healthy, as shown in the following screenshot (this applies only if an external ZooKeeper is used):

  3. Rather than starting the entire cluster in one go, start each component one by one. Start the HBase master using the following command:

    $ hbase-daemon.sh start master
    
  4. Quickly check which nodes and services the HBase master is talking to. In the following screenshot, we can see connections to ZooKeeper...
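The ensemble health check in step 2 can be sketched with ZooKeeper's ruok four-letter word: a healthy server replies "imok". The hostnames below are assumptions from this lab's setup.

```shell
# Poll each ZooKeeper server before starting HBase daemons one by one.
check_zk() {
  [ "$(echo ruok | nc -w 3 "$1" 2181 2>/dev/null)" = "imok" ]
}
for zk in master1.cyrus.com dn1.cyrus.com; do
  check_zk "$zk" && echo "$zk: healthy" || echo "$zk: no imok reply"
done
# Only once the ensemble is healthy, start the master, then the regionservers:
#   hbase-daemon.sh start master
```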

Hadoop best practices


In this section, we will cover some of the common best practices for the Hadoop cluster in terms of log management and troubleshooting tools.

These are not from a tuning perspective, but to make things easier to troubleshoot and diagnose.

Things to keep in mind:

  1. Always enable logs for each daemon running in the Hadoop cluster. Keep the logging level at INFO and, when needed, change it to DEBUG. Once the troubleshooting is done, revert to INFO.

  2. Implement log rotation and retention policies to manage the logs.

  3. Use tools such as Nagios to alert on any errors in the cluster before they become an issue.

  4. Use log aggregation and analysis tools such as Splunk to parse logs.

  5. Never co-locate the log disk with the data disks in the cluster.

  6. Use central configuration management systems such as Puppet or Chef to maintain consistent configuration across the cluster.

  7. Schedule a benchmarking job to run every day on the cluster and proactively predict any bottlenecks. This can be...
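The log rotation practice in item 2 can be sketched with logrotate. This is only an illustrative fragment; the path assumes daemon logs are kept under /var/log/hadoop, so adjust it to your HADOOP_LOG_DIR, and tune the retention to your own policy.

```
# /etc/logrotate.d/hadoop -- a sketch, assuming logs under /var/log/hadoop
/var/log/hadoop/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```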

