Home Cloud & Networking Hadoop 2.x Administration Cookbook

Hadoop 2.x Administration Cookbook

By Aman Singh
books-svg-icon Book
eBook $43.99 $29.99
Print $54.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $43.99 $29.99
Print $54.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Hadoop Architecture and Deployment
About this book
Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploit its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration. The book begins with laying the foundation by showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain Hadoop cluster, especially on the HDFS layer and using YARN and MapReduce. Further on, you will explore durability and high availability of a Hadoop cluster. You’ll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration. By the end of this book, you will have a proper understanding of working with Hadoop clusters and will also be able to secure, encrypt it, and configure auditing for your Hadoop clusters.
Publication date:
May 2017
Publisher
Packt
Pages
348
ISBN
9781787126732

 

Chapter 1. Hadoop Architecture and Deployment

In this chapter, we will cover the following recipes:

  • Overview of Hadoop Architecture

  • Building and compiling Hadoop

  • Installation methods

  • Setting up host resolution

  • Installing a single-node cluster - HDFS components

  • Installing a single-node cluster - YARN components

  • Installing a multi-node cluster

  • Configuring Hadoop Gateway node

  • Decommissioning nodes

  • Adding nodes to the cluster

 

Introduction


As Hadoop is a distributed system with many components, and has a reputation of getting quite complex, it is important to understand the basic Architecture before we start with the deployments.

In this chapter, we will take a look at the Architecture and the recipes to deploy a Hadoop cluster in various modes. This chapter will also cover recipes on commissioning and decommissioning nodes in a cluster.

The recipes in this chapter will primarily focus on deploying a cluster based on an Apache Hadoop distribution, as it is the best way to learn and explore Hadoop.

Note

While the recipes in this chapter will give you an overview of a typical configuration, we encourage you to adapt this design according to your needs. The deployment directory structure varies according to IT policies within an organization. All our deployments will be based on the Linux operating system, as it is the most commonly used platform for Hadoop in production. You can use any flavor of Linux; the recipes are very generic in nature and should work on all Linux flavors, with the appropriate changes in path and installation methods, such as yum or apt-get.

Overview of Hadoop Architecture

Hadoop is a framework and not a tool. It is a combination of various components, such as a filesystem, processing engine, data ingestion tools, databases, workflow execution tools, and so on. Hadoop is based on client-server Architecture with a master node for each storage layer and processing layer.

Namenode is the master for Hadoop distributed file system (HDFS) storage and ResourceManager is the master for YARN (Yet Another Resource Negotiator). The Namenode stores the file metadata and the actual blocks/data reside on the slave nodes called Datanodes. All the jobs are submitted to the ResourceManager and it then assigns tasks to its slaves, called NodeManagers. In a highly available cluster, we can have more than one Namenode and ResourceManager.

Both masters are each a single point of failure, which makes them very critical components of the cluster and so care must be taken to make them highly available.

Although there are many concepts to learn, such as application masters, containers, schedulers, and so on, as this is a recipe book, we will keep the theory to a minimum.

 

Building and compiling Hadoop


The pre-build Hadoop binary available at www.apache.org, is a 32-bit version and is not suitable for the 64-bit hardware as it will not be able to utilize the entire addressable memory. Although, for lab purposes, we can use the 32-bit version, it will keep on giving warnings about the "not being built for the native library", which can be safely ignored.

In production, we will always be running Hadoop on hardware which is a 64-bit version and can support larger amounts of memory. To properly utilize memory higher than 4 GB on any node, we need the 64-bit complied version of Hadoop.

Getting ready

To step through the recipes in this chapter, or indeed the entire book, you will need at least one preinstalled Linux instance. You can use any distribution of Linux, such as Ubuntu, CentOS, or any other Linux flavor that the reader is comfortable with. The recipes are very generic and are expected to work with all distributions, although, as stated before, one may need to use distro-specific commands. For example, for package installation in CentOS we use yum package installer, or in Debian-based systems we use apt-get, and so on. The user is expected to know basic Linux commands and should know how to set up package repositories such as the yum repository. The user should also know how the DNS resolution is configured. No other prerequisites are required.

How to do it...

  1. ssh to the Linux instance using any of the ssh clients. If you are on Windows, you need PuTTY. If you are using a Mac or Linux, there is a default terminal available to use ssh. The following command connects to the host with an IP of 10.0.0.4. Change it to whatever the IP is in your case:

    $ ssh root@10.0.0.4
    
  2. Change to the user root or any other privileged user:

    $ sudo su -
    
  3. Install the dependencies to build Hadoop:

    # yum install gcc gcc-c++ openssl-devel make cmake jdk-1.7u45(minimum)
    
  4. Download and install Maven:

    wget mirrors.gigenet.com/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
    
  5. Untar Maven:

    # tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/
    
  6. Set up the Maven environment:

    # cat /etc/profile.d/maven.sh
    export JAVA_HOME=/usr/java/latest
    export M3_HOME=/opt/apache-maven-3.3.9
    export PATH=$JAVA_HOME/bin:/opt/apache-maven-3.3.9/bin:$PATH
    
  7. Download and set up protobuf:

    # wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
    # tar -xzf protobuf-2.5.0.tar.gz -C /root
    # cd /opt/protobuf-2.5.0/
    # ./configure
    # make;make install
    
  8. Download the latest Hadoop stable source code. At the time of writing, the latest Hadoop version is 2.7.3:

    # wget apache.uberglobalmirror.com/hadoop/common/stable2/hadoop-2.7.3-src.tar.gz
    # tar -xzf hadoop-2.7.3-src.tar.gz -C /opt/
    # cd /opt/hadoop-2.7.2-src
    # mvn package -Pdist,native -DskipTests -Dtar
    
  9. You will see a tarball in the folder hadoop-2.7.3-src/hadoop-dist/target/.

How it works...

The tarball package created will be used for the installation of Hadoop throughout the book. It is not mandatory to build a Hadoop from source, but by default the binary packages provided by Apache Hadoop are 32-bit versions. For production, it is important to use a 64-bit version so as to fully utilize the memory beyond 4 GB and to gain other performance benefits.

 

Installation methods


Hadoop can be installed in multiple ways, either by using repository methods such as Yum/apt-get or by extracting the tarball packages. The project Bigtop http://bigtop.apache.org/ provides Hadoop packages for infrastructure, and can be used by creating a local repository of the packages.

All the steps are to be performed as the root user. It is expected that the user knows how to set up a yum repository and Linux basics.

Getting ready

You are going to need a Linux machine. You can either use the one which has been used in the previous task or set up a new node, which will act as repository server and host all the packages we need.

How to do it...

  1. Connect to a Linux machine that has at least 5 GB disk space to store the packages.

  2. If you are on CentOS or a similar distribution, make sure you have the package yum-utils installed. This package will provide the command reposync.

  3. Create a file bigtop.repo under /etc/yum.repos.d/. Note that the file name can be anything—only the extension must be .repo.

  4. See the following screenshot for the contents of the file:

  5. Execute the command reposync –r bigtop. It will create a directory named bigtop under the present working directory with all the packages downloaded to it.

  6. All the required Hadoop packages can be installed by configuring the repository we downloaded as a repository server.

How it works...

From step 2 to step 6, the user will be able to configure and use the Hadoop package repository. Setting up a Yum repository is not required, but it makes things easier if we have to do installations on hundreds of nodes. In larger setups, management systems such as Puppet or Chef will be used for deployment configuration to push configuration and packages to nodes.

In this chapter, we will be using the tarball package that was built in the first section to perform installations. This is the best way of learning about directory structure and the configurations needed.

 

Setting up host resolution


Before we start with the installations, it is important to make sure that the host resolution is configured and working properly.

Getting ready

Choose any appropriate hostnames the user wants for his or her Linux machines. For example, the hostnames could be master1.cluster.com or rt1.cyrus.com or host1.example.com. The important thing is that the hostnames must resolve.

This resolution can be done using a DNS server or by configuring the/etc/hosts file on each node we use for our cluster setup.

The following steps will show you how to set up the resolution in the/etc/hosts file.

How to do it...

  1. Connect to the Linux machine and change the hostname to master1.cyrus.com in the file as follows:

  2. Edit the/etc/hosts file as follows:

  3. Make sure the resolution returns an IP address:

    # getent hosts master1.cyrus.com
    
  4. The other preferred method is to set up the DNS resolution so that we do not have to populate the hosts file on each node. In the example resolution shown here, the user can see that the DNS server is configured to answer the domain cyrus.com:

    # nslookup master1.cyrus.com
    Server:		10.0.0.2
    Address:	10.0.0.2#53
    
    Non-authoritative answer:
    Name:	master1.cyrus.com
    Address: 10.0.0.104
    

How it works...

Each Linux host has a resolver library that helps it resolve any hostname that is asked for. It contacts the DNS server, and if it is not found there, it contacts the hosts file. Users who are not Linux administrators can simply use the hosts files as a workaround to set up a Hadoop cluster. There are many resources available online that could help you to set up a DNS quickly if needed.

Once the resolution is in place, we will start with the installation of Hadoop on a single-node and then progress to multiple nodes.

 

Installing a single-node cluster - HDFS components


Usually the term cluster means a group of machines, but in this recipe, we will be installing various Hadoop daemons on a single node. The single machine will act as both the master and slave for the storage and processing layer.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as root user, it is a good practice to run it as a non-privileged user. In this recipe, we are using the node name nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install JDK, which will be used by Hadoop services. The minimum recommended version of JDK is 1.7, but Open JDK can also be used.

How to do it...

  1. Log into the machine/host as root user and install jdk:

    # yum install jdk –y
    or it can also be installed using the command as below
    # rpm –ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, make sure Java is in PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable. The following screenshot shows the content of the directory where Java gets installed:

    # export JAVA_HOME=/usr/java/latest
    
  3. Now we need to copy the tarball hadoop-2.7.3.tar.gz--which was built in the Build Hadoop section earlier in this chapter—to the home directory of the user root. For this, the user needs to login to the node where Hadoop was built and execute the following command:

    # scp –r hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
    
  4. Create a directory named/opt/cluster to be used for Hadoop:

    # mkdir –p /opt/cluster
    
  5. Then untar the hadoop-2.7.3.tar.gz to the preceding created directory:

    # tar –xzvf hadoop-2.7.3.tar.gz  -C /opt/Cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:

    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. As step 6 was done by the root user, the directory and file under /opt/cluster will be owned by the root user. Change the ownership to the Hadoop user:

    # chown -R hadoop:hadoop /opt/cluster/
    
  8. If the user lists the directory structure under /opt/cluster, he will see it as follows:

  9. The directory structure under /opt/cluster/hadoop-2.7.3 will look like the one shown in the following screenshot:

  10. The listing shows etc, bin, sbin, and other directories.

  11. The etc/hadoop directory is the one that contains the configuration files for configuring various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.xml, and mapred-site.xml among others, which will be explained in the later sections:

  12. The directories bin and sbin contain executable binaries, which are used to start and stop Hadoop daemons and perform other operations such as filesystem listing, copying, deleting, and so on:

  13. To execute a command /opt/cluster/hadoop-2.7.3/bin/hadoop, a complete path to the command needs to be specified. This could be cumbersome, and can be avoided by setting the environment variable HADOOP_HOME.

  14. Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations:

  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, execute the command to export the variables defined in it:

  16. Change to the Hadoop user using the command su – hadoop:

  17. Change to the /opt/cluster directory and create a symlink:

  18. To verify that the preceding changes are in place, the user can execute either the which Hadoop or which java commands, or the user can execute the command hadoop directly without specifying the complete path.

  19. In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.

  20. The next thing is to set up the Namenode address, which specifies the host:port address on which it will listen. This is done using the file core-site.xml:

  21. The important thing to keep in mind is the property fs.defaultFS.

  22. The next thing that the user needs to configure is the location where Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. This is configured in the file hdfs-site.xml:

  23. The next step is to format the Namenode. This will create an HDFS file system:

    $ hdfs namenode -format
    
  24. Similarly, we have to add the rule for the Datanode directory under hdfs-site.xml. Nothing needs to be done to the core-site.xml file:

  25. Then the services need to be started for Namenode and Datanode:

    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check for running daemons:

How it works...

The master Namenode stores metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.

The parameters dfs.data.dir and dfs.datanode.data.dir are used for the same purpose, but are used across different versions. The older parameters are deprecated in favor of the newer ones, but they will still work. The parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. The intention of showing both versions of the parameter is to bring to the user's notice that parameters are evolving and ever changing, and care must be taken by referring to the release notes for each Hadoop version.

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer—that is, the HDFS for storing data—but what about the processing layer?. The data on HDFS needs to be processed to make a meaningful decision using MapReduce, Tez, Spark, or any other tool. To run the MapReduce, Spark or other processing framework we need to have ResourceManager, NodeManager.

 

Installing a single-node cluster - YARN components


In the previous recipe, we discussed how to set up Namenode and Datanode for HDFS. In this recipe, we will be covering how to set up YARN on the same node.

After completing this recipe, there will be four daemons running on the nn1.cluster1.com node, namely namenode, datanode, resourcemanager, and nodemanager daemons.

Getting ready

For this recipe, you will again use the same node on which we have already configured the HDFS layer.

All operations will be done by the hadoop user.

How to do it...

  1. Log in to the node nn1.cluster1.com and change to the hadoop user.

  2. Change to the /opt/cluster/hadoop/etc/hadoop directory and configure the files mapred-site.xml and yarn-site.xml:

  3. The file yarn-site.xml specifies the shuffle class, scheduler, and resource management components of the ResourceManager. You only need to specify yarn.resourcemanager.address; the rest are automatically picked up by the ResourceManager. You can see from the following screenshot that you can separate them into their independent components:

  4. Once the configurations are in place, the resourcemanager and nodemanager daemons need to be started:

  5. The environment variables that were defined by /etc/profile.d/hadoopenv.sh included YARN_HOME and YARN_CONF_DIR, which let the framework know about the location of the YARN configurations.

How it works...

The nn1.cluster1.com node is configured to run HDFS and YARN components. Any file that is copied to the HDFS will be split into blocks and stored on Datanode. The metadata of the file will be on the Namenode.

Any operation performed on a text file, such as word count, can be done by running a simple MapReduce program, which will be submitted to the single node cluster using the ResourceManager daemon and executed by the NodeManager. There are a lot of steps and details as to what goes on under the hood, which will be covered in the coming chapters.

Note

The single-node cluster is also called pseudo-distributed cluster.

There's more...

A quick check can be done on the functionality of HDFS. You can create a simple text file and upload it to HDFS to see whether it is successful or not:

$ hadoop fs –put test.txt /

This will copy the file test.txt to the HDFS. The file can be read directly from HDFS:

$ hadoop fs –ls /
$ hadoop fs –cat /test.txt

See also

  • The Installing multi-node cluster recipe

 

Installing a multi-node cluster


In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster with each daemon running on separate nodes.

There will be one node for Namenode, one for ResourceManager, and four nodes will be used for Datanode and NodeManager. In production, the number of Datanodes could be in the thousands, but here we are just restricted to four nodes. The Datanode and NodeManager coexist on the same nodes for the purposes of data locality and locality of reference.

Getting ready

Make sure that the six nodes the user chooses have JDK installed, with name resolution working. This could be done by making entries in the /etc/hosts file or using DNS.

In this recipe, we are using the following nodes:

  • Namenode: nn1.cluster1.com

  • ResourceManager: jt1.cluster1.com

  • Datanodes and NodeManager: dn[1-4].cluster1.com

How to do it...

  1. Make sure all the nodes have the hadoop user.

  2. Create the directory structure /opt/cluster on all the nodes.

  3. Make sure the ownership is correct for /opt/cluster.

  4. Copy the /opt/cluster/hadoop-2.7.3 directory from the nn1.cluster.com to all the nodes in the cluster:

    $ for i in 192.168.1.{72..75};do scp -r hadoop-2.7.3 $i:/opt/cluster/ $i;done
    
  5. The preceding IPs belong to the nodes in the cluster. The user needs to modify them accordingly. Also, to prevent it from prompting for password for each node, it is good to set up pass phraseless access between each node.

  6. Change to the directory /opt/cluster and create a symbolic link on each node:

    $ ln –s hadoop-2.7.3 hadoop
    
  7. Make sure that the environment variables have been set up on all nodes:

    $ . /etc/profile.d/hadoopenv.sh
    
  8. On Namenode, only the parameters specific to it are needed.

  9. The file core-site.xml remains the same across all nodes in the cluster.

  10. On Namenode, the file hdfs-site.xml changes as follows:

  11. On Datanode, the file hdfs-site.xml changes as follows:

  12. On Datanodes, the file yarn-site.xml changes as follows:

  13. On the node jt1, which is ResourceManager, the file yarn-site.xml is as follows:

  14. To start Namenode on nn1.cluster1.com, enter the following:

    $ hadoop-daemon.sh start namenode
    
  15. To start Datanode and NodeManager on dn[1-4], enter the following:

    $ hadoop-daemon.sh start datanode
    $ yarn-daemon.sh start nodemanager
    
  16. To start ResourceManager on jt1.cluster.com, enter the following:

    $ yarn-daemon.sh start resourcemanager
    
  17. On each node, execute the command jps to see the daemons running on them. Make sure you have the correct services running on each node.

  18. Create a text file test.txt and copy it to HDFS using hadoop fs –put test.txt /. This confirms that HDFS is working fine.

  19. To verify that YARN has been set up correctly, run the simple "Pi" estimation program:

    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-example.jar Pi 3 3
    

How it works...

Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.

The user should see four Datanodes reporting to the cluster, and should also be able to access the UI of the Namenode on port 50070 and on port 8088 for ResourceManager.

To see the number of nodes talking to Namenode, enter the following:

$ hdfs dfsadmin -report
  Configured Capacity: 9124708352 (21.50 GB)
  Present Capacity: 5923942400 (20.52 GB)
  DFS Remaining: 5923938304 (20.52 GB)
  DFS Used: 4096 (4 KB)
  DFS Used%: 0.00%
Live datanodes (4):

The same information can also be retrieved using the Namenode Web UI as shown in the following screenshot:

Note

The user can configure any customer port for any service, but there should be a good reason to change the defaults.

 

Configuring the Hadoop Gateway node


Hadoop Gateway or edge node is a node that connects to the Hadoop cluster, but does not run any of the daemons. The purpose of an edge node is to provide an access point to the cluster and prevent users from a direct connection to critical components such as Namenode or Datanode.

Another important reason for its use is the data distribution across the cluster. If a user connects to a Datanode and performs the data copy operation hadoop fs –put file /, then one copy of the file will always go to the Datanode from which the copy command was executed. This will result in an imbalance of data across the node. If we upload a file from a node that is not a Datanode, then data will be distributed evenly for all copies of data.

In this recipe, we will configure an edge node for a Hadoop cluster.

Getting ready

For the edge node, the user needs a separate Linux machine with Java installed and the user hadoop in place.

How to do it...

  1. ssh to the new node that is to be configured as Gateway node. For example, the node name could be client1.cluster1.com.

  2. Set up the environment variable as discussed before. This can be done by setting the /etc/profile.d/hadoopenv.sh file.

  3. Copy the already configured directory hadoop-2.7.3 from Namenode to this node (client1.cluster1.com). This avoids doing all the configuration for files such as core-site.xml and yarn-site.xml.

  4. The edge node just needs to know about the two master nodes of Namenode and ResourceManager. It does not need any other configuration for the time being. It does not store any data locally, unlike Namenode and Datanode.

  5. It only needs to write temporary files and logs. In later chapters, we will see other parameters for MapReduce and performance tuning that go on this node.

  6. Create a symbolic link ln –s hadoop-2.7.3 hadoop so that the commands and Hadoop configuration files are visible.

  7. There will be no daemon started on this node. Execute a command from the edge node to make sure the user can connect to hadoop fs –ls /.

  8. To verify that the edge node has been set up correctly, run the simple "Pi" estimation program from the edge node:

    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-example.jar Pi 3 3
    

How it works...

The edge node or the Gateway node connects to Namenode for all HDFS-related operation and connects to ResourceManager for submitted jobs to the cluster.

In production, there will be more than one edge node connecting to the cluster for high availability. This is can be done by using a load balancer or DNS round-robin. No user should run any local jobs on the edge nodes or use it for doing non Hadoop-related tasks.

See also

Edge node can be used to configure many additional components, such as PIG, Hive, Sqoop, rather than installing them on the main cluster nodes like Namenode, Datanode. This way it is easy to segregate the complexity and restrict access to just edge node.

  • The Configuring Hive recipe

 

Decommissioning nodes


There will always be failures in clusters, such as hardware issues or a need to upgrade nodes. This should be done in a graceful manner, without any data loss.

When the Datanode daemon is stopped on a Datanode, it takes approximately ten minutes for the Namenode to remove that node. This has to do with the heartbeat retry interval. At any time, we can abruptly remove the Datanode, but it can result in data loss.

It is recommended that you opt for the graceful removal of the node from the cluster, as this ensures that all the data on that node is drained.

Getting ready

For the following steps, we assume that the cluster that is up and running with Datanodes is in a healthy state and the one with the Datanode dn1.cluster1.com needs maintenance and must be removed from the cluster. We will login to the Namenode and make changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml by adding the following property to it:

    <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/excludes</value>
    <final>true</final>
    </property>
  2. Make sure the file excludes is readable by the user hadoop.

  3. Restart the Namenode daemon for the property to take effect:

    $ hadoop-daemons.sh stop namenode
    $ hadoop-daemons.sh start namenode
    
  4. A restart of Namenode is required only when any property is changed in the file. Once the property is in place, Namenode can read the changes to the contents of the file excludes by simply refreshing nodes.

  5. Add the dn1.cluster1.com node to the file excludes:

    $ cat excludes
    dn1.cluster1.com
    
  6. After adding the node to the file, we just need to reload the file by doing the following:

    $ hadoop dfsadmin -refreshNodes
    
  7. After sometime, the node will be decommissioned. The time will vary according to the data the particular Datanode had. We can see the decommissioned nodes using the following:

    $ hdfs dfsadmin -report
    
  8. The preceding command will list the nodes in the cluster, and against the dn1.cluster1.com node we can see that its status will either be decommissioning or decommissioned.

How it works...

Let's have a look at what we did throughout this recipe.

In steps 1 through 6, we added the new property to the hdfs-site.xml file and then restarted Namenode to make it aware of the changes. Once the property is in place, the Namenode is aware of the excludes file, and it can be asked to re-read by simply refreshing the node list, as done in step 6.

With these steps, the data on the Datanode dn1.cluster1.com will be moved to other nodes in the cluster, and once the data has been drained, the Datanode daemon on the node will be shutdown. During the process, the node will change the status from normal to decommissioning and then to decommissioned.

Care must be taken while decommissioning nodes in the cluster. The user should not decommission multiple nodes at a time as this will generate lot of network traffic and cause congestion and data loss.

See also

  • The Add nodes to the cluster recipe

 

Adding nodes to the cluster


Over a period of time, our cluster will grow in data and there will be a need to increase the capacity of the cluster by adding more nodes.

We can add Datanodes to the cluster in the same way that we first configured the Datanode started the Datanode daemon on it. But the important thing to keep in mind is that all nodes can be part of the cluster. It should not be that anyone can just start a Datanode daemon on his laptop and join the cluster, as it will be disastrous. By default, there is nothing preventing any node being a Datanode, as the user has just to untar the Hadoop package and point the file "core-site.xml" to the Namenode and start the Datanode daemon.

Getting ready

For the following steps, we assume that the cluster that is up and running with Datanodes is in a healthy state and we need to add a new Datanode in the cluster. We will login to the Namenode and make changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml to add the following property to it:

    <property>
    <name>dfs.hosts</name>
    <value>/home/hadoop/includes</value>
    <final>true</final>
    </property>
  2. Make sure the file includes is readable by the user hadoop.

  3. Restart the Namenode daemon for the property to take effect:

    $ hadoop-daemons.sh stop namenode
    $ hadoop-daemons.sh start namenode
    
  4. A restart of Namenode is required only when any property is changed in the file. Once the property is in place, Namenode can read the changes to the contents of the includes file by simply refreshing the nodes.

  5. Add the dn1.cluster1.com node to the file excludes:

    $ cat includes
    dn1.cluster1.com
    
  6. The file includes or excludes can contain a list of multiple nodes, one node per line.

  7. After adding the node to the file, we just need to reload the file by entering the following:

    $ hadoop dfsadmin -refreshNodes
    
  8. After some time, the node will be available in the cluster and can be seen:

    $ hdfs dfsadmin -report
    

How it works...

The file /home/hadoop/includes will contain a list of all the Datanodes that are allowed to join a cluster. If the file includes is blank, then all Datanodes are allowed to join the cluster. If there is both an include and exclude file, the list of nodes must be mutually exclusive in both the files. So, to decommission the node dn1.cluster.com from the cluster, it must be removed from the includes file and added to the excludes file.

There's more...

In addition to controlling the nodes as we described, there will be firewall rules in place and separate VLANs for Hadoop clusters to keep the traffic and data isolated.

About the Author
  • Aman Singh

    Gurmukh Singh is a seasoned technology professional with 14+ years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo. He has authored Monitoring Hadoop by Packt Publishing

    Browse publications by this author
Latest Reviews (3 reviews total)
Très bien, c'est un bon livre
Great tutorials that makes it easy to learn.
Já li 100 páginas, o livro orienta perfeitamente a extensa gama de recursos de administração do Haddop, e ainda prepara para a certificação Cloudera Admin, perfeito!
Hadoop 2.x Administration Cookbook
Unlock this book and the full library FREE for 7 days
Start now