In this chapter, we will cover the following recipes:
Overview of Hadoop Architecture
Building and compiling Hadoop
Installation methods
Setting up host resolution
Installing a single-node cluster - HDFS components
Installing a single-node cluster - YARN components
Installing a multi-node cluster
Configuring Hadoop Gateway node
Decommissioning nodes
Adding nodes to the cluster
As Hadoop is a distributed system with many components and has a reputation for being quite complex, it is important to understand the basic architecture before we start with the deployments.
In this chapter, we will take a look at the architecture and the recipes to deploy a Hadoop cluster in various modes. This chapter will also cover recipes on commissioning and decommissioning nodes in a cluster.
The recipes in this chapter will primarily focus on deploying a cluster based on an Apache Hadoop distribution, as it is the best way to learn and explore Hadoop.
Note
While the recipes in this chapter will give you an overview of a typical configuration, we encourage you to adapt this design according to your needs. The deployment directory structure varies according to IT policies within an organization. All our deployments will be based on the Linux operating system, as it is the most commonly used platform for Hadoop in production. You can use any flavor of Linux; the recipes are very generic in nature and should work on all Linux flavors, with the appropriate changes in paths and installation methods, such as yum or apt-get.
Hadoop is a framework, not a tool. It is a combination of various components, such as a filesystem, processing engine, data ingestion tools, databases, workflow execution tools, and so on. Hadoop is based on a client-server architecture, with a master node for each of the storage and processing layers.
Namenode is the master for Hadoop Distributed File System (HDFS) storage, and ResourceManager is the master for YARN (Yet Another Resource Negotiator). The Namenode stores the file metadata, and the actual blocks/data reside on the slave nodes called Datanodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers. In a highly available cluster, we can have more than one Namenode and ResourceManager.
Each master is a single point of failure, which makes them very critical components of the cluster, so care must be taken to make them highly available.
Although there are many concepts to learn, such as application masters, containers, schedulers, and so on, as this is a recipe book, we will keep the theory to a minimum.
The pre-built Hadoop binary available at www.apache.org is a 32-bit version and is not suitable for 64-bit hardware, as it will not be able to utilize the entire addressable memory. For lab purposes, we can still use the 32-bit version, but it will keep giving warnings about the native library not being built for the platform, which can be safely ignored.
In production, we will always be running Hadoop on 64-bit hardware that can support larger amounts of memory. To properly utilize memory higher than 4 GB on any node, we need the 64-bit compiled version of Hadoop.
To step through the recipes in this chapter, or indeed the entire book, you will need at least one preinstalled Linux instance. You can use any distribution of Linux, such as Ubuntu, CentOS, or any other flavor you are comfortable with. The recipes are very generic and are expected to work with all distributions, although, as stated before, you may need to use distro-specific commands. For example, for package installation we use the yum package installer on CentOS and apt-get on Debian-based systems, and so on. The user is expected to know basic Linux commands and should know how to set up package repositories, such as a yum repository. The user should also know how DNS resolution is configured. No other prerequisites are required.
ssh to the Linux instance using any ssh client. If you are on Windows, you need PuTTY. If you are using a Mac or Linux, a terminal with ssh is available by default. The following command connects to a host with the IP 10.0.0.4; change it to whatever the IP is in your case:
$ ssh root@10.0.0.4
Change to the user root or any other privileged user:
$ sudo su -
Install the dependencies to build Hadoop:
# yum install gcc gcc-c++ openssl-devel make cmake jdk-1.7u45(minimum)
Download and install Maven:
# wget mirrors.gigenet.com/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
Untar Maven:
# tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/
Set up the Maven environment:
# cat /etc/profile.d/maven.sh
export JAVA_HOME=/usr/java/latest
export M3_HOME=/opt/apache-maven-3.3.9
export PATH=$JAVA_HOME/bin:/opt/apache-maven-3.3.9/bin:$PATH
Download and set up protobuf:
# wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
# tar -xzf protobuf-2.5.0.tar.gz -C /opt/
# cd /opt/protobuf-2.5.0/
# ./configure
# make; make install
Download the latest stable Hadoop source code. At the time of writing, the latest Hadoop version is 2.7.3:
# wget apache.uberglobalmirror.com/hadoop/common/stable2/hadoop-2.7.3-src.tar.gz
# tar -xzf hadoop-2.7.3-src.tar.gz -C /opt/
# cd /opt/hadoop-2.7.3-src
# mvn package -Pdist,native -DskipTests -Dtar
You will see a tarball in the folder hadoop-2.7.3-src/hadoop-dist/target/.
The tarball package created will be used for the installation of Hadoop throughout the book. It is not mandatory to build Hadoop from source, but by default the binary packages provided by Apache Hadoop are 32-bit versions. For production, it is important to use a 64-bit build so as to fully utilize memory beyond 4 GB and to gain other performance benefits.
Hadoop can be installed in multiple ways, either by using repository methods such as yum/apt-get or by extracting the tarball packages. The Apache Bigtop project (http://bigtop.apache.org/) provides packaging infrastructure for Hadoop and can be used by creating a local repository of the packages.
All the steps are to be performed as the root user. It is expected that the user knows how to set up a yum repository and is familiar with Linux basics.
You are going to need a Linux machine. You can either use the one used in the previous recipe or set up a new node, which will act as a repository server and host all the packages we need.
Connect to a Linux machine that has at least 5 GB disk space to store the packages.
If you are on CentOS or a similar distribution, make sure you have the package yum-utils installed. This package provides the command reposync.
Create a file bigtop.repo under /etc/yum.repos.d/. Note that the file name can be anything; only the extension must be .repo.
See the sample below for the contents of the file.
Execute the command reposync -r bigtop. It will create a directory named bigtop under the present working directory, with all the packages downloaded to it.
All the required Hadoop packages can then be installed by configuring the downloaded directory as a repository server.
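The screenshot with the repository file is not reproduced here; the following is a minimal sketch of /etc/yum.repos.d/bigtop.repo. The baseurl is an assumption; pick the Bigtop release and distribution path matching your setup from the Bigtop site:
[bigtop]
name=Apache Bigtop
enabled=1
gpgcheck=0
baseurl=http://bigtop-repos.s3.amazonaws.com/releases/1.1.0/centos/6/x86_64
After reposync -r bigtop completes, the downloaded directory can be served over HTTP and referenced from a similar .repo file on the cluster nodes.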
By following steps 2 through 6, the user will be able to configure and use the Hadoop package repository. Setting up a yum repository is not required, but it makes things easier if we have to do installations on hundreds of nodes. In larger setups, configuration management systems such as Puppet or Chef are used to push configuration and packages to the nodes.
In this chapter, we will be using the tarball package that was built in the first section to perform installations. This is the best way of learning about directory structure and the configurations needed.
Before we start with the installations, it is important to make sure that the host resolution is configured and working properly.
Choose appropriate hostnames for the Linux machines; for example, the hostnames could be master1.cluster.com, rt1.cyrus.com, or host1.example.com. The important thing is that the hostnames must resolve.
This resolution can be done using a DNS server or by configuring the /etc/hosts file on each node we use for our cluster setup.
The following steps will show you how to set up the resolution in the /etc/hosts file.
Connect to the Linux machine and change the hostname to master1.cyrus.com (a sample of both edits is shown after these steps).
Edit the /etc/hosts file accordingly.
Make sure the resolution returns an IP address:
# getent hosts master1.cyrus.com
The other preferred method is to set up DNS resolution so that we do not have to populate the hosts file on each node. In the example resolution shown here, the user can see that the DNS server is configured to answer for the domain cyrus.com:
# nslookup master1.cyrus.com
Server: 10.0.0.2
Address: 10.0.0.2#53
Non-authoritative answer:
Name: master1.cyrus.com
Address: 10.0.0.104
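The original screenshots for the hostname and hosts file edits are not reproduced; the following is a minimal sketch for CentOS 6, reusing the IP 10.0.0.104 from the nslookup output above. Adjust the IP and names for your environment:
# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=master1.cyrus.com
# hostname master1.cyrus.com
# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain
10.0.0.104   master1.cyrus.com master1
On newer distributions, the same can be achieved with hostnamectl set-hostname master1.cyrus.com.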
Each Linux host has a resolver library that helps it resolve any hostname that is asked for. The order in which the hosts file and the DNS server are consulted is controlled by /etc/nsswitch.conf; on most systems the hosts file is checked first and then the DNS server. Users who are not Linux administrators can simply use the hosts file as a workaround to set up a Hadoop cluster. There are many resources available online that can help you set up a DNS server quickly if needed.
Once the resolution is in place, we will start with the installation of Hadoop on a single-node and then progress to multiple nodes.
Usually the term cluster means a group of machines, but in this recipe, we will be installing various Hadoop daemons on a single node. The single machine will act as both the master and slave for the storage and processing layer.
You will need some information before stepping through this recipe.
Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we are using the node nn1.cluster1.com, preinstalled with CentOS 6.5.
Install the JDK, which will be used by the Hadoop services. The minimum recommended JDK version is 1.7; OpenJDK can also be used.
Log in to the machine/host as the root user and install the JDK:
# yum install jdk -y
It can also be installed directly from the RPM:
# rpm -ivh jdk-1.7u45.rpm
Once Java is installed, make sure Java is in PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable. Java is installed under /usr/java/, with /usr/java/latest pointing to the current version:
# export JAVA_HOME=/usr/java/latest
Now we need to copy the tarball hadoop-2.7.3.tar.gz, which was built in the Building and compiling Hadoop recipe earlier in this chapter, to the home directory of the root user. For this, the user needs to log in to the node where Hadoop was built and execute the following command:
# scp -r hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
Create a directory named /opt/cluster to be used for Hadoop:
# mkdir -p /opt/cluster
Then untar hadoop-2.7.3.tar.gz into the newly created directory:
# tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
Create a user named hadoop, if you haven't already, and set the password as hadoop:
# useradd hadoop
# echo hadoop | passwd --stdin hadoop
As step 6 was done by the root user, the directories and files under /opt/cluster will be owned by the root user. Change the ownership to the hadoop user:
# chown -R hadoop:hadoop /opt/cluster/
If the user lists the directory structure under /opt/cluster, he will see the extracted hadoop-2.7.3 directory in it.
The etc/hadoop directory under /opt/cluster/hadoop-2.7.3 is the one that contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which will be explained in the later sections.
The directories bin and sbin contain executable binaries, which are used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, deleting, and so on.
To execute a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This could be cumbersome, and can be avoided by setting the environment variable HADOOP_HOME. Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations:
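The environment file itself is shown as a screenshot in the original; the following is a minimal sketch of /etc/profile.d/hadoopenv.sh, consistent with the variables referred to in this chapter (HADOOP_HOME, HADOOP_CONF_DIR, YARN_HOME, YARN_CONF_DIR). The exact set of variables is an assumption and can be adapted to your layout:
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/cluster/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH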
The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, execute the command to export the variables defined in it.
Change to the hadoop user using the command su - hadoop.
Change to the /opt/cluster directory and create a symlink (ln -s hadoop-2.7.3 hadoop).
To verify that the preceding changes are in place, the user can execute either the which hadoop or which java command, or execute the command hadoop directly without specifying the complete path.
In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable to the hadoop-env.sh file.
The next thing is to set up the Namenode address, which specifies the host:port address on which it will listen. This is done using the file core-site.xml:
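The book shows core-site.xml as a screenshot; the following is a minimal sketch. The hostname is the node used in this recipe, while the port 9000 is an assumption (any free port can be chosen):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>
  </property>
</configuration>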
The important thing to keep in mind is the property fs.defaultFS.
The next thing that the user needs to configure is the location where the Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. This is configured in the file hdfs-site.xml:
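Again as a sketch, assuming /opt/cluster/nn as the dedicated metadata directory (the path is illustrative; use your own dedicated disk):
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/cluster/nn</value>
  </property>
</configuration>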
The next step is to format the Namenode. This will create an HDFS filesystem:
$ hdfs namenode -format
Similarly, we have to add the property for the Datanode data directory under hdfs-site.xml; nothing needs to be done to the core-site.xml file:
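A sketch of the Datanode entry, assuming /opt/cluster/dn as the data directory (again an illustrative path):
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/opt/cluster/dn</value>
</property>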
Then the services need to be started for Namenode and Datanode:
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
The command jps can be used to check for running daemons:
The master Namenode stores the metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.
The parameters dfs.data.dir and dfs.datanode.data.dir are used for the same purpose, but belong to different versions. The older parameters are deprecated in favor of the newer ones, but they will still work. Similarly, the parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. The intention of showing both versions of the parameters is to bring to the user's notice that parameters evolve and keep changing, and care must be taken by referring to the release notes of each Hadoop version.
In the preceding recipe, we set up the storage layer, that is, HDFS for storing data, but what about the processing layer? The data on HDFS needs to be processed to make meaningful decisions using MapReduce, Tez, Spark, or any other tool. To run MapReduce, Spark, or another processing framework, we need the ResourceManager and NodeManager daemons.
In the previous recipe, we discussed how to set up Namenode and Datanode for HDFS. In this recipe, we will be covering how to set up YARN on the same node.
After completing this recipe, there will be four daemons running on the nn1.cluster1.com node, namely the namenode, datanode, resourcemanager, and nodemanager daemons.
For this recipe, you will again use the same node on which we have already configured the HDFS layer.
Log in to the node nn1.cluster1.com and change to the hadoop user.
Change to the /opt/cluster/hadoop/etc/hadoop directory and configure the files mapred-site.xml and yarn-site.xml:
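The actual file contents appear as screenshots in the original; the following is a minimal sketch for a single-node setup. The mapreduce.framework.name and yarn.nodemanager.aux-services values are the standard ones for running MapReduce on YARN; the port 8032 for yarn.resourcemanager.address is an assumption based on the stock default:
mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>nn1.cluster1.com:8032</value>
  </property>
</configuration>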
The file yarn-site.xml specifies the shuffle class, scheduler, and resource management components of the ResourceManager. You only need to specify yarn.resourcemanager.address; the rest are picked up automatically by the ResourceManager. You can also split these into their independent component addresses.
Once the configurations are in place, the resourcemanager and nodemanager daemons need to be started:
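The start commands are not shown in the extracted text; they use the same yarn-daemon.sh script that appears in the multi-node recipe later in this chapter, and on this single-node setup both daemons run on nn1.cluster1.com:
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager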
The environment variables that were defined in /etc/profile.d/hadoopenv.sh included YARN_HOME and YARN_CONF_DIR, which let the framework know the location of the YARN configurations.
The nn1.cluster1.com node is configured to run both HDFS and YARN components. Any file that is copied to HDFS will be split into blocks and stored on the Datanode. The metadata of the file will be on the Namenode.
Any operation performed on a text file, such as word count, can be done by running a simple MapReduce program, which will be submitted to the single node cluster using the ResourceManager daemon and executed by the NodeManager. There are a lot of steps and details as to what goes on under the hood, which will be covered in the coming chapters.
A quick check can be done on the functionality of HDFS. You can create a simple text file and upload it to HDFS to see whether it is successful or not:
$ hadoop fs -put test.txt /
This will copy the file test.txt to HDFS. The file can be read directly from HDFS:
$ hadoop fs -ls /
$ hadoop fs -cat /test.txt
In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as a pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster, with each daemon running on separate nodes.
There will be one node for Namenode, one for ResourceManager, and four nodes will be used for Datanode and NodeManager. In production, the number of Datanodes could be in the thousands, but here we are just restricted to four nodes. The Datanode and NodeManager coexist on the same nodes for the purposes of data locality and locality of reference.
Make sure that the six nodes the user chooses have JDK installed, with name resolution working. This can be done by making entries in the /etc/hosts file or by using DNS.
In this recipe, we are using the following nodes:
Namenode: nn1.cluster1.com
ResourceManager: jt1.cluster1.com
Datanodes and NodeManagers: dn[1-4].cluster1.com
Make sure all the nodes have the hadoop user.
Create the directory structure /opt/cluster on all the nodes.
Make sure the ownership is correct for /opt/cluster.
Copy the /opt/cluster/hadoop-2.7.3 directory from nn1.cluster1.com to all the nodes in the cluster:
$ for i in 192.168.1.{72..75}; do scp -r /opt/cluster/hadoop-2.7.3 $i:/opt/cluster/; done
The preceding IPs belong to the nodes in the cluster; the user needs to modify them accordingly. Also, to prevent it from prompting for a password for each node, it is good to set up passphrase-less, key-based SSH access between the nodes.
Change to the directory /opt/cluster and create a symbolic link on each node:
$ ln -s hadoop-2.7.3 hadoop
Make sure that the environment variables have been set up on all nodes:
$ . /etc/profile.d/hadoopenv.sh
On the Namenode, only the parameters specific to it are needed; the file core-site.xml remains the same across all nodes in the cluster.
On the Namenode, the file hdfs-site.xml carries the Namenode-specific settings.
On the Datanodes, the file hdfs-site.xml carries the Datanode-specific settings, and the file yarn-site.xml points to the ResourceManager.
On the node jt1, which is the ResourceManager, the file yarn-site.xml carries the ResourceManager settings.
A sketch of these per-role changes is shown below.
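The exact per-role settings appear as screenshots in the original; the following is a minimal sketch, reusing the illustrative /opt/cluster/nn and /opt/cluster/dn paths from the single-node recipe and the default ResourceManager port 8032 (both are assumptions):
hdfs-site.xml on the Namenode:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/opt/cluster/nn</value>
</property>
hdfs-site.xml on the Datanodes:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/opt/cluster/dn</value>
</property>
yarn-site.xml on the Datanodes and on jt1:
<property>
  <name>yarn.resourcemanager.address</name>
  <value>jt1.cluster1.com:8032</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>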
To start Namenode on nn1.cluster1.com, enter the following:
$ hadoop-daemon.sh start namenode
To start Datanode and NodeManager on dn[1-4], enter the following:
$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start nodemanager
To start ResourceManager on jt1.cluster1.com, enter the following:
$ yarn-daemon.sh start resourcemanager
On each node, execute the command jps to see the daemons running on it, and make sure you have the correct services running on each node.
Create a text file test.txt and copy it to HDFS using hadoop fs -put test.txt /. This confirms that HDFS is working fine.
To verify that YARN has been set up correctly, run the simple Pi estimation program:
$ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.
The user should see four Datanodes reporting to the cluster, and should also be able to access the web UI of the Namenode on port 50070 and of the ResourceManager on port 8088.
To see the number of nodes talking to Namenode, enter the following:
$ hdfs dfsadmin -report
Configured Capacity: 9124708352 (21.50 GB)
Present Capacity: 5923942400 (20.52 GB)
DFS Remaining: 5923938304 (20.52 GB)
DFS Used: 4096 (4 KB)
DFS Used%: 0.00%
Live datanodes (4):
The same information can also be retrieved using the Namenode Web UI as shown in the following screenshot:

A Hadoop Gateway or edge node is a node that connects to the Hadoop cluster but does not run any of the daemons. The purpose of an edge node is to provide an access point to the cluster and prevent users from connecting directly to critical components such as the Namenode or the Datanodes.
Another important reason for its use is even data distribution across the cluster. If a user connects to a Datanode and performs the data copy operation hadoop fs -put file /, then one copy of the file will always go to the Datanode from which the copy command was executed. This results in an imbalance of data across the nodes. If we upload a file from a node that is not a Datanode, then all copies of the data will be distributed evenly across the cluster.
In this recipe, we will configure an edge node for a Hadoop cluster.
For the edge node, the user needs a separate Linux machine with Java installed and the user hadoop in place.
ssh to the new node that is to be configured as the Gateway node. For example, the node name could be client1.cluster1.com.
Set up the environment variables as discussed before. This can be done by setting up the /etc/profile.d/hadoopenv.sh file.
Copy the already configured directory hadoop-2.7.3 from the Namenode to this node (client1.cluster1.com). This avoids having to redo the configuration of files such as core-site.xml and yarn-site.xml.
The edge node just needs to know about the two master nodes, the Namenode and the ResourceManager. It does not need any other configuration for the time being, and it does not store any data locally, unlike the Namenode and Datanodes. It only needs to write temporary files and logs. In later chapters, we will see other parameters for MapReduce and performance tuning that go on this node.
Create a symbolic link (ln -s hadoop-2.7.3 hadoop) so that the commands and Hadoop configuration files are visible.
There will be no daemons started on this node. Execute a command from the edge node, such as hadoop fs -ls /, to make sure the user can connect to the cluster.
To verify that the edge node has been set up correctly, run the simple Pi estimation program from the edge node:
$ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
The edge node or Gateway node connects to the Namenode for all HDFS-related operations and to the ResourceManager for submitting jobs to the cluster.
In production, there will be more than one edge node connecting to the cluster for high availability. This can be done by using a load balancer or DNS round-robin. No user should run any local jobs on the edge nodes or use them for non-Hadoop-related tasks.
There will always be failures in clusters, such as hardware issues or the need to upgrade nodes. Removing a node should be done in a graceful manner, without any data loss.
When the Datanode daemon is stopped on a Datanode, it takes approximately ten minutes for the Namenode to mark that node as dead. This has to do with the heartbeat recheck interval. At any time, we can abruptly remove a Datanode, but that can result in data loss.
It is recommended that you opt for the graceful removal of the node from the cluster, as this ensures that all the data on that node is drained.
For the following steps, we assume that the cluster is up and running with healthy Datanodes, and that the Datanode dn1.cluster1.com needs maintenance and must be removed from the cluster. We will log in to the Namenode and make the changes there.
ssh to the Namenode and edit the file hdfs-site.xml by adding the following property to it:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>
Make sure the file excludes is readable by the user hadoop.
Restart the Namenode daemon for the property to take effect:
$ hadoop-daemon.sh stop namenode
$ hadoop-daemon.sh start namenode
A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the excludes file by simply refreshing the nodes.
Add the dn1.cluster1.com node to the excludes file:
$ cat excludes
dn1.cluster1.com
After adding the node to the file, we just need to reload the file by doing the following:
$ hadoop dfsadmin -refreshNodes
After some time, the node will be decommissioned. The time taken will vary according to the amount of data the particular Datanode holds. We can see the decommissioned nodes using the following:
$ hdfs dfsadmin -report
The preceding command will list the nodes in the cluster, and against the dn1.cluster1.com node we can see that its status is either decommissioning or decommissioned.
Let's have a look at what we did throughout this recipe.
In steps 1 through 6, we added the new property to the hdfs-site.xml file and then restarted the Namenode to make it aware of the changes. Once the property is in place, the Namenode is aware of the excludes file, and it can be asked to re-read it by simply refreshing the node list, as done in step 6.
With these steps, the data on the Datanode dn1.cluster1.com will be moved to other nodes in the cluster, and once the data has been drained, the Datanode daemon on the node will be shut down. During the process, the node's status changes from normal to decommissioning and then to decommissioned.
Care must be taken while decommissioning nodes in the cluster. The user should not decommission multiple nodes at a time, as this will generate a lot of replication traffic and can cause congestion and data loss.
Over a period of time, our cluster will grow in data and there will be a need to increase the capacity of the cluster by adding more nodes.
We can add Datanodes to the cluster in the same way that we first configured a Datanode and started the Datanode daemon on it. But the important thing to keep in mind is that not every node should be able to become part of the cluster. It should not be possible for just anyone to start a Datanode daemon on his laptop and join the cluster, as that would be disastrous. By default, there is nothing preventing any node from becoming a Datanode, as the user only has to untar the Hadoop package, point the file core-site.xml to the Namenode, and start the Datanode daemon.
For the following steps, we assume that the cluster is up and running with healthy Datanodes and that we need to add a new Datanode to the cluster. We will log in to the Namenode and make the changes there.
ssh to the Namenode and edit the file hdfs-site.xml to add the following property to it:
<property>
  <name>dfs.hosts</name>
  <value>/home/hadoop/includes</value>
  <final>true</final>
</property>
Make sure the file includes is readable by the user hadoop.
Restart the Namenode daemon for the property to take effect:
$ hadoop-daemon.sh stop namenode
$ hadoop-daemon.sh start namenode
A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the includes file by simply refreshing the nodes.
Add the dn1.cluster1.com node to the includes file:
$ cat includes
dn1.cluster1.com
The includes or excludes file can contain a list of multiple nodes, one node per line.
$ hadoop dfsadmin -refreshNodes
After some time, the node will be available in the cluster and can be seen:
$ hdfs dfsadmin -report
The file /home/hadoop/includes will contain a list of all the Datanodes that are allowed to join the cluster. If the includes file is blank, then all Datanodes are allowed to join the cluster. If both an includes and an excludes file are present, the lists of nodes in the two files must be mutually exclusive. So, to decommission the node dn1.cluster1.com from the cluster, it must be removed from the includes file and added to the excludes file.
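As a brief illustration of this interplay, assuming the dn[1-4] Datanodes from the multi-node recipe and following the description above for decommissioning dn1.cluster1.com, the two files and the refresh would look like this:
$ cat /home/hadoop/includes
dn2.cluster1.com
dn3.cluster1.com
dn4.cluster1.com
$ cat /home/hadoop/excludes
dn1.cluster1.com
$ hdfs dfsadmin -refreshNodes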