(For more resources related to this topic, see here.)
We will configure our NameNode HA setup by adding several options to the core-site.xml file. The following is the structure of the file for this particular step. It will give you an idea of the XML structure, if you are not familiar with it. The header comments are stripped out:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.default.name</name> <value>hdfs://sample-cluster/</value> </property> <property> <name>ha.zookeeper.quorum</name> <value>nn1.hadoop.test.com:2181,nn2.hadoop.test.com:2181,jt1.hadoop. test.com:2181 </value> </property> </configuration>
The configuration file format is pretty much self-explanatory; variables are surrounded by the <property> tag, and each variable has a name and a value.
There are only two variables that we need to add at this stage. fs.default.name is the logical name of the NameNode cluster. The value hdfs://sample-cluster/ is specific to the HA setup. This is the logical name of the NameNode cluster. We will define the servers that comprise of it in the hdfs-site.xml file. In a non-HA setup, this variable is assigned a host and a port of the NameNode, since there is only one NameNode in the cluster.
The ha.zookeeper.quorum variable specifies locations and ports of the ZooKeeper servers. The ZooKeeper cluster can be used by other services, such as HBase, that is why it is defined in core-site.xml.
The next step is to configure the hdfs-site.xml file and add all HDFS-specific parameters there. I will omit the <property> tag and only include <name> and <value> to make the list less verbose.
NameNode will use the location specified by the dfs.name.dir variable to store the persistent snapshot of HDFS metadata. This is where the fsimage file will be stored. As discussed previously, the volume on which this directory resides needs to be backed by RAID. Losing this volume means losing NameNode completely. The /dfs/nn path is an example, however you are free to choose your own. You can actually specify several paths with a dfs.name.dir value, separating them by commas. NameNode will mirror the metadata files in each directory specified. If you have a shared network storage available, you can use it as one of the destinations for HDFS metadata. This will provide additional offsite backups.
The dfs.nameservices variable specifies the logical name of the NameNode cluster and should be replaced by something that makes sense to you, such as prod-cluster or stage-cluster. The value of dfs.nameservices must match the value of fs.default.name from the core-site.xml file.
Here, we specify the NameNodes that make up our HA cluster setup. These are logical names, not real server hostnames or IPs. These logical names will be referenced in other configuration variables.
<name>dfs.namenode.rpc-address.sample-cluster.nn1</name> <value>nn1.hadoop.test.com:8020</value> <name>dfs.namenode.rpc-address.sample-cluster.nn2</name> <value>nn2.hadoop.test.com:8020</value>
This pair of variables provide mapping from logical names like nn1 and nn2 to the real host and port value. By default, NameNode daemons use port 8020 for communication with clients and each other. Make sure this port is open for the cluster nodes.
<name>dfs.namenode.http-address.sample-cluster.nn1</name> <value>nn1.hadoop.test.com:50070</value> <name>dfs.namenode.http-address.sample-cluster.nn2</name> <value>nn2.hadoop.test.com:50070</value>
Each NameNode daemon runs a built-in HTTP server, which will be used by the NameNode web interface to expose various metrics and status information about HDFS operations. Additionally, standby NameNode uses HTTP calls to periodically copy the fsimage file from the primary server, perform the checkpoint operation, and ship it back.
<name>dfs.namenode.shared.edits.dir</name> <value>qjournal://nn1.hadoop.test.com:8485;nn2.hadoop.test.com:8485; jt1.hadoop.test.com:8485/sample-cluster</value>
The dfs.namenode.shared.edits.dir variable specifies the setup of the JournalNode cluster. In our configuration, there are three JournalNodes running on nn1, nn2, and nn3. Both primary and standby nodes will use this variable to identify which hosts they should contact to send or receive new changes from editlog.
JournalNodes need to persist editlog changes that are being submitted to them by the active NameNode. The dfs.journalnode.edits.dir variable specifies the location on the local filesystem where editlog changes will be stored. Keep in mind that this path must exist on all JournalNodes and the ownership of all directories must be set to hdfs:hdfs (user and group).
<name>dfs.client.failover.proxy.provider.sample-cluster</name> <value>org.apache.hadoop.hdfs.server.namenode.ha. ConfiguredFailoverProxyProvider</value>
In an HA setup, clients that access HDFS need to know which NameNode to contact for their requests. The dfs.client.failover.proxy.provider.sample-cluster variable specifies the Java class name, which will be used by clients for determining the active NameNode.
At the moment, there is only ConfiguredFailoverProxyProvider available.
The dfs.ha.automatic-failover.enabled variable indicates if the NameNode cluster will use a manual or automatic failover.
<name>dfs.ha.fencing.methods</name> <value>sshfence shell(/bin/true) </value>
Orchestrating failover in a cluster setup is a complicated task involving multiple steps. One of the common problems that is not unique to the Hadoop cluster, but affects any distributed systems, is a "split-brain" scenario. Split-brain is a case where two NameNodes decide they both play an active role and start writing changes to the editlog. To prevent such an issue from occurring, the HA configuration maintains a marker in ZooKeeper, clearly stating which NameNode is active, and JournalNodes accepts writes only from that node. To be absolutely sure that the two NameNodes don't become active at the same time, a technique called fencing is used during failover. The idea is to force the shutdown of the active NameNode before transferring the active state to a standby.
There are two fencing methods currently available: sshfence and shell. sshfence. These require a passwordless ssh access as a user that starts the NameNode daemon, from the active NameNode to the standby and vice versa. By default, this is the hdfs user. The fencing process checks if there is anyone listening on a NameNode port using the nc command, and if the port is found busy, it tries to kill the NameNode process. Another option for dfs.ha.fencing.methods is shell. This will execute the specified shell script to perform fencing. It is important to understand that failover will fail if fencing fails. In our case, we specified two options, the second one always returns success. This is done for workaround cases where the primary NameNode machine goes down and the ssh method will fail, and no failover will be performed. We want to avoid this, so the second option would be to failover anyway, even without fencing, which, as already mentioned, is safe with our setup. To achieve this, we specify two fencing methods, which will be tried by ZKFC in the order of: if the first one fails, the second one will be tried. In our case, the second one will always return success and failover will be initiated, even if the server running the primary NameNode is not available via ssh.
The last option we will need to configure for NameNode HA setup is the ssh key, which will be used by sshfence. Make sure you change the ownership for this file to hdfs user. Two keys need to be generated, one for the primary and one for the secondary NameNode. It is a good idea to test ssh access as an hdfs user in both directions to make sure it is working fine.
The hdfs-site.xml configuration file is now all set for testing the HA setup. Don't forget to sync these configuration files to all the nodes in the cluster. The next thing that needs to be done is to start JournalNodes. Execute this command on nn1, nn2, and jt1 a root user:
# service hadoop-hdfs-journalnode start
With CDH, it is recommended to always use the service command instead of calling scripts in /etc/init.d/ directly. This is done to guarantee that all environment variables are set up properly before the daemon is started. Always check the logfiles for daemons.
Now, we need to initially format HDFS. For this, run the following command on nn1:
# sudo -u hdfs hdfs namenode –format
This is the initial setup of the NameNode, so we don't have to worry about affecting any HDFS metadata, but be careful with this command, because it will destroy any previous metadata entries. There is no strict requirement to run format command on nn1, but to make it easier to follow, let's assume we want nn1 to become an active NameNode. Format command will also format the storage for the JournalNodes.
The next step is to create an entry for the HA cluster in ZooKeeper, and start NameNode and ZKFC on the first NameNode. In our case, this is nn1:
# sudo -u hdfs hdfs zkfc -formatZK # service hadoop-hdfs-namenode start # service hadoop-hdfs-zkfc start
Check the ZKFC log file (by default, it is in /var/log/hadoop-hdfs/) to make sure nn1 is now an active NameNode:
INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at nn1.hadoop.test.com/192.168.0.100:8020 active... INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at nn1.hadoop.test.com/192.168.0.100:8020 to active state
To activate the secondary NameNode, an operation called bootstrapping needs to be performed. To do this, execute the following command on nn2:
# sudo -u hdfs hdfs namenode –bootstrapStandby
This will pull the current filesystem state from active NameNode and synchronize the secondary NameNode with the JournalNodes Quorum.
Now, you are ready to start the NameNode daemon and the ZKFC daemon on nn2. Use the same commands that you used for nn1. Check the ZKFC log file to make sure nn2 successfully acquired the secondary NameNode role. You should see the following messages at the end of the logfile:
INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at nn2.hadoop.test.com/192.168.0.101:8020 should become standby INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at nn2.hadoop.test.com/192.168.0.101:8020 to standby state
This is the last step in configuring NameNode HA. It is a good idea to verify if automatic failover is configured correctly, and if it will behave as expected in the case of a primary NameNode outage. Testing failover in the cluster setup stage is easier and safer than discovering that failover doesn't work during production stage and causing a cluster outage. You can perform a simple test: kill the primary NameNode daemon and verify if the secondary takes over its role. After that, bring the old primary back online and make sure it takes over the secondary role.
You can use execute the following command to get the current status of NameNode nn1:
# sudo -u hdfs hdfs haadmin -getServiceState nn1
The hdfs haadmin command can also be used to initiate a failover in manual failover setup.
At this point, you have a fully configured and functional NameNode HA setup.
We saw in this article how to configure Hadoop's NameNode HA.
Resources for Article:
- Advanced Hadoop MapReduce Administration [Article]
- Managing a Hadoop Cluster [Article]
- Making Big Data Work for Hadoop and Solr [Article]