Reader small image

You're reading from  Hadoop 2.x Administration Cookbook

Product typeBook
Published inMay 2017
PublisherPackt
ISBN-139781787126732
Edition1st Edition
Tools
Right arrow
Author (1)
Aman Singh
Aman Singh
author image
Aman Singh

Gurmukh Singh is a seasoned technology professional with 14+ years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo. He has authored Monitoring Hadoop by Packt Publishing
Read more about Aman Singh

Right arrow

Configuring the Hadoop Gateway node


Hadoop Gateway or edge node is a node that connects to the Hadoop cluster, but does not run any of the daemons. The purpose of an edge node is to provide an access point to the cluster and prevent users from a direct connection to critical components such as Namenode or Datanode.

Another important reason for its use is the data distribution across the cluster. If a user connects to a Datanode and performs the data copy operation hadoop fs –put file /, then one copy of the file will always go to the Datanode from which the copy command was executed. This will result in an imbalance of data across the node. If we upload a file from a node that is not a Datanode, then data will be distributed evenly for all copies of data.

In this recipe, we will configure an edge node for a Hadoop cluster.

Getting ready

For the edge node, the user needs a separate Linux machine with Java installed and the user hadoop in place.

How to do it...

  1. ssh to the new node that is to be configured as Gateway node. For example, the node name could be client1.cluster1.com.

  2. Set up the environment variable as discussed before. This can be done by setting the /etc/profile.d/hadoopenv.sh file.

  3. Copy the already configured directory hadoop-2.7.3 from Namenode to this node (client1.cluster1.com). This avoids doing all the configuration for files such as core-site.xml and yarn-site.xml.

  4. The edge node just needs to know about the two master nodes of Namenode and ResourceManager. It does not need any other configuration for the time being. It does not store any data locally, unlike Namenode and Datanode.

  5. It only needs to write temporary files and logs. In later chapters, we will see other parameters for MapReduce and performance tuning that go on this node.

  6. Create a symbolic link ln –s hadoop-2.7.3 hadoop so that the commands and Hadoop configuration files are visible.

  7. There will be no daemon started on this node. Execute a command from the edge node to make sure the user can connect to hadoop fs –ls /.

  8. To verify that the edge node has been set up correctly, run the simple "Pi" estimation program from the edge node:

    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-example.jar Pi 3 3
    

How it works...

The edge node or the Gateway node connects to Namenode for all HDFS-related operation and connects to ResourceManager for submitted jobs to the cluster.

In production, there will be more than one edge node connecting to the cluster for high availability. This is can be done by using a load balancer or DNS round-robin. No user should run any local jobs on the edge nodes or use it for doing non Hadoop-related tasks.

See also

Edge node can be used to configure many additional components, such as PIG, Hive, Sqoop, rather than installing them on the main cluster nodes like Namenode, Datanode. This way it is easy to segregate the complexity and restrict access to just edge node.

  • The Configuring Hive recipe

Previous PageNext Page
You have been reading a chapter from
Hadoop 2.x Administration Cookbook
Published in: May 2017Publisher: PacktISBN-13: 9781787126732
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Aman Singh

Gurmukh Singh is a seasoned technology professional with 14+ years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo. He has authored Monitoring Hadoop by Packt Publishing
Read more about Aman Singh