Advanced Hadoop MapReduce Administration

by Srinath Perera and Thilina Gunarathne | April 2013

In this article by Srinath Perera and Thilina Gunarathne, authors of Hadoop MapReduce Cookbook, we will cover:

  • Tuning Hadoop configurations for cluster deployments

  • Running benchmarks to verify the Hadoop installation

  • Reusing Java VMs to improve the performance

  • Fault tolerance and speculative execution

  • Debug scripts – analyzing task failures

  • Setting failure percentages and skipping bad records

  • Shared-user Hadoop clusters – using fair and other schedulers

  • Hadoop security – integrating with Kerberos

  • Using the Hadoop Tool interface


Tuning Hadoop configurations for cluster deployments

Getting ready

Shut down the Hadoop cluster if it is already running, by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it...

We can control Hadoop configurations through the following three configuration files:

  • conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution

  • conf/hdfs-site.xml: This contains configurations for HDFS

  • conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag.

<configuration>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>20</value>
</property>
...
</configuration>

The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks:

  1. Create a directory to store the logfiles. For example, /root/hadoop_logs.

  2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory.

  3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:

    <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
    </property>
    <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
    </property>

  4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.

  5. You can verify the number of processes created using OS process monitoring tools. If you are in Linux, run the watch "ps -ef | grep hadoop" command. If you are in Windows, use the Task Manager; in macOS, use the Activity Monitor.

How it works...

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.
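
You can also read the effective values back programmatically to confirm that your changes took effect after the restart. The following is a minimal sketch, not part of the original recipe; the class name is ours, and it assumes the conf directory is on the classpath:

import org.apache.hadoop.conf.Configuration;

// A minimal sketch (not from the recipe): print the effective TaskTracker
// slot settings after the *-site.xml files are loaded from the classpath.
public class PrintTaskSlots {
  public static void main(String[] args) {
    Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
    conf.addResource("mapred-site.xml");      // also load the MapReduce site file
    System.out.println("map slots per TaskTracker: "
        + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    System.out.println("reduce slots per TaskTracker: "
        + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
  }
}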

There's more...

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed below:

  • fs.inmemory.size.mb (default: 100): This is the amount of memory in MB allocated to the in-memory filesystem that is used to merge map outputs at the reducers.

  • io.sort.factor (default: 100): This is the maximum number of streams merged at once while sorting files.

  • io.file.buffer.size (default: 131072): This is the size of the read/write buffer used by sequence files.

The configuration properties for conf/mapred-site.xml are listed below:

  • mapred.reduce.parallel.copies (default: 5): This is the maximum number of parallel copies the reduce step will execute to fetch the map outputs.

  • mapred.map.child.java.opts (default: -Xmx200M): This is for passing Java options into the map JVM.

  • mapred.reduce.child.java.opts (default: -Xmx200M): This is for passing Java options into the reduce JVM.

  • io.sort.mb (default: 200): This is the memory limit in MB while sorting data.

The configuration properties for conf/hdfs-site.xml are listed below:

  • dfs.block.size (default: 67108864): This is the HDFS block size.

  • dfs.namenode.handler.count (default: 40): This is the number of server threads that handle RPC calls in the NameNode.
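
Client-side properties such as io.sort.mb, io.sort.factor, and mapred.reduce.parallel.copies can also be overridden per job instead of editing the site files; server-side properties such as dfs.namenode.handler.count must still go in the *-site.xml files. The following is a minimal sketch of a per-job override using the old org.apache.hadoop.mapred API; the class name and the chosen values are only examples:

import org.apache.hadoop.mapred.JobConf;

// A minimal sketch (not from the book): override a few client-side properties
// for a single job instead of editing the *-site.xml files.
public class PerJobTuning {
  public static JobConf tunedConf() {
    JobConf conf = new JobConf();
    conf.setInt("io.sort.factor", 100);                    // streams merged at once while sorting
    conf.setInt("io.sort.mb", 200);                        // sort buffer size in MB
    conf.setInt("mapred.reduce.parallel.copies", 20);      // parallel fetches during the shuffle
    conf.set("mapred.map.child.java.opts", "-Xmx512M");    // heap for map task JVMs
    conf.set("mapred.reduce.child.java.opts", "-Xmx512M"); // heap for reduce task JVMs
    return conf;
  }
}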

Running benchmarks to verify the Hadoop installation

The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance. This recipe introduces these benchmarks and explains how to run them.

Getting ready

Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it...

Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job and then sort it using the sort sample.

  1. Change the directory to HADOOP_HOME.

  2. Run the randomwriter Hadoop job using the following command:

    >bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

    Here, the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the amount of data generated by each map and the number of maps per host, respectively.

  3. Run the sort program:

    >bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data
    /data/sorted-data

  4. Verify the final results by running the following command:

    >bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data

Finally, when everything is successful, the following message will be displayed:

The job took 66 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works...

First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There's more...

Hadoop includes several other benchmarks.

  • TestDFSIO: This tests the input output (I/O) performance of HDFS

  • nnbench: This checks the NameNode hardware

  • mrbench: This runs many small jobs

  • TeraSort: This sorts one terabyte of data

More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.

Reusing Java VMs to improve the performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it...

  1. Run the WordCount sample by passing the following option as an argument:

    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1

  2. Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command in Unix or the Task Manager in Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

    However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.

How it works...

By setting the mapred.job.reuse.jvm.num.tasks job configuration property, we can control the number of tasks that a JVM launched by Hadoop will run. When the value is set to -1, Hadoop reuses the same JVM for an unlimited number of tasks of the job.
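
For jobs that do not implement the Tool interface, the same setting can be made in code through the JobConf API. The following is a minimal sketch; the class and method names are ours:

import org.apache.hadoop.mapred.JobConf;

// A minimal sketch: enable JVM reuse on a JobConf directly, which is
// equivalent to passing -Dmapred.job.reuse.jvm.num.tasks=-1 on the command line.
public class JvmReuseExample {
  public static void enableJvmReuse(JobConf conf) {
    // -1 means each task JVM is reused for an unlimited number of tasks of this job.
    conf.setNumTasksToExecutePerJvm(-1);
  }
}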


Fault tolerance and speculative execution

The primary advantage of using Hadoop is its support for fault tolerance. When you run a job, especially a large job, parts of the execution can fail due to external causes such as network failures, disk failures, and node failures.

When a job has been started, the Hadoop JobTracker monitors the TaskTrackers to which it has submitted the tasks of the job. If any TaskTracker is not responsive, Hadoop will resubmit the tasks handled by the unresponsive TaskTracker to a new TaskTracker.

Generally, a Hadoop system may be composed of heterogeneous nodes, and as a result there can be very slow nodes as well as fast nodes. Potentially, a few slow nodes can slow down an execution significantly.

To avoid this, Hadoop supports speculative execution. If most of the map tasks have completed and Hadoop is waiting for a few remaining map tasks, the JobTracker starts these pending tasks on other nodes as well. It uses the results from whichever copy finishes first and stops the other identical tasks.

However, this model is feasible only if the map tasks are free of side effects. If such parallel executions are undesirable, Hadoop lets users turn off speculative execution.

How to do it...

Run the WordCount sample by passing the following option as an argument to turn off the speculative executions:

bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.map.tasks.speculative.execution=false -Dmapred.reduce.tasks.speculative.execution=false /data/input1 /data/output1

However, this only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the parameter through JobConf.set(name, value).

How it works...

When these options are specified and set to false, Hadoop turns off speculative execution for the job. Otherwise, it performs speculative execution by default.
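
For jobs that do not implement the Tool interface, you can make the same change through the JobConf API. The following is a minimal sketch; the class and method names are ours:

import org.apache.hadoop.mapred.JobConf;

// A minimal sketch: disable speculative execution in code instead of passing
// the -D options on the command line.
public class DisableSpeculation {
  public static void disableSpeculation(JobConf conf) {
    conf.setMapSpeculativeExecution(false);    // mapred.map.tasks.speculative.execution=false
    conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution=false
  }
}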

Debug scripts – analyzing task failures

A Hadoop job may consist of many map tasks and reduce tasks. Therefore, debugging a Hadoop job is often a complicated process. It is a good practice to first test a Hadoop job using unit tests by running it with a subset of the data.

However, sometimes it is necessary to debug a Hadoop job in a distributed mode. To support such cases, Hadoop provides a mechanism called debug scripts. This recipe explains how to use debug scripts.

Getting ready

Start the Hadoop cluster.

How to do it...

A debug script is a shell script that Hadoop executes whenever a task encounters an error. The script has access to the $script, $stdout, $stderr, $syslog, and $jobconf values populated by Hadoop. You can find a sample script in resources/chapter3/debugscript. We can use debug scripts to copy all the logfiles to a single location, e-mail them to a single e-mail account, or perform some analysis.

LOG_FILE=HADOOP_HOME/error.log
echo "Run the script" >> $LOG_FILE
echo $script >> $LOG_FILE
echo $stdout >> $LOG_FILE
echo $stderr >> $LOG_FILE
echo $syslog >> $LOG_FILE
echo $jobconf >> $LOG_FILE

  1. Write your own debug script using the above example. In the above example, edit HADOOP_HOME to point to your HADOOP_HOME directory.

    src/chapter3/WordcountWithDebugScript.java extends the WordCount sample to use debug scripts. The following listing shows the code.

    The following code uploads the job scripts to HDFS and configures the job to use these scripts. Also, it sets up the distributed cache.

    private static final String scriptFileLocation =
        "resources/chapter3/debugscript";

    public static void setupFailedTaskScript(JobConf conf) throws Exception {
      // Create a directory on HDFS where we'll upload the fail scripts
      FileSystem fs = FileSystem.get(conf);
      Path debugDir = new Path("/debug");
      // Who knows what's already in this directory; let's just clear it
      if (fs.exists(debugDir)) {
        fs.delete(debugDir, true);
      }
      // ...and then make sure it exists again
      fs.mkdirs(debugDir);
      // Upload the local script into HDFS
      fs.copyFromLocalFile(new Path(scriptFileLocation),
          new Path("/debug/fail-script"));
      conf.setMapDebugScript("./fail-script");
      conf.setReduceDebugScript("./fail-script");
      DistributedCache.createSymlink(conf);
      URI fsUri = fs.getUri();
      String mapUriStr = fsUri.toString() + "/debug/fail-script#fail-script";
      URI mapUri = new URI(mapUriStr);
      DistributedCache.addCacheFile(mapUri, conf);
    }

    The following code runs the Hadoop job. The only difference is that here, we have called the preceding method to configure the failed task scripts.

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      setupFailedTaskScript(conf);
      Job job = new Job(conf, "word count");
      job.setJarByClass(FaultyWordCount.class);
      job.setMapperClass(FaultyWordCount.TokenizerMapper.class);
      job.setReducerClass(FaultyWordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }

  2. Compile the code base by running Ant from the home directory of the source code. Copy build/lib/hadoop-cookbook-chapter3.jar to HADOOP_HOME.

  3. Then run the job by running the following command:

    >bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithDebugScript /data/input /data/output1

    The job will run the FaultyWordCount task, which always fails. Then Hadoop executes the debug script, and you can find the results of the debug script in HADOOP_HOME.

How it works...

We configured the debug script through conf.setMapDebugScript("./fail-script"). However, the value we pass is not the file location, but the command that needs to be run on the machine when an error occurs. If you have a specific script that is already present on all machines and that you want to run when an error occurs, you can just pass its path to the conf.setMapDebugScript() method.

But, Hadoop runs the mappers in multiple nodes, and often in a machine different than the machine running the job's client. Therefore, for the debug script to work, we need to get the script to all the nodes running the mapper.

We do this using the distributed cache. Users can add files that are in HDFS to the distributed cache, and Hadoop automatically copies those files to each node that runs map tasks. However, the distributed cache copies the files to the mapred.local.dir of the MapReduce setup, while the job runs from a different location. Therefore, we link the cache directory to the working directory by creating a symlink using the DistributedCache.createSymlink(conf) command.

Then Hadoop copies the script file to each mapper node and symlinks it to the working directory of the job. When an error occurs, Hadoop runs the ./fail-script command, which executes the script file that was copied to the node through the distributed cache. The debug script then carries out the tasks you have programmed for when an error occurs.

Setting failure percentages and skipping bad records

When processing a large amount of data, there may be cases where a small number of map tasks fail, yet the final results still make sense without the failed tasks. This can happen due to a number of reasons such as:

  • Bugs in the map task

  • A small percentage of the data records are not well formed

  • Bugs in third-party libraries

In the first case, it is best to debug, find the cause for failures, and fix it. However, in the second and third cases, such errors may be unavoidable. It is possible to tell Hadoop that the job should succeed even if some small percentage of map tasks have failed.

This can be done in two ways:

  • Setting the failure percentages

  • Asking Hadoop to skip bad records

This recipe explains how to configure this behavior.

Getting ready

Start the Hadoop setup.

How to do it...

Run the WordCount sample by passing the following options:

>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount
-Dmapred.skip.map.max.skip.records=1
-Dmapred.skip.reduce.max.skip.groups=1 /data/input1 /data/output1

However, this only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set it through JobConf.set(name, value).

How it works...

Hadoop does not support skipping bad records by default. We can turn on bad record skipping by setting the following parameters to positive values:

  • mapred.skip.map.max.skip.records: This sets the maximum number of records that may be skipped around a bad record, including the bad record

  • mapred.skip.reduce.max.skip.groups: This sets the maximum number of key groups that may be skipped around a bad group

There's more...

You can also limit the percentage of failures in map or reduce tasks by setting the JobConf.setMaxMapTaskFailuresPercent(percent) and JobConf.setMaxReduceTaskFailuresPercent(percent) options.

Also, Hadoop repeats a task in case of a failure. You can control the number of attempts through JobConf.setMaxMapAttempts(5).
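
The following is a minimal sketch of how these tolerances can be set in code for jobs that do not implement the Tool interface, using the old org.apache.hadoop.mapred API; the class name and the chosen limits are only examples:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

// A minimal sketch: combine record skipping, failure percentages, and retry
// limits on a single JobConf.
public class FailureTolerance {
  public static void configure(JobConf conf) {
    // Skip at most one record/group around a bad record once skipping mode starts.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
    SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
    // Let the job succeed even if up to 10 percent of the map or reduce tasks fail.
    conf.setMaxMapTaskFailuresPercent(10);
    conf.setMaxReduceTaskFailuresPercent(10);
    // Retry each failed map task up to five times before giving up on it.
    conf.setMaxMapAttempts(5);
  }
}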


Shared-user Hadoop clusters – using fair and other schedulers

When a user submits a job to Hadoop, this job needs to be assigned a resource (a computer/host) before execution. This process is called scheduling, and a scheduler decides when resources are assigned to a given job.

Hadoop is by default configured with a First in First out (FIFO) scheduler, which executes jobs in the same order as they arrive. However, for a deployment that is running many MapReduce jobs and shared by many users, more complex scheduling policies are needed.

The good news is that the Hadoop scheduler is pluggable, and Hadoop comes with two other schedulers. Therefore, if required, it is possible to write your own scheduler as well.

  • Fair scheduler: This defines pools, and over time, each pool gets around the same amount of resources.

  • Capacity scheduler: This defines queues, and each queue has a guaranteed capacity. The capacity scheduler shares computer resources allocated to a queue with other queues if those resources are not in use.

This recipe describes how to change the scheduler in Hadoop.

Getting ready

For this recipe, you need a working Hadoop deployment. Set up Hadoop.

How to do it...

  1. Shut down the Hadoop cluster.

  2. You need the hadoop-fairscheduler-1.0.0.jar file in HADOOP_HOME/lib. In Hadoop 1.0.0 and later releases, this JAR file is already in the right place in the Hadoop distribution.

  3. Add the following code to the HADOOP_HOME/conf/mapred-site.xml:

    <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

  4. Restart Hadoop.

  5. Verify that the new scheduler has been applied by going to http://<JobTracker-host>:50030/scheduler in your installation. If the scheduler has been properly applied, the page will have the heading "Fair Scheduler Administration".

How it works...

When you follow the preceding steps, Hadoop loads the new scheduler settings when it starts. The fair scheduler shares an equal amount of resources among users unless it has been configured otherwise.

The fair scheduler can be configured in two ways. There are several parameters of the mapred.fairscheduler.* form, and we can set these parameters via HADOOP_HOME/conf/mapred-site.xml. Additional parameters can be configured via HADOOP_HOME/conf/fair-scheduler.xml. More details about the fair scheduler can be found in HADOOP_HOME/docs/fair_scheduler.html.
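
Jobs can also request a specific pool when they are submitted. The following is a minimal sketch that assumes the fair scheduler's default configuration, where the mapred.fairscheduler.pool property selects the pool; the class name and the pool name research are hypothetical:

import org.apache.hadoop.mapred.JobConf;

// A minimal sketch: ask the fair scheduler to place this job in a particular
// pool. The pool name "research" is hypothetical.
public class PoolAssignment {
  public static void useResearchPool(JobConf conf) {
    conf.set("mapred.fairscheduler.pool", "research");
  }
}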

There's more...

Hadoop also includes another scheduler called the capacity scheduler, which provides more fine-grained control than the fair scheduler. More details about the capacity scheduler can be found in HADOOP_HOME/docs/capacity_scheduler.html.

Hadoop security – integrating with Kerberos

Hadoop by default runs without security. However, it also supports Kerberos-based setup, which provides full security. This recipe describes how to configure Hadoop with Kerberos for security.

Kerberos setups will include a Hadoop cluster—NameNode, DataNodes, JobTracker, and TaskTrackers—and a Kerberos server. We will define users as principals in the Kerberos server. Users can obtain a ticket from the Kerberos server, and use that ticket to log in to any server in Hadoop. We will map each Kerberos principal with a Unix user. Once logged in, the authorization is performed based on the Unix user and group permissions associated with each user.

Getting ready

Set up Hadoop. We need a machine to use as the Kerberos node, for which you have root access. Furthermore, the machine should have a domain name already configured (we will assume the DNS name is hadoop.kbrelam.com, but you can replace it with another domain). If you want to try this out on a single machine, you can set up the DNS name by adding a line with your IP address and hadoop.kbrelam.com to your /etc/hosts file.

How to do it...

  1. Install Kerberos on your machine. Refer to http://web.mit.edu/Kerberos/krb5-1.8/krb5-1.8.6/doc/krb5-install.html for further instructions on setting up Kerberos.

    Provide hadoop.kbrelam.com as the realm and the administrative server when the installation asks for it. Then run the following command to create a realm:

    >sudo krb5_newrealm

  2. In Kerberos, we call users "principals". Create a new principal by running the following commands:

    >kadmin.local
    >kadmin.local: addprinc srinath/admin

  3. Edit /etc/krb5kdc/kadm5.acl to include the line srinath/admin@hadoop.kbrelam.com * to grant all the permissions.

  4. Restart the Kerberos server by running the following command:

    >sudo /etc/init.d/krb5-admin-server restart

  5. You can test the new principal by running the following commands:

    >kinit srinath/admin
    >klist

  6. Hadoop will map Kerberos principals to Unix users on the Hadoop machines and use the local Unix-level user permissions for authorization. Create the following users and groups, with the permissions shown, on all the machines on which you plan to run MapReduce.

    We will have three users: hdfs to run the HDFS server, mapred to run the MapReduce server, and bob to submit jobs.

    >groupadd hadoop
    >useradd hdfs
    >useradd mapred
    >usermod -g hadoop hdfs
    >usermod -g hadoop mapred
    >useradd -G mapred bob
    >usermod -a -G hadoop bob

  7. Now let us create Kerberos principals for these users.

    >kadmin.local
    >kadmin.local: addprinc -randkey hdfs/hadoop.kbrelam.com
    >kadmin.local: addprinc -randkey mapred/hadoop.kbrelam.com
    >kadmin.local: addprinc -randkey host/hadoop.kbrelam.com
    >kadmin.local: addprinc -randkey bob/hadoop.kbrelam.com

  8. Now, we will create a key tab file that contains credentials for Kerberos principals. We will use these credentials to avoid entering the passwords at Hadoop startup.

    >kadmin: xst -norandkey -k hdfs.keytab hdfs/hadoop.kbrelam.com host/hadoop.kbrelam.com
    >kadmin: xst -norandkey -k mapred.keytab mapred/hadoop.kbrelam.com host/hadoop.kbrelam.com
    >kadmin.local: xst -norandkey -k bob.keytab bob/hadoop.kbrelam.com
    >kadmin.local: exit

  9. Deploy the keytab files by moving them into the HADOOP_HOME/conf directory. Change the directory to HADOOP_HOME and run the following commands to set the permissions for the keytab files:

    >chown hdfs:hadoop conf/hdfs.keytab
    >chown mapred:hadoop conf/mapred.keytab

  10. Now, set permissions in the filesystem and Hadoop. Change the directory to HADOOP_HOME and run the following commands:

    >chown hdfs:hadoop /opt/hadoop-work/name/
    >chown hdfs:hadoop /opt/hadoop-work/data
    >chown mapred:hadoop /opt/hadoop-work/local/
    >bin/hadoop fs -chown hdfs:hadoop /
    >bin/hadoop fs -chmod 755 /
    >bin/hadoop fs -mkdir /mapred
    >bin/hadoop fs -mkdir /mapred/system/
    >bin/hadoop fs -chown mapred:hadoop /mapred/system
    >bin/hadoop fs -chmod -R 700 /mapred/system
    >bin/hadoop fs -chmod 777 /tmp

  11. Install Unlimited Strength Java Cryptography Extension (JCE) Policy Files by downloading the policy files from http://www.oracle.com/technetwork/java/javase/downloads/index.html and copying the JAR files in the distribution to JAVA_HOME/jre/lib/security.

  12. Configure Hadoop properties by adding the following properties to the associated configuration files. Replace the HADOOP_HOME value with the corresponding location. Here, Hadoop will replace _HOST with the localhost name. The following code snippet adds properties to core-site.xml:

    <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
    </property>
    <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
    </property>

  13. Copy the configuration parameters defined in resources/chapter3/kerberos-hdfs-site.xml of the source code for this chapter to HADOOP_HOME/conf/hdfs-site.xml. Replace the HADOOP_HOME value with the corresponding location. Here, Hadoop will replace _HOST with the localhost name.

  14. Start the NameNode by running the following command from HADOOP_HOME:

    >sudo -u hdfs bin/hadoop namenode &

  15. Test HDFS setup by doing some metadata operations.

    >kinit hdfs/hadoop.kbrelam.com -k -t conf/hdfs.keytab
    >klist
    >kinit -R

    In the first command, we specify the name of the principal (for example, hdfs/hadoop.kbrelam.com) to apply the operations to that principal. The first two commands are theoretically sufficient. However, there is a bug that stops Hadoop from reading the credentials. We can work around this with the last command, which rewrites the credentials in a more readable format. Now let's run HDFS commands.

    >bin/hadoop fs -ls /

  16. Start the DataNode (this must be done as root) by running the following commands:

    >su - root
    >cd /opt/hadoop-1.0.3/
    >export HADOOP_SECURE_DN_USER=hdfs
    >export HADOOP_DATANODE_USER=hdfs
    >bin/hadoop datanode &
    >exit

  17. Configure MapReduce by adding the following code to conf/mapred-site.xml. Replace HADOOP_HOME with the corresponding location.

    <property>
    <name>mapreduce.jobtracker.kerberos.principal</name>
    <value>mapred/_HOST@hadoop.kbrelam.com</value>
    </property>
    <property>
    <name>mapreduce.jobtracker.kerberos.https.principal</name>
    <value>host/_HOST@hadoop.kbrelam.com</value>
    </property>
    <property>
    <name>mapreduce.jobtracker.keytab.file</name>
    <value>HADOOP_HOME/conf/mapred.keytab</value><!-- path to the MapReduce keytab -->
    </property>
    <!-- TaskTracker security configs -->
    <property>
    <name>mapreduce.tasktracker.kerberos.principal</name>
    <value>mapred/_HOST@hadoop.kbrelam.com</value>
    </property>
    <property>
    <name>mapreduce.tasktracker.kerberos.https.principal</name>
    <value>host/_HOST@hadoop.kbrelam.com</value>
    </property>
    <property>
    <name>mapreduce.tasktracker.keytab.file</name>
    <value>HADOOP_HOME/conf/mapred.keytab</value><!-- path to the MapReduce keytab -->
    </property>
    <!-- TaskController settings -->
    <property>
    <name>mapred.task.tracker.task-controller</name>
    <value>org.apache.hadoop.mapred.LinuxTaskController</value>
    </property>
    <property>
    <name>mapreduce.tasktracker.group</name>
    <value>mapred</value>
    </property>

  18. Configure the Linux task controller, which must be used for Kerberos setup.

    >mkdir /etc/hadoop
    >cp conf/taskcontroller.cfg /etc/hadoop/taskcontroller.cfg
    >chmod 755 /etc/hadoop/taskcontroller.cfg

  19. Add the following code to /etc/hadoop/taskcontroller.cfg:

    mapred.local.dir=/opt/hadoop-work/local/
    hadoop.log.dir=HADOOP_HOME/logs
    mapreduce.tasktracker.group=mapred
    banned.users=mapred,hdfs,bin
    min.user.id=1000

    Set up the permissions by running the following command from HADOOP_HOME, and verify that the final permissions of bin/task-controller are rwsr-x---. Otherwise, the jobs will fail to execute.

    >chmod 4750 bin/task-controller
    >ls -l bin/task-controller
    >-rwsr-x--- 1 root mapred 63374 May 9 02:05 bin/task-controller

  20. Start the JobTracker and TaskTracker:

    >sudo -u mapred bin/hadoop jobtracker

    Wait for the JobTracker to start up and then run the following command:

    >sudo -u mapred bin/hadoop tasktracker

  21. Run the job by running the following commands from HADOOP_HOME. If all commands run successfully, you will see the WordCount output.

    >su bob
    >kinit bob/hadoop.kbrelam.com -k -t conf/bob.keytab
    >kinit -R
    >bin/hadoop fs -mkdir /data
    >bin/hadoop fs -mkdir /data/job1
    >bin/hadoop fs -mkdir /data/job1/input
    >bin/hadoop fs -put README.txt /data/job1/input
    >bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /data/job1/input /data/output

How it works...

By running the kinit command, the client obtains a Kerberos ticket and stores it in the filesystem. When we run Hadoop commands, the client uses that Kerberos ticket to access the Hadoop nodes and submit jobs. Hadoop resolves permissions based on the user and group permissions of the Linux user that matches the Kerberos principal.
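
For long-running services or client programs that cannot rely on an interactive kinit, the Hadoop security API also supports logging in directly from a keytab. The following is a minimal sketch, not part of this recipe, assuming a security-enabled Hadoop 1.x build; the class name is ours, and the principal and keytab path follow the names used above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// A minimal sketch (not part of the recipe): authenticate from a keytab
// programmatically instead of running kinit.
public class KeytabLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab("bob/hadoop.kbrelam.com", "conf/bob.keytab");
    System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
  }
}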

Hadoop Kerberos security settings have many pitfalls. The two tools that might be useful are as follows:

  • You can enable debugging by adding the environment variable HADOOP_OPTS="$HADOOP_CLIENT_OPTS -Dsun.security.krb5.debug=true"

  • There is a very useful resource that has descriptions for all error codes:

    https://ccp.cloudera.com/display/CDHDOC/Appendix+E+-+Taskcontroller+Error+Codes

    Also, when you change something, make sure you kill all the running Hadoop processes and then restart them.

Using the Hadoop Tool interface

Hadoop jobs are often executed from the command line. Therefore, each Hadoop job has to support reading, parsing, and processing command-line arguments. To avoid each developer having to rewrite this code, Hadoop provides the org.apache.hadoop.util.Tool interface.

How to do it...

  1. See the following code:

    public class WordcountWithTools extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        if (args.length < 2) {
          System.out.println("chapter3.WordCountWithTools WordCount <inDir> <outDir>");
          ToolRunner.printGenericCommandUsage(System.out);
          System.out.println("");
          return -1;
        }
        Job job = new Job(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(
            new Configuration(), new WordcountWithTools(), args);
        System.exit(res);
      }
    }

  2. Set up an input folder in HDFS with /data/input/README.txt if it doesn't already exist. This can be done through the following commands:

    bin/hadoop fs -mkdir /data/input
    bin/hadoop fs -put README.txt /data/input

  3. Try to run the WordCount without any options, and it will list the available options.

    bin/hadoop jar hadoop-cookbook-chapter3.jar chapter3.WordcountWithTools
    Wordcount <inDir> <outDir>
    Generic options supported are
    -conf <configuration file>    specify an application configuration file
    -D <property=value>    use value for given property
    -fs <local|namenode:port>    specify a namenode
    -jt <local|jobtracker:port>    specify a job tracker
    -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
    The general command line syntax is
    bin/hadoop command [genericOptions] [commandOptions]

  4. Run the WordCount sample with the mapred.job.reuse.jvm.num.tasks option to limit the number of JVMs created by the job, as we learned in an earlier recipe.

    bin/hadoop jar hadoop-cookbook-chapter3.jar
    chapter3.WordcountWithTools
    -D mapred.job.reuse.jvm.num.tasks=1 /data/input /data/output

How it works...

When a job implements the Tool interface (and extends Configured), Hadoop intercepts the command-line arguments, parses the options, and configures the JobConf object accordingly. Therefore, the job supports the standard generic options.
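
Anything passed with -D is therefore already available inside run() through getConf(), without any extra parsing code. The following is a minimal sketch; the class name and the wordcount.output.format property are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// A minimal sketch: a value passed as -D wordcount.output.format=... on the
// command line is visible through getConf() with no extra argument handling.
public class ShowGenericOption extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    String format = getConf().get("wordcount.output.format", "text");
    System.out.println("wordcount.output.format = " + format);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new ShowGenericOption(), args));
  }
}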

Summary

This article described how to perform advanced administration steps for your Hadoop cluster.

About the Authors:


Srinath Perera

Srinath Perera is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. He is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.

He received his Ph.D. and M.Sc. in Computer Science from Indiana University, Bloomington, USA, and his Bachelor of Science degree in Computer Science and Engineering from the University of Moratuwa, Sri Lanka.

He has authored many technical and peer-reviewed research articles; more details can be found on his website. He is also a frequent speaker at technical venues.

He has worked with large-scale distributed systems for a long time, and he works closely with big data technologies such as Hadoop and Cassandra on a daily basis. He also teaches a parallel programming graduate class at the University of Moratuwa, which is primarily based on Hadoop.

Thilina Gunarathne

Thilina Gunarathne is a Ph.D. candidate at the School of Informatics and Computing of Indiana University. He has extensive experience in using Apache Hadoop and related technologies for large-scale data intensive computations. His current work focuses on developing technologies to perform scalable and efficient large-scale data intensive computations on cloud environments.

Thilina has published many articles and peer-reviewed research papers in the areas of distributed and parallel computing, including several papers on extending the MapReduce model to perform efficient data mining and data analytics computations on clouds. Thilina is a regular presenter in both academic and industry settings.

Thilina has contributed to several open source projects at the Apache Software Foundation as a committer and a PMC member since 2005. Before starting his graduate studies, Thilina worked as a senior software engineer at WSO2 Inc., focusing on open source middleware development. Thilina received his B.Sc. in Computer Science and Engineering from the University of Moratuwa, Sri Lanka, in 2006 and his M.Sc. in Computer Science from Indiana University, Bloomington, in 2009. Thilina expects to receive his doctorate in the field of distributed and parallel computing in 2013.

