You're reading from Apache Mahout Essentials (1st Edition, Packt Publishing, June 2015, ISBN-13: 9781783554997).

Author: Jayani Withanawasam

Jayani Withanawasam is an R&D engineer and a senior software engineer at Zaizi Asia, where she focuses on applying machine learning techniques to provide smart content management solutions. She is currently pursuing an MSc degree in artificial intelligence at the University of Moratuwa, Sri Lanka, and completed her BE in software engineering (with first class honors) at the University of Westminster, UK. She has more than 6 years of industry experience in areas such as machine learning, natural language processing, and semantic web technologies. She is passionate about working with semantic technologies and big data.
Chapter 5. Apache Mahout in Production

This chapter discusses how to achieve scalability in Apache Mahout with the Apache Hadoop ecosystem.

In this chapter, we will cover the following topics:

  • Key components of Apache Hadoop

  • The life cycle of a Hadoop application

  • Setting up Hadoop

    • Local mode

    • The pseudo-distributed mode

    • The fully-distributed mode

  • Setting up Apache Mahout with Hadoop

  • Monitoring Hadoop

  • Troubleshooting Hadoop

  • Optimization tips

Introduction


So far, we have discussed key machine learning techniques, such as clustering, classification, and recommendations. However, several other machine learning libraries, such as MATLAB, R, and Weka, can also implement these techniques.

The volume of available information is growing at a staggering rate. Analyzing enormous datasets on a single machine often exhausts its memory. Hence, processing large datasets, or datasets with exponential growth potential, is a key challenge in modern machine learning applications.

The key characteristic that makes Apache Mahout stand out from other machine learning libraries is its ability to scale.

In this chapter, you will see how Apache Mahout achieves scalability in a production environment with Apache Hadoop.

Apache Mahout with Hadoop


Apache Mahout uses Apache Hadoop, which is a distributed computing framework, to achieve scalability. The following figure clearly shows the place where Apache Hadoop fits into Apache Mahout:

As shown in the previous figure, YARN (data processing) and HDFS (data storage) are the key components of Apache Hadoop.

In this chapter, we will explain the important subcomponents of Yet Another Resource Negotiator (YARN) and HDFS and their behavior in detail before proceeding to the Hadoop installation steps.

YARN with MapReduce 2.0

First, let's understand YARN, which is a new addition to Apache Hadoop 2.0.

Earlier, Apache Hadoop operated with MapReduce 1.0, which had drawbacks in cluster resource utilization due to the static allocation of map and reduce slots.

YARN, together with MapReduce 2.0, overcomes this drawback with a flexible resource allocation model based on containers.

The YARN architecture consists of the following subcomponents...

Setting up Hadoop


If you want to run Apache Mahout in local mode (without Hadoop), you need to set the MAHOUT_LOCAL environment variable to some non-empty value, as follows:

export MAHOUT_LOCAL=true

Also, if HADOOP_HOME is not set, Apache Mahout runs locally.
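Putting the two variables together, here is a minimal shell sketch of toggling between local and Hadoop execution; the Hadoop installation path shown is an assumption, not a fixed location:

```shell
# Force local (non-Hadoop) execution: any non-empty value works
export MAHOUT_LOCAL=true

# To run on Hadoop instead, unset MAHOUT_LOCAL and point HADOOP_HOME
# at your Hadoop installation (the path below is only an example):
# unset MAHOUT_LOCAL
# export HADOOP_HOME=/usr/local/hadoop

echo "MAHOUT_LOCAL=${MAHOUT_LOCAL}"
```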

So, if you want to run Apache Mahout with Hadoop, then there are three possible options available:

  • Local mode

  • The pseudo-distributed mode

  • The fully-distributed mode

You can select the Hadoop mode that best suits you, depending on the requirement at hand.

Setting up Mahout in local mode

Local mode is the simplest of all the Hadoop modes and requires the fewest configuration changes.

In this mode, Hadoop runs as a single JVM process. The Hadoop daemons, such as the resource manager, name node, node manager, data nodes, and secondary name node, are not running. Also, there is no HDFS-related file processing in this mode.

Prerequisites

The Hadoop framework is an open source software framework implemented in Java.

Java installation

Hadoop requires Java 7 or a later...
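Before proceeding, you can verify what Java version, if any, is already installed; this sketch guards against the case where no JDK is present:

```shell
# Check the installed Java version before setting up Hadoop;
# Hadoop 2.x requires Java 7 or later
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "Java is not installed"
fi
```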

Monitoring Hadoop


Apache Hadoop daemons can be monitored using different mechanisms.

Commands/scripts

The running JVMs related to Hadoop can be displayed using the following command (use the correct Java installation location):

/usr/lib/jvm/java-7-oracle/bin/jps

The outcome of the preceding command is given in the following figure:

Data nodes

Active data nodes in the cluster can be displayed using the following command:

[Hadoop installation directory]/bin/hdfs dfsadmin -report

The outcome of the preceding command for a cluster with two data nodes is shown in the following figure:

Node managers

Active node managers can be monitored using the following command:

[Hadoop installation directory]/bin/yarn node -list

The outcome of the preceding command for a cluster with two node managers is shown in the following figure:

Web UIs

Apache Hadoop provides web UIs to monitor MapReduce job processing details.

As shown in the following figure, NameNode operations in HDFS can be monitored at http://localhost...

Setting up Mahout with Hadoop's fully-distributed mode


Once Apache Hadoop is successfully installed, we can integrate Apache Mahout with it using the following simple steps:

  1. Download and install Apache Mahout.

  2. Set the following environment variables:

    HADOOP_CONF_DIR="[HADOOP INSTALLATION DIRECTORY]/etc/hadoop"
    HADOOP_HOME="[HADOOP INSTALLATION DIRECTORY]"
    MAHOUT_HOME="[MAHOUT INSTALLATION DIRECTORY]"
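On a typical Linux setup, these variables could be exported in your shell profile as follows; the installation paths below are assumptions, so substitute your actual directories:

```shell
# Paths are placeholders -- substitute your actual install directories
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export MAHOUT_HOME=/usr/local/mahout

# Make the hadoop and mahout launchers available on the PATH
export PATH="$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin"

echo "$HADOOP_CONF_DIR"
```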
    

Troubleshooting Hadoop


During the installation process, you might encounter issues related to configuration values, ports, and connectivity problems. Even though it is not possible to provide solutions for each and every potential issue that you might encounter, the following hints will be helpful to troubleshoot effectively and efficiently:

  1. Check the following environment variable values for different logs:

    MAHOUT_LOG_DIR
    MAHOUT_LOGFILE
    
  2. Check the log files at the following location for Hadoop application specific issues:

    [Hadoop installation directory]/logs/userlogs
    
  3. Make sure that hostnames are specified correctly across all the nodes in the cluster:

    Check the /etc/hosts file for correct IP/hostname mappings in all nodes.
    
  4. Check port numbers for accuracy in the configuration files, and check whether you have given hostname:port correctly in all the relevant configuration files.
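For step 3, a consistent /etc/hosts entry set, replicated on every node, might look like the following sketch; the IP addresses and hostnames are illustrative assumptions:

```
127.0.0.1      localhost
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2
```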

Optimization tips


Configuring the following entries according to the hardware and software configuration of the Hadoop cluster helps make optimal use of the available resources, such as CPU and memory.

The important configurations in the mapred-site.xml file are given as follows:

  1. Set the maximum tasks that can be executed in the map phase and the reduce phase:

    mapreduce.tasktracker.map.tasks.maximum
    mapreduce.tasktracker.reduce.tasks.maximum
    
  2. Set the number of map and reduce tasks according to number of cores available:

    mapreduce.job.reduces
    mapreduce.job.maps
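As a sketch, the entries above would appear in mapred-site.xml as property elements like these; the values shown are assumptions for a small quad-core worker node, not recommendations:

```xml
<!-- mapred-site.xml (sketch; values are assumptions, tune per node) -->
<configuration>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.job.maps</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>2</value>
  </property>
</configuration>
```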
    

The important configurations in the hdfs-site.xml file are given as follows:

  1. Set the block size for the files according to the storage requirements of your problem:

    dfs.blocksize
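Similarly, a block-size override in hdfs-site.xml might look like the following; the 256 MB value is an assumption suited to large sequential files, not a general recommendation (the Hadoop 2.x default is 128 MB):

```xml
<!-- hdfs-site.xml (sketch; 256 MB block size is an assumption) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <!-- 256 * 1024 * 1024 bytes -->
    <value>268435456</value>
  </property>
</configuration>
```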
    

However, discussing the performance-tuning approaches for Hadoop in detail is beyond the scope of this book.

Summary


Apache Hadoop plays a key role in Apache Mahout's scalability, which differentiates it from other machine learning libraries.

Apache Hadoop provides data processing (YARN) and data storage (HDFS) capabilities to Apache Mahout. The key components (daemons) of Apache Hadoop are the resource manager, node managers, name node, data nodes, and secondary name node.

Apache Hadoop can be installed in three different modes, namely local mode, pseudo-distributed mode, and fully-distributed mode.

Furthermore, Apache Hadoop provides scripts and Web UIs to monitor its daemons.

In the next chapter, we will discuss visualization techniques in Apache Mahout.
