You're reading from Apache Mahout Essentials (1st Edition, Packt Publishing, June 2015, ISBN-13: 9781783554997).

Author: Jayani Withanawasam

Jayani Withanawasam is an R&D engineer and a senior software engineer at Zaizi Asia, where she focuses on applying machine learning techniques to provide smart content management solutions. She is currently pursuing an MSc degree in artificial intelligence at the University of Moratuwa, Sri Lanka, and completed her BE in software engineering (with first class honors) at the University of Westminster, UK. She has more than 6 years of industry experience in areas such as machine learning, natural language processing, and semantic web technologies. She is passionate about working with semantic technologies and big data.
Chapter 5. Apache Mahout in Production

This chapter discusses how to achieve scalability in Apache Mahout with the Apache Hadoop ecosystem.

In this chapter, we will cover the following topics:

  • Key components of Apache Hadoop

  • The life cycle of a Hadoop application

  • Setting up Hadoop

    • Local mode

    • The pseudo-distributed mode

    • The fully-distributed mode

  • Setting up Apache Mahout with Hadoop

  • Monitoring Hadoop

  • Troubleshooting Hadoop

  • Optimization tips

Introduction


So far, we have discussed key machine learning techniques, such as clustering, classification, and recommendations. However, several other machine learning libraries, such as MATLAB, R, and Weka, can also implement these techniques.

The volume of available information is growing at a staggering rate. Analyzing enormous datasets on a single machine often exhausts its memory. Hence, processing large datasets, or datasets with exponential growth potential, is a key challenge in modern machine learning applications.

The key characteristic that makes Apache Mahout stand out from other machine learning libraries is its ability to scale.

In this chapter, you will see how Apache Mahout achieves scalability in a production environment with Apache Hadoop.

Apache Mahout with Hadoop


Apache Mahout uses Apache Hadoop, which is a distributed computing framework, to achieve scalability. The following figure clearly shows the place where Apache Hadoop fits into Apache Mahout:

As shown in the previous figure, YARN (data processing) and HDFS (data storage) are the key components of Apache Hadoop.

In this chapter, we will explain the important subcomponents of Yet Another Resource Negotiator (YARN) and HDFS and their behavior in detail before proceeding to the Hadoop installation steps.

YARN with MapReduce 2.0

First, let's understand YARN, which is a new addition to Apache Hadoop 2.0.

Earlier, Apache Hadoop operated with MapReduce 1.0, which had drawbacks in cluster resource utilization due to the static allocation of map and reduce slots.

YARN, together with MapReduce 2.0, overcomes this drawback with a flexible resource allocation model based on containers.

The YARN architecture consists of the following subcomponents...

Setting up Hadoop


If you want to run Apache Mahout in local mode (without Hadoop), you need to set the MAHOUT_LOCAL environment variable to some non-empty value, as follows:

export MAHOUT_LOCAL=true

Also, if HADOOP_HOME is not set, Apache Mahout runs locally.
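Putting the two variables together, here is a minimal shell sketch of toggling between local and Hadoop execution; the Hadoop installation path shown is an assumption, not a fixed location:

```shell
# Force local (non-Hadoop) execution: any non-empty value works
export MAHOUT_LOCAL=true

# To run on Hadoop instead, unset MAHOUT_LOCAL and point HADOOP_HOME
# at your Hadoop installation (the path below is only an example):
# unset MAHOUT_LOCAL
# export HADOOP_HOME=/usr/local/hadoop

echo "MAHOUT_LOCAL=${MAHOUT_LOCAL}"
```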

So, if you want to run Apache Mahout with Hadoop, then there are three possible options available:

  • Local mode

  • The pseudo-distributed mode

  • The fully-distributed mode

You can select the Hadoop mode that best suits you, depending on the requirement at hand.

Setting up Mahout in local mode

Local mode is the simplest of all the Hadoop modes and requires the fewest configuration changes.

In this mode, Hadoop runs as a single JVM process. The Hadoop daemons, such as the resource manager, name node, node manager, data nodes, and secondary name node, are not running. Also, there is no HDFS-related file processing in this mode.

Prerequisites

The Hadoop framework is an open source software framework implemented in Java.

Java installation

Hadoop requires Java 7 or a later...
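Before proceeding, you can verify what Java version, if any, is already installed; this sketch guards against the case where no JDK is present:

```shell
# Check the installed Java version before setting up Hadoop;
# Hadoop 2.x requires Java 7 or later
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "Java is not installed"
fi
```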

Monitoring Hadoop


Apache Hadoop daemons can be monitored using different mechanisms.

Commands/scripts

The running JVMs related to Hadoop can be displayed using the following command (use the correct Java installation location):

/usr/lib/jvm/java-7-oracle/bin/jps

The outcome of the preceding command is given in the following figure:

Data nodes

Active data nodes in the cluster can be displayed using the following command:

[Hadoop installation directory]/bin/hdfs dfsadmin -report

The outcome of the preceding command for a cluster with two data nodes is shown in the following figure:

Node managers

Active node managers can be monitored using the following command:

[Hadoop installation directory]/bin/yarn node -list

The outcome of the preceding command for a cluster with two node managers is shown in the following figure:

Web UIs

Apache Hadoop provides web UIs to monitor MapReduce job processing details.

As shown in the following figure, NameNode operations in HDFS can be monitored at http://localhost...

Setting up Mahout with Hadoop's fully-distributed mode


Once Apache Hadoop is successfully installed, we can integrate Apache Mahout with it using the following simple steps:

  1. Download and install Apache Mahout.

  2. Set the following environment variables:

    HADOOP_CONF_DIR="[HADOOP INSTALLATION DIRECTORY]/etc/hadoop"
    HADOOP_HOME="[HADOOP INSTALLATION DIRECTORY]"
    MAHOUT_HOME="[MAHOUT INSTALLATION DIRECTORY]"
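On a typical Linux setup, these variables could be exported in your shell profile as follows; the installation paths below are assumptions, so substitute your actual directories:

```shell
# Paths are placeholders -- substitute your actual install directories
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export MAHOUT_HOME=/usr/local/mahout

# Make the hadoop and mahout launchers available on the PATH
export PATH="$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin"

echo "$HADOOP_CONF_DIR"
```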
    

Troubleshooting Hadoop


During the installation process, you might encounter issues related to configuration values, ports, and connectivity problems. Even though it is not possible to provide solutions for each and every potential issue that you might encounter, the following hints will be helpful to troubleshoot effectively and efficiently:

  1. Check the following environment variable values for different logs:

    MAHOUT_LOG_DIR
    MAHOUT_LOGFILE
    
  2. Check the log files at the following location for Hadoop application specific issues:

    [Hadoop installation directory]/logs/userlogs
    
  3. Make sure that hostnames are specified correctly across all the nodes in the cluster:

    Check the /etc/hosts file for correct IP/hostname mappings in all nodes.
    
  4. Check port numbers for accuracy in the configuration files, and check whether you have given hostname:port correctly in all the relevant configuration files.
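For step 3, a consistent /etc/hosts entry set, replicated on every node, might look like the following sketch; the IP addresses and hostnames are illustrative assumptions:

```
127.0.0.1      localhost
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2
```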

Optimization tips


Configuring the following entries according to the hardware and software configuration of the Hadoop cluster helps make optimal use of the available resources, such as CPU and memory.

The important configurations in the mapred-site.xml file are given as follows:

  1. Set the maximum tasks that can be executed in the map phase and the reduce phase:

    mapreduce.tasktracker.map.tasks.maximum
    mapreduce.tasktracker.reduce.tasks.maximum
    
  2. Set the number of map and reduce tasks according to number of cores available:

    mapreduce.job.reduces
    mapreduce.job.maps
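As a sketch, the entries above would appear in mapred-site.xml as property elements like these; the values shown are assumptions for a small quad-core worker node, not recommendations:

```xml
<!-- mapred-site.xml (sketch; values are assumptions, tune per node) -->
<configuration>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.job.maps</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>2</value>
  </property>
</configuration>
```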
    

The important configurations in the hdfs-site.xml file are given as follows:

  1. Set the block size for the files according to the storage requirements of your problem:

    dfs.blocksize
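Similarly, a block-size override in hdfs-site.xml might look like the following; the 256 MB value is an assumption suited to large sequential files, not a general recommendation (the Hadoop 2.x default is 128 MB):

```xml
<!-- hdfs-site.xml (sketch; 256 MB block size is an assumption) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <!-- 256 * 1024 * 1024 bytes -->
    <value>268435456</value>
  </property>
</configuration>
```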
    

However, discussing the performance-tuning approaches for Hadoop in detail is beyond the scope of this book.

Summary


Apache Hadoop plays a key role in Apache Mahout's scalability, which differentiates it from other machine learning libraries.

Apache Hadoop provides data processing (YARN) and data storage (HDFS) capabilities to Apache Mahout. The key components (daemons) of Apache Hadoop are the resource manager, node managers, name node, data nodes, and secondary name node.

Apache Hadoop can be installed in three different modes, namely local mode, pseudo-distributed mode, and fully-distributed mode.

Furthermore, Apache Hadoop provides scripts and Web UIs to monitor its daemons.

In the next chapter, we will discuss visualization techniques in Apache Mahout.
