Learning Apache Mahout

About this book

Publication date: March 2015
Publisher: Packt
Pages: 250
ISBN: 9781783555215

 

Chapter 1. Introduction to Mahout

Mahout is an open source machine learning library from Apache. Mahout primarily implements clustering, recommender engines (collaborative filtering), classification, and dimensionality reduction algorithms but is not limited to these.

The aim of Mahout is to provide a scalable implementation of commonly used machine learning algorithms. Mahout is the machine learning tool of choice if the data to be used is large. What we generally mean by large is that the data cannot be processed on a single machine. With Big Data becoming an important focus area, Mahout fulfils the need for a machine learning tool that can scale beyond a single machine. The focus on scalability differentiates Mahout from other tools such as R, Weka, and so on.

The learning implementations in Mahout are written in Java, and major portions, though not all, are built upon Apache's Hadoop distributed computation project using the MapReduce paradigm. Efforts are underway to rebuild Mahout on Apache Spark using a Scala DSL; programs written in the Scala DSL will be automatically optimized and executed in parallel on Apache Spark. Commits of new MapReduce algorithms have been stopped, but the existing MapReduce implementations will continue to be supported.

The purpose of this chapter is to understand the fundamental concepts behind Mahout. In particular, we will cover the following topics:

  • Why Mahout

  • When Mahout

  • How Mahout

 

Why Mahout


We already have many good open source machine learning software tools. The statistical language R has a very large community, good IDEs, and a large collection of machine learning packages already implemented. Python has a strong community and is multipurpose, and in Java we have Weka.

So what is the need for a new machine learning framework?

The answer lies in the scale of data. Organizations are generating terabytes of data daily and there is a need for a machine learning framework that can process that amount of data.

This raises a question: can't we just sample the data and use existing tools for our analytics use cases?

Simple techniques and more data is better

Collecting and processing data is much easier today than, say, a decade ago. IT infrastructure has seen enormous improvement: ETL tools, clickstream providers such as Google Analytics, and stream processing frameworks such as Kafka and Storm have made collecting data much easier. Platforms such as Hadoop and Cassandra, and MPP databases such as Teradata, have made storing and processing huge amounts of data much easier than before. From a large-scale production standpoint, we have seen that simpler algorithms on very large amounts of data produce reasonably good results.

Sampling is difficult

Sampling may lead to over-fitting and increases the complexity of preparing the data used to build models for the problem at hand. Though sampling tends to simplify things, by allowing scientists to work on a small sample instead of the whole population and letting existing tools like R scale up to the task, getting a representative sample is tricky.

I'd say when you have the choice of getting more data, take it. Never discard data. Throw more (commodity) hardware at the data by using platforms and tools such as Hadoop and Mahout.

Community and license

Another advantage of Mahout is its license. Mahout is Apache licensed, which means that you can incorporate pieces of it into your own software regardless of whether you want to release your source code. Other ML software, such as Weka, is under the GPL license, which means that incorporating it into your software forces you to release the source code of any software you package with Weka components.

 

When Mahout


We have discussed the advantages of using Mahout; let's now discuss the scenarios where using Mahout is a good choice.

Data too large for single machine

If the data is too large to process on a single machine, then it is a good starting point to think about a distributed system. Rather than scaling up by buying bigger hardware, it can be a better option to scale out: buy more machines and distribute the processing.

Data already on Hadoop

A lot of enterprises have adopted Hadoop as their Big Data platform and have used it to store and aggregate data. Mahout has been designed to run algorithms on top of Hadoop and has a relatively straightforward configuration.

If your data or the bulk of it is already on Hadoop, then Mahout is a natural choice to run machine learning algorithms.

Algorithms implemented in Mahout

Do check whether the use case that needs to be implemented has a corresponding algorithm implemented in Mahout, or whether you have the required expertise to extend Mahout and implement your own algorithms.

 

How Mahout


In this section, you will learn how to install and configure Mahout.

Setting up the development environment

For any development work involving Mahout, and to follow the examples in this book, you will require the following setup:

  • Java 1.6 or higher

  • Maven 2.0 or higher

  • Hadoop 1.2.1

  • Eclipse with Maven plugin

  • Mahout 0.9

I prefer to try out the latest version, except when there are known compatibility issues. To configure Hadoop, follow the instructions on this page: http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html. We will focus on configuring Maven, Eclipse with the Maven plugin, and Mahout.

Configuring Maven

Maven can be downloaded from one of the mirrors of the Apache website: http://maven.apache.org/download.cgi. We use Apache Maven 3.2.5, which can be downloaded and installed using the following commands:

cd $HOME/Downloads
wget http://apache.mirrors.tds.net/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
cd /usr/local
sudo tar xzf $HOME/Downloads/apache-maven-3.2.5-bin.tar.gz
sudo mv apache-maven-3.2.5 maven
sudo chown -R $USER maven

Configuring Mahout

Mahout can be configured to run with or without Hadoop. Currently, efforts are underway to port Mahout to Apache Spark, but that work is at a nascent stage. We will discuss Mahout on Spark in Chapter 8, New Paradigm in Mahout. In this chapter, you are going to learn how to configure Mahout on top of Hadoop.

We will set up two configurations for Mahout. The first will be used for practicing Mahout's command line examples; the other, compiled from source, will be used to develop Mahout code using the Java API and Eclipse.

Though one Mahout configuration would suffice, I will take this opportunity to discuss both approaches.

Download the latest Mahout version using one of the mirrors listed at the Apache Mahout website: https://mahout.apache.org/general/downloads.html. The current release version is mahout-distribution-0.9.tar.gz. After the download completes, the archive should be in the Downloads folder under the user's home directory. Type the following on the command line; the first command moves the shell prompt to the /usr/local directory:

cd /usr/local

Extract the downloaded archive to the mahout-distribution-0.9 directory under /usr/local. The tar command is used to extract the archive:

sudo tar xzf $HOME/Downloads/mahout-distribution-0.9.tar.gz

The third command mv renames the directory from mahout-distribution-0.9 to mahout:

sudo mv mahout-distribution-0.9 mahout

The last command, chown, changes the ownership of the directory from the root user to the current user. The Linux command chown is used for changing the ownership of files and directories. The argument -R instructs the chown command to recursively change the ownership of subdirectories, and $USER holds the value of the logged-in user's username:

sudo chown -R $USER mahout

We need to update the .bashrc file to export the required variables and update the $PATH variable:

cd $HOME
vi .bashrc

Add the following statements at the end of the file:

#Statements related to Mahout
export MAVEN_HOME=/usr/local/maven
export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAVEN_HOME/bin:$MAHOUT_HOME/bin
###end of mahout statements

Exit from all existing terminals, start a new terminal, and enter the following command:

echo $PATH

Check whether the output contains the paths recently added for Maven and Mahout.

Type the following commands on the command line; both commands should be recognized:

mvn --version
mahout

Configuring Eclipse with the Maven plugin and Mahout

Download Eclipse from the Eclipse mirror mentioned on the home page. We have used Eclipse Kepler SR2 for this book. The downloaded archive should be in the Downloads folder of the user's home directory. Open a terminal and enter the following commands:

cd /usr/local
sudo tar xzf $HOME/Downloads/eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz
sudo chown -R $USER eclipse

Go into the Eclipse directory and open the Eclipse GUI. We will now install the Maven plugin. Click on Help, then Eclipse Marketplace, and in the search panel type m2e and search. Once the search results are displayed, hit Install and follow the instructions. To complete the installation, hit the Next button and press the Accept button whenever prompted. Once the installation is done, Eclipse will prompt for a restart. Hit OK and let Eclipse restart.

Now, to add the Mahout dependencies to any Maven project that needs them, add the following to the pom.xml file:

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-examples</artifactId>
      <version>0.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.9</version>
    </dependency>

Eclipse will download and add all the dependencies.

Now we should import this book's code repository into Eclipse. Open Eclipse and follow the sequence of steps given next. The pom.xml file has all the dependencies included in it, and Eclipse will download and resolve them.

Go to File | Import | Maven | Existing Maven Projects | Next | Browse to the location of the source folder that comes with this book | Finish.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Mahout command line

Mahout provides an option for the command line execution of machine learning algorithms. Using the command line, an initial prototype of the model can be built quickly.

A few command line examples are discussed in this chapter. A great place to start is to go through Mahout's example scripts, which are located in the examples/bin folder under the Mahout home folder:

cd $MAHOUT_HOME
cd examples/bin
ls -ltr


Open the README.txt file in the vi editor and read the description of the scripts. We will discuss them in the subsequent sections of this chapter:

vi README.txt


Tip

It is a good idea to try out a few command line Mahout algorithms before writing Mahout Java code. This way we can shortlist a few algorithms that might work on the given data and problem, and save a lot of time.

A clustering example

In this section, we will discuss the command line implementation of clustering in Mahout and use the example script as reference.

On the terminal, please type:

vi cluster-reuters.sh

This script clusters the Reuters dataset using a variety of algorithms. It downloads the dataset automatically, parses and copies it to HDFS (Hadoop Distributed File System), and based upon user input, runs the corresponding clustering algorithm.

In the vi editor, type the following command:

:set number

This will display line numbers for the lines in the file. The algorithms implemented are kmeans, fuzzykmeans, lda, and streamingkmeans; line 42 of the script has a list of all the algorithms it implements:

algorithm=( kmeans fuzzykmeans lda streamingkmeans ) # a list of all algorithms implemented in the script

Input is taken from the user in line 51 by the read statement:

read -p "Enter your choice : " choice

Line 57 sets the temp working directory variable:

WORK_DIR=/tmp/mahout-work-${USER}

Lines 70 to 73 first check whether the file is already present in the working directory; then, on line 79, the curl statement downloads the Reuters data to it:

curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ${WORK_DIR}/reuters21578.tar.gz

From line 89, the Reuters tar is extracted to the reuters-sgm folder under the working directory:

tar xzf ${WORK_DIR}/reuters21578.tar.gz -C ${WORK_DIR}/reuters-sgm

Reuters raw data file

Let's have a look at one of the raw files. Open the reut2-000.sgm file in a text editor such as vi or gedit.

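Each raw file is a concatenation of SGML documents. The following fragment is a simplified illustration of what a single document looks like; the attribute and field values here are illustrative, and the actual files carry more header attributes:

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in the Bahia cocoa zone ...
</BODY>
</REUTERS>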

The Reuters data is distributed across 22 files, each of which contains 1,000 documents, except for the last (reut2-021.sgm), which contains 578 documents. The files are in SGML (Standard Generalized Markup Language) format, which is similar to XML, and they need to be parsed.

On line 93, the Reuters data is parsed using Lucene. Lucene has built-in classes and functions to process different file formats. The logic of parsing the Reuters dataset is implemented in the ExtractReuters class. The SGML file is parsed and the text elements are extracted from it.

Tip

Apache Lucene is a free/open source information retrieval software library.

We will use the ExtractReuters class to extract the SGML files to text format:

$MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out

Now let's look at the processed output: for each SGML document we saw previously, the ExtractReuters class writes a plain text file containing only the extracted text elements.

On lines 95 to 101, data is loaded from a local directory to HDFS, deleting the reuters-sgm and reuters-out folders if they already exist:

  echo "Copying Reuters data to Hadoop"
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
  $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out

On line 105, the files are converted into sequence files. Mahout works with sequence files.

Tip

Sequence files are the standard input of Mahout machine learning algorithms.

$MAHOUT seqdirectory -i ${WORK_DIR}/reuters-out -o ${WORK_DIR}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential
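To see what seqdirectory actually produced, you can read the sequence file back with Hadoop's SequenceFile.Reader. The following is a minimal inspection sketch; the path is hypothetical, and seqdirectory writes Text keys (document IDs) and Text values (document content):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileInspector {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical path: one of the chunk files written by seqdirectory
        Path path = new Path("/tmp/mahout-work/reuters-out-seqdir/chunk-0");
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();   // document ID
        Text value = new Text(); // document content
        try {
            while (reader.next(key, value)) {
                String text = value.toString();
                // Print the document ID and the first few characters of its text
                System.out.println(key + " => " + text.substring(0, Math.min(60, text.length())));
            }
        } finally {
            reader.close();
        }
    }
}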

On lines 109 to 111, the sequence file is converted to a vector representation. Text needs to be converted into a vector representation so that a machine learning algorithm can process it. We will talk about text vectorization in detail in Chapter 10, Case Study – Text Analytics.

$MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ \
  -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

From here on, we will only explain the k-means algorithm execution; we encourage you to read and understand the other three implementations too. A detailed discussion of clustering will be covered in Chapter 7, Clustering with Mahout.

Clustering is the process of partitioning a bunch of data points into related groups called clusters. K-means clustering partitions a dataset into a specified number of clusters by minimizing the distance between each data point and the center of the cluster using a distance metric. A distance metric is a way to define how far or near a data point is from another. K-means requires users to provide the number of clusters and optionally user-defined cluster centroids.
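To make the assign-and-update iteration concrete, here is a self-contained toy sketch of k-means on one-dimensional points. It illustrates only the basic idea, not Mahout's distributed implementation; the points and initial centroids are made up:

import java.util.Arrays;

public class ToyKMeans {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centroids = {1.0, 10.0}; // initial centroids (k = 2)

        for (int iter = 0; iter < 10; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];

            // Assignment step: attach each point to its nearest centroid
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sum[nearest] += p;
                count[nearest]++;
            }

            // Update step: move each centroid to the mean of its assigned points
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        System.out.println("Final centroids: " + Arrays.toString(centroids));
    }
}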

To better understand how data points are clustered together, picture a plot of three clusters: points that lie near one another are grouped into three distinct clusters, while a few points that don't belong to any cluster represent outliers, which should be removed prior to clustering.

The important command line parameters for k-means clustering in Mahout are as follows:

  • --input (-i): The path to the job input directory.

  • --clusters (-c): The input centroids; they must be a SequenceFile of type Writable or Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first.

  • --output (-o): The directory pathname for the output.

  • --distanceMeasure (-dm): The class name of the DistanceMeasure to use; the default is SquaredEuclidean.

  • --convergenceDelta: The convergence delta value; the default is 0.5.

  • --maxIter (-x): The maximum number of iterations.

  • --maxRed (-r): The number of reduce tasks; this defaults to 2.

  • --k (-k): The k in k-means. If specified, then a random selection of k vectors will be chosen as the centroids and written to the clusters input path.

  • --overwrite (-ow): If present, overwrite the output directory before running the job.

  • --help (-h): This prints out help.

  • --clustering (-cl): If present, run clustering after the iterations have taken place.

Lines 113 to 118 take the sparse matrix and run the k-means clustering algorithm using the cosine distance metric. We pass -k, the number of clusters, as 20 and -x, the maximum number of iterations, as 10:

$MAHOUT kmeans \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
  -c ${WORK_DIR}/reuters-kmeans-clusters \
  -o ${WORK_DIR}/reuters-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -k 20 -ow --clustering

Lines 120 to 125 use the clusterdump utility, which reads the clusters in sequence file format and converts them to text files:

$MAHOUT clusterdump \
  -i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
  -o ${WORK_DIR}/reuters-kmeans/clusterdump \
  -d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile -b 100 -n 20 --evaluate \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
  --pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints \
&& \
cat ${WORK_DIR}/reuters-kmeans/clusterdump

The clusterdump utility outputs the center of each cluster and the top terms in that cluster.

A classification example

In this section, we will discuss the command line implementation of classification in Mahout and use the example script as a reference.

Classification is the task of identifying which of a set of predefined classes a data point belongs to. Classification involves training a model with a labeled (previously classified) dataset and then predicting new, unlabeled data using that model. The common workflow for a classification problem is:

  1. Data preparation

  2. Train model

  3. Test model

  4. Performance measurement

Repeat these steps until the desired performance is achieved, the best possible solution is reached, or the project's time is up.

On the terminal, please type:

vi classify-20newsgroups.sh

In the vi editor, type the following command to show line numbers for the lines in the script:

:set number

The algorithms implemented in the script are cnaivebayes, naivebayes, and sgd; a last option, clean, cleans up the work directory.

Line 44 creates a working directory for the dataset and all input/output:

export WORK_DIR=/tmp/mahout-work-${USER}

Lines 64 to 74 download and extract the 20news-bydate.tar.gz file after making sure it is not already downloaded:

  if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
    if [ ! -e ${WORK_DIR}/20news-bydate ]; then
      if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
        echo "Downloading 20news-bydate"
        curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
      fi
      mkdir -p ${WORK_DIR}/20news-bydate
      echo "Extracting..."
      cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
    fi
  fi

The 20 newsgroups dataset consists of messages, one per file. Each file begins with header lines that specify things such as who sent the message, how long it is, what kind of software was used, and the subject. A blank line follows, and then the message body follows as unformatted text.
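For illustration, a message file has roughly the following shape; the header values and body here are made up:

From: jdoe@example.com (John Doe)
Subject: Re: Clutch problem
Organization: Example University
Lines: 14

My 1984 sedan makes a grinding noise when I press the clutch.
Has anyone seen this before? ...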

Lines 90 to 101 prepare the directory and copy the data to the Hadoop directory:

  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

  if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
    echo "Copying 20newsgroups data to HDFS"
    set +e
    $HADOOP dfs -rmr ${WORK_DIR}/20news-all
    set -e
    $HADOOP dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  fi


Lines 104 to 106 convert the full 20 newsgroups dataset into sequence files:

mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow

Lines 109 to 111 convert the sequence files to vectors, calculating the term frequency (TF) and inverse document frequency (IDF). TF-IDF is a way of representing text numerically; a toy illustration of the idea follows the command:

./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv  -wt tfidf
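As a toy illustration of the idea behind this weighting (not Mahout's exact formula, which adds smoothing and normalization options), the following sketch computes a TF-IDF weight for one term by hand over made-up documents:

public class ToyTfIdf {
    public static void main(String[] args) {
        String[][] docs = {
            {"wheat", "farm", "wheat", "price"},
            {"farm", "subsidy"},
            {"price", "of", "wheat"}
        };
        String term = "wheat";
        int docIndex = 0;

        // Term frequency: occurrences of the term in the chosen document
        int tf = 0;
        for (String w : docs[docIndex]) {
            if (w.equals(term)) tf++;
        }

        // Document frequency: number of documents containing the term
        int df = 0;
        for (String[] doc : docs) {
            for (String w : doc) {
                if (w.equals(term)) { df++; break; }
            }
        }

        // Inverse document frequency, using a common log formulation
        double idf = Math.log((double) docs.length / df);
        System.out.println("tf-idf(" + term + ") = " + tf * idf);
    }
}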

Lines 114 to 118 split the preprocessed dataset into training and test sets. The test set will be used to evaluate the performance of the model trained on the training set:

./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors  \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

Lines 120 to 125 train the classifier using the training sets:

./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c

Lines 129 to 133 test the classifier using the test sets:

./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c

Mahout API – a Java program example

Though using Mahout from the command line is convenient, fast, and serves the purpose in many scenarios, learning the Mahout API is important too. Using the API gives you more flexibility when creating your machine learning application, and not all algorithms can easily be called from the command line. Working with the Mahout API also helps you understand the internals of a machine learning algorithm.

The Mahout core JAR files contain the implementation of the main machine learning classes, and the Mahout examples JAR file has some example code and wrappers built around the Mahout core classes. It is worth spending time going through the documentation and getting an overall understanding. The documentation for the version you are using can be found in the Mahout installation directory.


We will now look at a Mahout code example. We will write a classification example in which we train an algorithm to predict whether a client has subscribed to a term deposit. Classification refers to the process of labeling a particular instance or row with a particular predefined category, called a class label. The purpose of the following example is to give you a feel for development using Mahout, Eclipse, and Maven.

The dataset

We will use the bank-additional-full.csv file present in the learningApacheMahout/data/chapter4 directory as the input for our example. First, let's have a look at the structure of the data and try to understand it. The following table shows the input variables along with their descriptions and data types:

Column Name     Description                                          Variable Type
age             Age of the client                                    Numeric
job             Type of job, for example, entrepreneur,              Categorical
                housemaid, or management
marital         Marital status                                       Categorical
education       Education level                                      Categorical
default         Has the client defaulted on credit?                  Categorical
housing         Does the client have a housing loan?                 Categorical
loan            Does the client have a personal loan?                Categorical
contact         Contact communication type                           Categorical
month           Last contact month of the year                       Categorical
day_of_week     Last contact day of the week                         Categorical
duration        Last contact duration, in seconds                    Numeric
campaign        Number of contacts made during this campaign         Numeric
pdays           Number of days that passed since the last contact    Numeric
previous        Number of contacts performed before this campaign    Numeric
poutcome        Outcome of the previous marketing campaign           Categorical
emp.var.rate    Employment variation rate (quarterly indicator)      Numeric
cons.price.idx  Consumer price index (monthly indicator)             Numeric
cons.conf.idx   Consumer confidence index (monthly indicator)        Numeric
euribor3m       Euribor 3-month rate (daily indicator)               Numeric
nr.employed     Number of employees (quarterly indicator)            Numeric
y               Has the client subscribed to a term deposit?         Categorical (target)

Based on many attributes of the customer, we try to predict the target variable y (has the client subscribed to a term deposit?), which has a set of two predefined values, Yes and No. We need to remove the header line to use the data.

We will use logistic regression to build the model; logistic regression is a statistical technique that computes the probability of an unclassified item belonging to a predefined class.
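At its core, logistic regression computes a weighted sum of the input features and passes it through the logistic (sigmoid) function to obtain a probability between 0 and 1. A minimal sketch of that computation follows; the weights and feature values here are made up:

public class LogisticSketch {
    // The logistic (sigmoid) function squashes any real value into (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] weights = {0.8, -1.2, 0.3};  // hypothetical learned weights
        double[] features = {1.0, 0.5, 2.0};  // one input instance

        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        // Probability that this instance belongs to the positive class
        System.out.println("P(y = yes | x) = " + sigmoid(z));
    }
}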

You may want to run the example along with the source code that ships with this book; I will explain the important steps in the following section. In Eclipse, open the OnlineLogisticRegressionTrain.java file from the chapter4.logistic.src package, which is present in the learningApacheMahout/src/main/java/chapter4/src/logistic directory in the code folder that comes with this book.

The first step is to identify the source and target folders:

String inputFile = "data/chapter4/train_data/input_bank_data.csv";
String outputFile = "data/chapter4/model";

Once we know where to get the data from, we need to tell the algorithm about how to interpret the data. We pass the column name and the corresponding column type; here, n denotes the numeric column and w, the categorical columns of the data:

List<String> predictorList =Arrays.asList("age","job","marital","education","default","housing","loan","contact","month","day_of_week","duration","campaign","pdays","previous","poutcome","emp.var.rate","cons.price.idx","cons.conf.idx","euribor3m","nr.employed");

List<String> typeList = Arrays.asList("n","w","w","w","w","w","w","w","w","w","n","n","n","n","w","n","n","n","n","n");

Next, set the classifier parameters. LogisticModelParameters is a wrapper class in Mahout's example distribution; it is used to set the parameters for training logistic regression and to return an instance of CsvRecordFactory:

LogisticModelParameters lmp = new LogisticModelParameters();
lmp.setTargetVariable("y");
lmp.setMaxTargetCategories(2);
lmp.setNumFeatures(20);
lmp.setUseBias(false);
lmp.setTypeMap(predictorList, typeList);
lmp.setLearningRate(0.5);
int passes = 50;

We set the target variable y to be used for training, the maximum number of target categories to 2 (yes and no), the number of features or columns in the data excluding the target variable (which is 20), and some other settings (which you will learn about later in this book). The passes variable has been given a value of 50, which means that the maximum number of iterations over the data will be 50.

The CsvRecordFactory class returns an object to parse the CSV file based on the parameters passed. The LogisticModelParameters class takes care of passing the parameters to the constructor of CsvRecordFactory. We use the class RandomAccessSparseVector to encode the data into vectors and train the model using lr.train(targetValue, input):

CsvRecordFactory csv = lmp.getCsvRecordFactory();
OnlineLogisticRegression lr = lmp.createRegression();
for (int pass = 0; pass < passes; pass++) {
    BufferedReader in = new BufferedReader(new FileReader(inputFile));

    // The first line is consumed by the record factory
    csv.firstLine(in.readLine());

    String line = in.readLine();
    int lineCount = 2;
    while (line != null) {
        // Encode the CSV line as a sparse feature vector
        Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
        int targetValue = csv.processLine(line, input);

        // Update the model with this training example
        lr.train(targetValue, input);

        line = in.readLine();
        lineCount++;
    }
    in.close();
}

The output of running the code is an equation denoting the logistic regression. Excerpts of the equation are copied here:

y ~ -97.230*age + -12.713*campaign + ...
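Once trained, the same model object can score unseen instances. The following fragment is a sketch that reuses the csv, lr, and lmp objects from the training code; heldOutLine is a hypothetical unseen CSV row, and for a two-category model classifyScalar returns the probability of the second target category:

// Sketch: score one unseen CSV line with the trained model
String heldOutLine = "42,services,married,...";      // hypothetical row (truncated)
Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
csv.processLine(heldOutLine, input);
double probability = lr.classifyScalar(input);       // P(second category, here "yes")
System.out.println("P(subscribed) = " + probability);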

You will learn about logistic regression, how to interpret the equation, and how to evaluate the results in detail in Chapter 4, Classification with Mahout.

Parallel versus in-memory execution mode

Mahout has both parallel and in-memory execution modes for many machine learning algorithms. In-memory execution can be used when the data size is small, or to quickly try out different algorithms without installing Hadoop. In-memory execution is restricted to one machine, whereas the parallel implementations are designed to run across multiple machines. Parallel execution is implemented over Hadoop using the MapReduce paradigm, and we invoke it from code via the algorithm's driver class, which runs the Hadoop MapReduce job; a sketch of this follows. Let's see which algorithms have single machine and parallel implementations. The algorithms are grouped by paradigm, such as collaborative filtering, classification, and so on. In each table, the first column is the name of the algorithm, the second column indicates whether it has a single machine implementation, and the third column indicates whether it has a parallel implementation.
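For instance, the k-means job that the example script launched earlier can also be started from Java through its driver class and Hadoop's ToolRunner. The following is a sketch only; the argument values are illustrative and mirror the command line flags used earlier in this chapter:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

public class KMeansJobLauncher {
    public static void main(String[] args) throws Exception {
        // Illustrative arguments; they mirror the flags of the 'mahout kmeans' command
        String[] jobArgs = {
            "-i", "/tmp/mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors",
            "-c", "/tmp/mahout-work/reuters-kmeans-clusters",
            "-o", "/tmp/mahout-work/reuters-kmeans",
            "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure",
            "-x", "10", "-k", "20", "-ow", "--clustering"
        };
        int exitCode = ToolRunner.run(new Configuration(), new KMeansDriver(), jobArgs);
        System.exit(exitCode);
    }
}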

The collaborative filtering table is as follows:

Algorithm                                                                  Single machine  Parallel
User-based collaborative filtering                                         Y               N
Item-based collaborative filtering                                         Y               Y
Matrix factorization with alternating least squares                        Y               Y
Matrix factorization with alternating least squares on implicit feedback   Y               Y
Weighted matrix factorization                                              Y               N

The classification table is as follows:

Algorithm                              Single machine  Parallel
Logistic regression                    Y               N
Naïve Bayes/Complementary naïve Bayes  N               Y
Random forest                          N               Y
Hidden Markov models                   Y               N
Multilayer perceptron                  Y               N

The clustering table is as follows:

Algorithm            Single machine  Parallel
Canopy clustering    Y               Y
k-means clustering   Y               Y
Fuzzy k-means        Y               Y
Streaming k-means    Y               Y
Spectral clustering  N               Y

The dimensionality reduction table is as follows:

Algorithm                     Single machine  Parallel
Singular value decomposition  Y               N
Lanczos algorithm             Y               Y
Stochastic SVD                Y               Y
Principal component analysis  Y               Y

The topic models table is as follows:

Algorithm                    Single machine  Parallel
Latent Dirichlet allocation  Y               Y

The miscellaneous table is as follows:

Algorithm                Single machine  Parallel
Frequent pattern mining  N               Y
RowSimilarityJob         N               Y
ConcatMatrices           N               Y
Collocations             N               Y

 

Summary


In this chapter, we discussed the guiding principle of Mahout and tried out some examples to get a hands-on feel of Mahout. We discussed why, when, and how to use Mahout and walked through the installation steps of the required tools and software. We then learned how to use Mahout from the command line and from the code, and finally concluded with a comparison between the parallel and the single-machine execution of Mahout.

This is the beginning of what will hopefully be an exciting journey. In the forthcoming chapters, we will discuss a lot of practical applications for Mahout. In the next chapter, we will discuss the core concepts of machine learning. A clear understanding of the concepts of different machine learning algorithms is of paramount importance for a successful career in data analytics.
