
Learning Apache Mahout

Chapter 1. Introduction to Mahout

Mahout is an open source machine learning library from Apache. Mahout primarily implements clustering, recommender engines (collaborative filtering), classification, and dimensionality reduction algorithms but is not limited to these.

The aim of Mahout is to provide a scalable implementation of commonly used machine learning algorithms. Mahout is the machine learning tool of choice if the data to be used is large. What we generally mean by large is that the data cannot be processed on a single machine. With Big Data becoming an important focus area, Mahout fulfils the need for a machine learning tool that can scale beyond a single machine. The focus on scalability differentiates Mahout from other tools such as R, Weka, and so on.

The learning implementations in Mahout are written in Java, and major portions, though not all, are built upon Apache's Hadoop distributed computation project using the MapReduce paradigm. Efforts are underway to build Mahout on Apache Spark using a Scala DSL. Programs written in the Scala DSL will be automatically optimized and executed in parallel on Apache Spark. Commits of new MapReduce algorithms have been stopped, but the existing MapReduce implementations will continue to be supported.

The purpose of this chapter is to understand the fundamental concepts behind Mahout. In particular, we will cover the following topics:

  • Why Mahout

  • When Mahout

  • How Mahout

Why Mahout


We already have many good open source machine learning software tools. The statistical language R has a very large community, good IDE support, and a large collection of machine learning packages already implemented. Python has a strong community and is multipurpose, and in Java we have Weka.

So what is the need for a new machine learning framework?

The answer lies in the scale of data. Organizations are generating terabytes of data daily and there is a need for a machine learning framework that can process that amount of data.

That begs the question: can't we just sample the data and use existing tools for our analytics use cases?

Simple techniques and more data is better

Collecting and processing data is much easier today than, say, a decade ago. IT infrastructure has seen enormous improvement; ETL tools, clickstream providers such as Google Analytics, and stream processing frameworks such as Kafka and Storm have made collecting data much easier. Platforms such as Hadoop and Cassandra, and MPP databases such as Teradata, have made storing and processing huge amounts of data much easier than before. From a large-scale production algorithm standpoint, we have seen that simpler algorithms on very large amounts of data produce reasonably good results.

Sampling is difficult

Sampling may lead to over-fitting and increases the complexity of preparing the data to build models for the problem at hand. Though sampling simplifies things by allowing scientists to work on a small sample instead of the whole population, and helps existing tools such as R scale up to the task, getting a representative sample is tricky.

I'd say when you have the choice of getting more data, take it. Never discard data. Throw more (commodity) hardware at the data by using platforms and tools such as Hadoop and Mahout.

Community and license

Another advantage of Mahout is its license. Mahout is Apache licensed, which means that you can incorporate pieces of it into your own software regardless of whether you want to release your source code. However, other ML software, such as Weka, is under the GPL license, which means that incorporating it into your software forces you to release the source code of any software you package with Weka components.

When Mahout


We have discussed the advantages of using Mahout; let's now discuss the scenarios where using Mahout is a good choice.

Data too large for single machine

If the data is too large to process on a single machine, then it is a good starting point to think about a distributed system. Rather than scaling up by buying bigger hardware, it can be a better option to scale out: buy more machines and distribute the processing.

Data already on Hadoop

A lot of enterprises have adopted Hadoop as their Big Data platform and have used it to store and aggregate data. Mahout has been designed to run algorithms on top of Hadoop and has a relatively straightforward configuration.

If your data or the bulk of it is already on Hadoop, then Mahout is a natural choice to run machine learning algorithms.

Algorithms implemented in Mahout

Do check whether the use case that needs to be implemented has a corresponding algorithm implemented in Mahout, or whether you have the required expertise to extend Mahout and implement your own algorithms.

How Mahout


In this section, you will learn how to install and configure Mahout.

Setting up the development environment

For any development work involving Mahout, and to follow the examples in this book, you will require the following setup:

  • Java 1.6 or higher

  • Maven 2.0 or higher

  • Hadoop 1.2.1

  • Eclipse with Maven plugin

  • Mahout 0.9

I prefer to try out the latest version, barring known compatibility issues. To configure Hadoop, follow the instructions at http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html. Here, we will focus on configuring Maven, Eclipse with the Maven plugin, and Mahout.
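
Before proceeding, it is worth sanity-checking what is already installed. These commands assume each tool is already on your PATH; the exact version strings will vary per installation:

java -version
hadoop version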

Configuring Maven

Maven can be downloaded from one of the mirrors of the Apache website http://maven.apache.org/download.cgi. We use Apache Maven 3.2.5, which can be downloaded and installed using these commands:

wget http://apache.mirrors.tds.net/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
cd /usr/local
sudo tar xzf $HOME/Downloads/apache-maven-3.2.5-bin.tar.gz
sudo mv apache-maven-3.2.5 maven
sudo chown -R $USER maven

Configuring Mahout

Mahout can be configured to run with or without Hadoop. Efforts are currently underway to port Mahout to Apache Spark, but that work is at a nascent stage. We will discuss Mahout on Spark in Chapter 8, New Paradigm in Mahout. In this chapter, you are going to learn how to configure Mahout on top of Hadoop.

We will have two configurations for Mahout. The first we will use for practicing Mahout's command line examples; the other, compiled from source, will be used to develop Mahout code using the Java API and Eclipse.

Though we can use one Mahout configuration, I will take this opportunity to discuss both approaches.

Download the latest Mahout version using one of the mirrors listed at the Apache Mahout website https://mahout.apache.org/general/downloads.html. The current release version is mahout-distribution-0.9.tar.gz. After the download completes, the archive should be in the Downloads folder under the user's home directory. Type the following on the command line. The first command moves the shell prompt to the /usr/local directory:

cd /usr/local

Extract the downloaded archive to the mahout-distribution-0.9 directory under the /usr/local directory. The tar command extracts the archive:

sudo tar xzf $HOME/Downloads/mahout-distribution-0.9.tar.gz

The third command mv renames the directory from mahout-distribution-0.9 to mahout:

sudo mv mahout-distribution-0.9 mahout

The last command, chown, changes the ownership of the directory from the root user to the current user. The Linux command chown is used for changing the ownership of files and directories. The argument -R instructs chown to recursively change the ownership of subdirectories, and $USER holds the logged-in user's username:

sudo chown -R $USER mahout

We need to update the .bashrc file to export the required variables and update the $PATH variable:

cd $HOME
vi .bashrc

At the end of the file, add the following statements:

#Statements related to Mahout
export MAVEN_HOME=/usr/local/maven
export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAVEN_HOME/bin:$MAHOUT_HOME/bin
###end of mahout statements

Exit from all existing terminals, start a new terminal, and enter the following command:

echo $PATH

Check whether the output includes the Maven and Mahout paths you just added.

Type the following commands on the command line; both commands should be recognized:

mvn --version
mahout

Configuring Eclipse with the Maven plugin and Mahout

Download Eclipse from the Eclipse mirror mentioned on the home page. We have used Eclipse Kepler SR2 for this book. The downloaded archive should be in the Downloads folder of the user's home directory. Open a terminal and enter the following command:

cd /usr/local
sudo tar xzf $HOME/Downloads/eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz
sudo chown -R $USER eclipse

Go into the Eclipse directory and launch the Eclipse GUI. We will now install the Maven plugin. Click on Help, then Eclipse Marketplace, and in the search panel type m2e and search. Once the search results are displayed, hit Install and follow the instructions. To complete the installation, hit the Next button and press the Accept button whenever prompted. Once the installation is done, Eclipse will prompt for a restart. Hit OK and let Eclipse restart.

Now, to add the Mahout dependencies to any Maven project, we add the following to the project's pom.xml file:

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-examples</artifactId>
      <version>0.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.9</version>
    </dependency>
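
For context, here is a minimal pom.xml skeleton into which the preceding dependency block fits; the project's own groupId, artifactId, and version shown below are placeholders, not values from the book:

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example</groupId><!-- placeholder -->
      <artifactId>mahout-sandbox</artifactId><!-- placeholder -->
      <version>1.0-SNAPSHOT</version>
      <dependencies>
        <!-- the mahout-core, mahout-examples, and mahout-math dependencies shown above go here -->
      </dependencies>
    </project>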

Eclipse will download and add all the dependencies.

Now we should import this book's code repository into Eclipse. Open Eclipse and follow this sequence of steps. The pom.xml file has all the dependencies included, and Eclipse will download and resolve them.

Go to File | Import | Maven | Existing Maven Projects | Next | Browse to the location of the source folder that comes with this book | Finish.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Mahout command line

Mahout provides an option for the command line execution of machine learning algorithms. Using the command line, an initial prototype of the model can be built quickly.

A few command line examples are discussed below. A great place to start is to go through Mahout's example scripts; they are located in the examples/bin folder under the Mahout home folder:

cd $MAHOUT_HOME
cd examples/bin
ls -ltr

Open the file README.txt in the vi editor and read the descriptions of the scripts; we will be discussing them in the subsequent sections of this chapter:

vi README.txt

Tip

It is a good idea to try out a few command line Mahout algorithms before writing Mahout Java code. This way we can shortlist a few algorithms that might work on the given data and problem, and save a lot of time.

A clustering example

In this section, we will discuss the command line implementation of clustering in Mahout and use the example script as reference.

On the terminal please type:

vi cluster-reuters.sh

This script clusters the Reuters dataset using a variety of algorithms. It downloads the dataset automatically, parses and copies it to HDFS (Hadoop Distributed File System), and based upon user input, runs the corresponding clustering algorithm.

On the vi terminal type the command:

:set number

This will display line numbers for the lines in the file. The algorithms implemented are kmeans, fuzzykmeans, lda, and streamingkmeans; line 42 of the script has the list of all the algorithms implemented:

algorithm=( kmeans fuzzykmeans lda streamingkmeans ) # A list of all algorithms implemented in the script

Input is taken from the user on line 51 by the read statement:

read -p "Enter your choice : " choice

Line 57 sets the temp working directory variable:

WORK_DIR=/tmp/mahout-work-${USER}

On line 79, the curl statement downloads the Reuters data to the working directory; lines 70 to 73 first check whether the file is already present there:

curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ${WORK_DIR}/reuters21578.tar.gz

From line 89, the Reuters tar is extracted to the reuters-sgm folder under the working directory:

tar xzf ${WORK_DIR}/reuters21578.tar.gz -C ${WORK_DIR}/reuters-sgm

Reuters raw data file

Let's have a look at one of the raw files. Open the reut2-000.sgm file in a text editor such as vi or gedit.

The Reuters data is distributed in 22 files, each of which contains 1,000 documents, except for the last (reut2-021.sgm), which contains 578 documents. The files are in the SGML (standard generalized markup language) format, which is similar to XML. The SGML file needs to be parsed.
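
To give a flavor of the format, here is an abridged, illustrative sketch of a single document following the Reuters-21578 SGML structure; it is not a verbatim copy of the file, and the attributes shown are indicative only:

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in the Bahia cocoa zone ...
</BODY>
</REUTERS>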

On line 93, the Reuters data is parsed using Lucene. Lucene has built-in classes and functions to process different file formats. The logic of parsing the Reuters dataset is implemented in the ExtractReuters class. The SGML file is parsed and the text elements are extracted from it.

Tip

Apache Lucene is a free/open source information retrieval software library.

We will use the ExtractReuters class to extract the sgm files into text format.

$MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out

Now let's look at the processed Reuters files: for each document, the ExtractReuters class writes out a plain text file containing the text extracted from the sgm files we saw previously.

On lines 95 to 101, data is loaded from a local directory to HDFS, deleting the reuters-sgm and reuters-out folders if they already exist:

  echo "Copying Reuters data to Hadoop"
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
  $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out

On line 105, the files are converted into sequence files. Mahout works with sequence files.

Tip

Sequence files are the standard input of Mahout machine learning algorithms.

$MAHOUT seqdirectory -i ${WORK_DIR}/reuters-out -o ${WORK_DIR}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential
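
Sequence files are binary key-value files; for seqdirectory output, the key is the document name and the value is the document text. If you want to peek inside one, here is a minimal, hedged Java sketch using the Hadoop 1.x SequenceFile.Reader API; the chunk file name passed as an argument is an assumption, so check the actual output directory:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDirInspector {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Path to one chunk of the seqdirectory output, for example
        // /tmp/mahout-work-<user>/reuters-out-seqdir/chunk-0
        Path path = new Path(args[0]);
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();   // document name
            Text value = new Text(); // document contents
            int shown = 0;
            while (reader.next(key, value) && shown++ < 3) {
                String text = value.toString();
                // Print the document name and the first 80 characters of its text
                System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
            }
        } finally {
            reader.close();
        }
    }
}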

On lines 109 to 111, the sequence files are converted to a vector representation. Text needs to be converted into vectors so that a machine learning algorithm can process it. We will talk about text vectorization in detail in Chapter 10, Case Study – Text Analytics.

$MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir/ \
  -o ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

From here on, we will only explain the k-means algorithm execution; we encourage you to read and understand the other three implementations too. A detailed discussion of clustering will be covered in Chapter 7, Clustering with Mahout.

Clustering is the process of partitioning a bunch of data points into related groups called clusters. K-means clustering partitions a dataset into a specified number of clusters by minimizing the distance between each data point and the center of the cluster using a distance metric. A distance metric is a way to define how far or near a data point is from another. K-means requires users to provide the number of clusters and optionally user-defined cluster centroids.
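
Formally, for a chosen distance measure d, k-means seeks the assignment of points to clusters C_1, ..., C_k that minimizes the within-cluster sum of squared distances:

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, \mu_j)^2

where \mu_j is the centroid (mean) of cluster C_j.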

To better understand how data points are clustered together, picture a scatter of points forming three distinct clusters: points that are near one another are grouped together into the same cluster. A few points may not belong to any cluster; those points represent outliers and should be removed prior to clustering.

The main command line parameters for k-means clustering are described below:

--input (-i): This is the path to the job input directory.

--clusters (-c): These are the input centroids and they must be a SequenceFile of type Writable or Cluster/Canopy. If -k is also specified, then a random set of vectors will be selected and written out to this path first.

--output (-o): This is the directory pathname for the output.

--distanceMeasure (-dm): This is the class name of DistanceMeasure; the default is SquaredEuclidean.

--convergenceDelta: This is the convergence delta value; the default is 0.5.

--maxIter (-x): This is the maximum number of iterations.

--maxRed (-r): This is the number of reduce tasks; this defaults to 2.

--k (-k): This is the k in k-means. If specified, then a random selection of k vectors will be chosen as the centroids and written to the clusters input path.

--overwrite (-ow): If this is present, overwrite the output directory before running the job.

--help (-h): This prints out help.

--clustering (-cl): If this is present, run clustering after the iterations have taken place.

Lines 113 to 118 take the sparse matrix and run the k-means clustering algorithm using the cosine distance measure. We pass -k, the number of clusters, as 20 and -x, the maximum number of iterations, as 10:

$MAHOUT kmeans \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
  -c ${WORK_DIR}/reuters-kmeans-clusters \
  -o ${WORK_DIR}/reuters-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -k 20 -ow --clustering

Lines 120 to 125 use the clusterdump utility to read the clusters in sequence file format and convert them to text files:

$MAHOUT clusterdump \
  -i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
  -o ${WORK_DIR}/reuters-kmeans/clusterdump \
  -d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
  --pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints \
&& \
cat ${WORK_DIR}/reuters-kmeans/clusterdump

The clusterdump utility outputs the center of each cluster and the top terms in the cluster.

A classification example

In this section, we will discuss the command line implementation of classification in Mahout and use the example script as a reference.

Classification is the task of identifying which of a set of predefined classes a data point belongs to. Classification involves training a model with a labeled (previously classified) dataset and then predicting new unlabeled data using that model. The common workflow for a classification problem is:

  1. Data preparation

  2. Train model

  3. Test model

  4. Performance measurement

These steps are repeated until the desired performance is achieved, the best possible solution is reached, or the project's time is up.

On the terminal, please type:

vi classify-20newsgroups.sh

On the vi terminal, type the following command to show the line numbers for lines in the script:

:set number

The algorithms implemented in the script are cnaivebayes, naivebayes, and sgd, plus a last option, clean, which cleans up the work directory.

Line 44 creates a working directory for the dataset and all input/output:

export WORK_DIR=/tmp/mahout-work-${USER}

Lines 64 to 74 download and extract the 20news-bydate.tar.gz file after making sure it is not already downloaded:

  if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
    if [ ! -e ${WORK_DIR}/20news-bydate ]; then
      if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
        echo "Downloading 20news-bydate"
        curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
      fi
      mkdir -p ${WORK_DIR}/20news-bydate
      echo "Extracting..."
      cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
    fi
  fi

The 20 newsgroups dataset consists of messages, one per file. Each file begins with header lines that specify things such as who sent the message, how long it is, what kind of software was used, and the subject. A blank line follows, and then the message body follows as unformatted text.

Lines 90 to 101 prepare the directory and copy the data to the Hadoop directory:

  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

  if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
    echo "Copying 20newsgroups data to HDFS"
    set +e
    $HADOOP dfs -rmr ${WORK_DIR}/20news-all
    set -e
    $HADOOP dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  fi


Lines 104 to 106 convert the full 20 newsgroups dataset into sequence files:

mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow

Lines 109 to 111 convert the sequence files to vectors, calculating term frequency and inverse document frequency (tf-idf), a way of representing text numerically:

./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv  -wt tfidf
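
For reference, the tf-idf weight of a term t in a document d is the product of its term frequency and its inverse document frequency, where N is the number of documents in the corpus and df_t is the number of documents that contain t:

w_{t,d} = tf_{t,d} \times \log \frac{N}{df_t}

Terms that are frequent in a document but rare across the corpus therefore get the highest weights.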

Lines 114 to 118 split the preprocessed dataset into training and test sets; --randomSelectionPct 40 sends 40 percent of the records to the test set. The test set will be used to test the performance of the model trained using the training set:

./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors  \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

Lines 120 to 125 train the classifier using the training sets:

./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c

Lines 129 to 133 test the classifier using the test sets:

./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c

Mahout API – a Java program example

Though using Mahout from the command line is convenient and fast, and serves the purpose in many scenarios, learning the Mahout API is important too. Using the API gives you more flexibility when creating your machine learning application, and not all algorithms can be easily called from the command line. Working with the Mahout API also helps in understanding the internals of a machine learning algorithm.

The Mahout core JAR has the implementations of the main machine learning classes, and the Mahout examples JAR has example code and wrappers built around the Mahout core classes. It is worth spending time going through the documentation to get an overall understanding. The documentation for the version you are using can be found in the Mahout installation directory.

We will now look at a Mahout code example. We will write a classification example in which we train an algorithm to predict whether a client has subscribed to a term deposit. Classification refers to the process of labeling a particular instance or row with a particular predefined category, called a class label. The purpose of the following example is to give you a feel for development using Mahout, Eclipse, and Maven.

The dataset

We will use the bank-additional-full.csv file present in the learningApacheMahout/data/chapter4 directory as the input for our example. First, let's have a look at the structure of the data and try to understand it. The following table shows various input variables along with their data types:

Column Name      Description                                                       Variable Type
age              Age of the client                                                 Numeric
job              Type of job, for example, entrepreneur, housemaid, or management  Categorical
marital          Marital status                                                    Categorical
education        Education level                                                   Categorical
default          Has the client defaulted on credit?                               Categorical
housing          Does the client have a housing loan?                              Categorical
loan             Does the client have a personal loan?                             Categorical
contact          Contact communication type                                        Categorical
month            Last contact month of the year                                    Categorical
day_of_week      Last contact day of the week                                      Categorical
duration         Last contact duration, in seconds                                 Numeric
campaign         Number of contacts                                                Numeric
pdays            Number of days that passed since last contact                     Numeric
previous         Number of contacts performed before this campaign                 Numeric
poutcome         Outcome of the previous marketing campaign                        Categorical
emp.var.rate     Employment variation rate, quarterly indicator                    Numeric
cons.price.idx   Consumer price index, monthly indicator                           Numeric
cons.conf.idx    Consumer confidence index, monthly indicator                      Numeric
euribor3m        Euribor 3 month rate, daily indicator                             Numeric
nr.employed      Number of employees, quarterly indicator                          Numeric
y                Has the client subscribed to a term deposit?                      Categorical (target)

Based on the many attributes of the customer, we try to predict the target variable y (has the client subscribed to a term deposit?), which has two predefined values, yes and no. We need to remove the header line to use the data.
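
One way to drop the header line is shown below; this is a sketch assuming standard Unix tools and the book's directory layout, so adjust the paths to match your checkout:

tail -n +2 data/chapter4/bank-additional-full.csv > data/chapter4/train_data/input_bank_data.csv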

We will use logistic regression to build the model; logistic regression is a statistical technique that computes the probability of an unclassified item belonging to a predefined class.
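
Concretely, for a feature vector x and a learned weight vector \beta, logistic regression models the probability of the positive class as:

P(y = 1 \mid x) = \frac{1}{1 + e^{-\beta^{\top} x}}

A probability above a chosen threshold (typically 0.5) is read as a "yes" prediction.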

You might like to run the example using the source code that ships with this book; I will explain the important steps in the following section. In Eclipse, open the code file OnlineLogisticRegressionTrain.java from the package chapter4.logistic.src, which is present in the directory learningApacheMahout/src/main/java/chapter4/src/logistic in the code folder that comes with this book.

The first step is to identify the source and target folders:

String inputFile = "data/chapter4/train_data/input_bank_data.csv";
String outputFile = "data/chapter4/model";

Once we know where to get the data from, we need to tell the algorithm how to interpret the data. We pass the column names and the corresponding column types; here, n denotes a numeric column and w a categorical column:

List<String> predictorList =Arrays.asList("age","job","marital","education","default","housing","loan","contact","month","day_of_week","duration","campaign","pdays","previous","poutcome","emp.var.rate","cons.price.idx","cons.conf.idx","euribor3m","nr.employed");

List<String> typeList = Arrays.asList("n","w","w","w","w","w","w","w","w","w","n","n","n","n","w","n","n","n","n","n");

Set the classifier parameters. LogisticModelParameters is a wrapper class in Mahout's example distribution; it is used to set the parameters for training logistic regression and to return an instance of CsvRecordFactory:

LogisticModelParameters lmp = new LogisticModelParameters();
lmp.setTargetVariable("y");
lmp.setMaxTargetCategories(2);
lmp.setNumFeatures(20);
lmp.setUseBias(false);
lmp.setTypeMap(predictorList, typeList);
lmp.setLearningRate(0.5);
int passes = 50;

We set the target variable y to be used for training, the maximum number of target categories to 2 (yes, no), the number of features or columns in the data excluding the target variable (which is 20), and some other settings (which we will learn about later in this book). The variable passes has been given a value of 50, which means that the maximum number of iterations over the data will be 50.

The CsvRecordFactory class returns an object to parse the CSV file based on the parameters passed. The LogisticModelParameters class takes care of passing the parameters to the constructor of CsvRecordFactory. We use the class RandomAccessSparseVector to encode the data into vectors and train the model using lr.train(targetValue, input):

CsvRecordFactory csv = lmp.getCsvRecordFactory();
lr = lmp.createRegression();
for (int pass = 0; pass < passes; pass++) {
    BufferedReader in = new BufferedReader(new FileReader(inputFile));

    // The first line is the CSV header; the factory uses it to map column names to positions
    csv.firstLine(in.readLine());

    String line = in.readLine();
    while (line != null) {
        // Encode one CSV row as a sparse vector; processLine returns the target value
        Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
        int targetValue = csv.processLine(line, input);

        // Update the model with this (target, features) example
        lr.train(targetValue, input);

        line = in.readLine();
    }
    in.close();
}
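
Finally, the trained model can be persisted to the outputFile path declared earlier. A minimal sketch, assuming the saveTo method provided by LogisticModelParameters in Mahout's examples module:

OutputStream modelOutput = new FileOutputStream(outputFile);
try {
    lmp.saveTo(modelOutput); // writes the parameters together with the trained model
} finally {
    modelOutput.close();
}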

The output after running the code would be an equation denoting the logistic regression. Excerpts of the equation are copied here:

y ~ -97.230*age + -12.713*campaign + . . .

You will learn about logistic regression, how to interpret the equation, and how to evaluate the results in detail in Chapter 4, Classification with Mahout.

Parallel versus in-memory execution mode

Mahout has both parallel and in-memory execution for many machine learning algorithms. In-memory execution can be used when the data size is smaller, or to try out different algorithms quickly without installing Hadoop. In-memory execution is restricted to one machine, whereas the parallel implementations are designed to run on different machines. Parallel execution is implemented over Hadoop using the MapReduce paradigm, and for parallel execution we call the code via the driver class to run the Hadoop MapReduce job. Let's see which algorithms have single machine and parallel execution. We have grouped the algorithms according to paradigm, such as collaborative filtering, classification, and so on. In each of the following tables, the first column is the name of the algorithm, the second column indicates whether the algorithm has a single machine implementation, and the third column indicates whether it has a parallel implementation.

The collaborative filtering table is as follows:

Algorithm                                                                   Single machine   Parallel
User-based collaborative filtering                                          Y                N
Item-based collaborative filtering                                          Y                Y
Matrix factorization with alternating least squares                         Y                Y
Matrix factorization with alternating least squares on implicit feedback    Y                Y
Weighted matrix factorization                                               Y                N

The classification table is as follows:

Algorithm                                Single machine   Parallel
Logistic regression                      Y                N
Naïve Bayes/Complementary naïve Bayes    N                Y
Random forest                            N                Y
Hidden Markov models                     Y                N
Multilayer perceptron                    Y                N

The clustering table is as follows:

Algorithm              Single machine   Parallel
Canopy clustering      Y                Y
k-means clustering     Y                Y
Fuzzy k-means          Y                Y
Streaming k-means      Y                Y
Spectral clustering    N                Y

The dimensionality reduction table is as follows:

Algorithm                       Single machine   Parallel
Singular value decomposition    Y                N
Lanczos algorithm               Y                Y
Stochastic SVD                  Y                Y
Principal component analysis    Y                Y

The topic models table is as follows:

Algorithm                      Single machine   Parallel
Latent Dirichlet allocation    Y                Y

The miscellaneous table is as follows:

Algorithm                  Single machine   Parallel
Frequent pattern mining    N                Y
RowSimilarityJob           N                Y
ConcatMatrices             N                Y
Collocations               N                Y

Summary


In this chapter, we discussed the guiding principle of Mahout and tried out some examples to get a hands-on feel of Mahout. We discussed why, when, and how to use Mahout and walked through the installation steps of the required tools and software. We then learned how to use Mahout from the command line and from the code, and finally concluded with a comparison between the parallel and the single-machine execution of Mahout.

This is the beginning of what will hopefully be an exciting journey. In the forthcoming chapters, we will discuss a lot of practical applications for Mahout. In the next chapter, we will discuss the core concepts of machine learning. A clear understanding of the concepts of different machine learning algorithms is of paramount importance for a successful career in data analytics.


Table of Contents

Preface
1. Introduction to Mahout
2. Core Concepts in Machine Learning
3. Feature Engineering
4. Classification with Mahout
5. Frequent Pattern Mining and Topic Modeling
6. Recommendation with Mahout
7. Clustering with Mahout
8. New Paradigm in Mahout
9. Case Study – Churn Analytics and Customer Segmentation
10. Case Study – Text Analytics
Index
