The Mahout core JAR files contain the implementations of the main machine learning classes, and the Mahout examples JAR file contains example code and wrappers built around the Mahout core classes. It is worth spending time going through the documentation to get an overall understanding. The documentation for the version you are using can be found in the Mahout installation directory.
We will now look at a Mahout code example. We will write a classification example in which we train an algorithm to predict whether a client has subscribed to a term deposit. Classification refers to the process of assigning a particular instance or row to a predefined category, called a class label. The purpose of the following example is to give you a feel for development using Mahout, Eclipse, and Maven.
We will use the bank-additional-full.csv file present in the learningApacheMahout/data/chapter4 directory as the input for our example. First, let's have a look at the structure of the data and try to understand it. The following table shows various input variables along with their data types:
Based on many attributes of the customer, we try to predict the target variable y (has the client subscribed to a term deposit?), which has a set of two predefined values, Yes and No. We need to remove the header line to use the data.
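Removing the header line can be done in many ways. The following is a minimal sketch in plain Java, not the code that ships with this book; the class name HeaderStripper and the command-line arguments are hypothetical, and it assumes the file is small enough to be read into memory:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Mahout or of the book's source code.
class HeaderStripper {

    // Returns every line except the first one (the header).
    static List<String> withoutHeader(List<String> lines) {
        return lines.isEmpty() ? lines : lines.subList(1, lines.size());
    }

    // Reads the input file, drops the header, and writes the rest to the output file.
    static void stripHeader(Path in, Path out) throws IOException {
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        Files.write(out, withoutHeader(lines), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage (paths are supplied by the caller): java HeaderStripper <input.csv> <output.csv>
        stripHeader(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing to a separate output file keeps the original data intact, so the step can be repeated safely.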
We will use logistic regression to build the model; logistic regression is a statistical technique that computes the probability of an unclassified item belonging to a predefined class.
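To make the idea of "computing a probability" concrete, here is a small sketch of the logistic function in plain Java. It is an illustration only, not Mahout's implementation; the class name, the weights, and the feature values are made up for the example:

```java
// Illustrative sketch of the logistic (sigmoid) function; not Mahout code.
class LogisticProbability {

    // Maps any real number z to a value between 0 and 1.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class for one row:
    // sigmoid(intercept + sum of weight[i] * feature[i]).
    static double probability(double[] weights, double intercept, double[] features) {
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // Made-up weights and feature values, purely for illustration.
        double[] weights = {0.8, -0.4};
        double[] features = {1.5, 2.0};
        System.out.println(probability(weights, 0.1, features));
    }
}
```

A row is typically assigned to the positive class when this probability exceeds a threshold such as 0.5.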
You might like to run the example using the source code that ships with this book; I will explain the important steps in the following section. In Eclipse, open the code file OnlineLogisticRegressionTrain.java from the package chapter4.logistic.src, which is present in the directory learningApacheMahout/src/main/java/chapter4/src/logistic in the code folder that comes with this book.
The first step is to identify the source and target folders:
Once we know where to get the data from, we need to tell the algorithm how to interpret the data. We pass the column names and the corresponding column types; here, n denotes a numeric column and w a categorical column of the data:
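The pairing of column names with type codes can be pictured as two parallel lists. The sketch below is plain Java and only illustrates the idea; it is not the book's listing. The class name ColumnTypes is hypothetical, and the four column names are an assumed, illustrative subset of the predictors, not the full list:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a column-name-to-type mapping; not Mahout code.
class ColumnTypes {

    // Pairs each predictor name with its type code: "n" = numeric, "w" = categorical.
    static Map<String, String> typeMap(List<String> predictors, List<String> types) {
        if (predictors.size() != types.size()) {
            throw new IllegalArgumentException("each predictor needs exactly one type code");
        }
        Map<String, String> map = new LinkedHashMap<>(); // keeps the column order
        for (int i = 0; i < predictors.size(); i++) {
            map.put(predictors.get(i), types.get(i));
        }
        return map;
    }

    public static void main(String[] args) {
        // Assumed subset of columns, for illustration only.
        List<String> predictors = Arrays.asList("age", "job", "marital", "duration");
        List<String> types = Arrays.asList("n", "w", "w", "n");
        System.out.println(typeMap(predictors, types));
    }
}
```

Keeping the two lists the same length is the important invariant: a missing type code would leave a column without an interpretation.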
Set the classifier parameters. LogisticModelParameters is a wrapper class, in Mahout's example distribution, used to set the parameters for training logistic regression and to return an instance of CsvRecordFactory:
We set the target variable y to be used for training, the maximum number of target categories to 2 (Yes, No), the number of features or columns in the data excluding the target variable (which is 20), and some other settings (which we will learn about later in this book). The number of passes has been set to 50, which means the algorithm makes at most 50 iterations over the data.
The CsvRecordFactory class returns an object to parse the CSV file based on the parameters passed. The LogisticModelParameters class takes care of passing the parameters to the constructor of CsvRecordFactory. We use the class RandomAccessSparseVector to encode the data into vectors and train the model using lr.train(targetValue, input):
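To build intuition for what a call such as lr.train(targetValue, input) does on each row, here is a simplified stand-in written in plain Java. It is not Mahout's OnlineLogisticRegression: it uses dense double[] arrays in place of RandomAccessSparseVector, a fixed learning rate, and no regularization. The class name, the toy data, and the learning rate of 0.5 are all assumptions made for this sketch; the method names train and classifyScalar merely echo the calls discussed in the text:

```java
// Simplified stand-in for an online logistic regression learner; not Mahout code.
class OnlineLogisticSketch {
    private final double[] weights;     // one weight per input column
    private final double learningRate;  // fixed step size (an assumption of this sketch)

    OnlineLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that the row belongs to class 1.
    double classifyScalar(double[] input) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * input[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step using a single row and its target (0 or 1).
    void train(int targetValue, double[] input) {
        double error = targetValue - classifyScalar(input);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * input[i];
        }
    }

    // Makes the given number of passes over the data, one row at a time.
    static OnlineLogisticSketch fit(double[][] rows, int[] targets, int passes, double learningRate) {
        OnlineLogisticSketch lr = new OnlineLogisticSketch(rows[0].length, learningRate);
        for (int pass = 0; pass < passes; pass++) {
            for (int r = 0; r < rows.length; r++) {
                lr.train(targets[r], rows[r]);
            }
        }
        return lr;
    }

    public static void main(String[] args) {
        // Toy data: column 0 is a constant bias term, column 1 is one numeric feature.
        double[][] rows = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};
        int[] targets = {0, 0, 1, 1};
        OnlineLogisticSketch lr = fit(rows, targets, 50, 0.5);
        System.out.println(lr.classifyScalar(new double[] {1, 1.5}));
    }
}
```

The model is updated after every row rather than after the whole dataset, which is what makes the learner "online"; repeated passes over the data simply give it more opportunities to refine the weights.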
The output after running the code would be an equation denoting the logistic regression. Excerpts of the equation are copied here:
You will learn about logistic regression, how to interpret the equation, and how to evaluate the results in detail in Chapter 4, Classification with Mahout.