Implementing the Naïve Bayes classifier in Mahout

by Piero Giacomelli | December 2013 | Cookbooks Open Source

In this article written by Piero Giacomelli, the author of the book Apache Mahout Cookbook, we will implement the Naïve Bayes classifier for creating clusters and aggregating unstructured information in a manageable way.

We will cover the following recipes in this article:

  • Using the Mahout text classifier to demonstrate the basic use case
  • Using the Naïve Bayes classifier from code
  • Using Complementary Naïve Bayes from the command line
  • Coding the Complementary Naïve Bayes classifier


Thomas Bayes was a Presbyterian minister whose essay on probability was only published posthumously, in the second half of the eighteenth century. The interesting fact is that we had to wait roughly a century, and the arrival of Boolean calculus, before Bayes' work came to light in the scientific community.

The corpus of Bayes' study was conditional probability. Without going too deep into mathematical theory, we can define conditional probability as the probability of an event given that another event has occurred.
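In symbols, for two events A and B, Bayes' rule relates the two conditional probabilities:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

This is the relation that the Naïve Bayes classifier exploits: it estimates the probability of a category given the observed attributes from the easier-to-measure probability of the attributes given the category.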

In this article, we are dealing with a particular type of algorithm: a classifier. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. For example, consider the following table:

Outlook  | Temperature (numeric) | Temperature (nominal) | Humidity (numeric) | Humidity (nominal) | Windy | Play
Overcast | 83 | Hot  | 86 | High   | FALSE | Yes
Overcast | 64 | Cool | 65 | Normal | TRUE  | Yes
Overcast | 72 | Mild | 90 | High   | TRUE  | Yes
Overcast | 81 | Hot  | 75 | Normal | FALSE | Yes
Rainy    | 70 | Mild | 96 | High   | FALSE | Yes
Rainy    | 68 | Cool | 80 | Normal | FALSE | Yes
Rainy    | 65 | Cool | 70 | Normal | TRUE  | No
Rainy    | 75 | Mild | 80 | Normal | FALSE | Yes
Rainy    | 71 | Mild | 91 | High   | TRUE  | No
Sunny    | 85 | Hot  | 85 | High   | FALSE | No
Sunny    | 80 | Hot  | 90 | High   | TRUE  | No
Sunny    | 72 | Mild | 95 | High   | FALSE | No
Sunny    | 69 | Cool | 70 | Normal | FALSE | Yes
Sunny    | 75 | Mild | 70 | Normal | TRUE  | Yes
The table is composed of 14 observations described by 7 different attributes: outlook, temperature (numeric), temperature (nominal), humidity (numeric), humidity (nominal), windy, and play. The classifier uses some of the observations to train the algorithm and the rest to test it, so that it can make a decision for a new observation that is not contained in the original dataset.

There are many types of classifiers that can do this kind of job. Classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier relies on the assumption that the attributes of an observation are independent of one another, given the category the observation belongs to.
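Written as a formula, for an observation with attribute values x_1, ..., x_n and a candidate category C, the independence assumption lets the classifier compute:

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

and then pick the category C with the highest value; each factor P(x_i | C) can be estimated from the training data independently of the other attributes.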

Other types of classifiers present in Mahout are logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information.

This page is updated with the algorithm type, its actual integration status in Mahout, and other useful information. Moving on, we could describe the Naïve Bayes algorithm as a classification algorithm that uses conditional probability to transform an initial set of weights into a weight matrix whose entries (row by column) give the probability that one weight is associated with another. In this article's recipes, we will use the same algorithm provided by the Mahout example source code, which uses the Naïve Bayes classifier to find the relation between the words of a set of documents.

Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that, once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to teach the algorithm the relations it needs to find. The testing set is used to check the algorithm on input it has not seen during training. Let us now get a first-hand taste of how to use the Naïve Bayes classifier.

Using the Mahout text classifier to demonstrate the basic use case

The Mahout binaries contain ready-to-use scripts for working with a classical dataset, and we will use this dataset for both testing and coding. Basically, the recipe is nothing more than running the ready-to-use Mahout scripts with the correct parameters and path settings. It describes how to transform raw text files into the weight vectors that the Naïve Bayes algorithm needs to create its model.

The steps involved are the following:

  • Converting the raw text file into a sequence file
  • Creating vector files from the sequence files
  • Creating our working vectors

Getting ready

The first step is to download the datasets. The dataset is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.

For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2.

The dataset contains posts from 20 newsgroups dumped into text files for machine learning purposes. We could also have used other documents for testing; we will suggest how to do this later in the recipe.

Before proceeding, we need to set up a working folder in which to decompress the original archive; this gives us shorter commands whenever we need the full path of the folder.

In our case, the working folder is /mnt/new; so, our working folder's command-line variables will be set using the following command:

export WORK_DIR=/mnt/new/

You can create a new folder and change the WORK_DIR bash variable accordingly.

Do not forget that to have these examples running, you need to run the various commands with a user that has the HADOOP_HOME and MAHOUT_HOME variables in its path.

To download the dataset, we only need to open up a terminal console and give the following command:

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Once your working dataset is downloaded, decompress it using the following command:

tar -xvzf 20news-bydate.tar.gz

You should see the folder structure as shown in the following screenshot:

The second step is to transform the whole set of input files into Hadoop sequence files. To do this, we first merge the two folders into a single one. This is only a pedagogical simplification: if you have multiple folders containing input texts, you can parse them separately by invoking the command multiple times. Using the console, we can group the files together by giving the following commands in sequence:

rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Now, we should have our input folder, which is the 20news-all folder, ready to be used:

The following screenshot shows a bunch of files, all in the same folder:

By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows:

From: xxx
Subject: yyyyy
Organization: zzzz
X-Newsreader: rusnews v1.02
Lines: 50

jaeger@xxx (xxx) writes:
>In article xxx writes:
>>zzzz "How BCCI adapted the Koran rules of banking". The
>>Times. August 13, 1991.
>
> So, let's see. If some guy writes a piece with a title that implies
> something is the case then it must be so, is that it?

We have obviously removed the e-mail addresses, but you can open this file to see its content. For each of the 20 newsgroups present in the dataset, we have a number of files, each of them containing a single post to that newsgroup, without any categorization.

Following our initial tasks, we need to now transform all these files into Hadoop sequence files. To do this, you need to just type the following command:

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

This command takes every file contained in the 20news-all folder and transforms it into a sequence file. As you can see, the number of generated sequence files does not correspond one to one with the number of input files: in our case, the 15417 original text files produce just one chunk-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to explore the different parameters and their uses by invoking the following command:

./mahout seqdirectory --help

The following table describes the various options that can be used with the seqdirectory command:

Parameter | Description
--input (-i) input | This gives the path to the job input directory.
--output (-o) output | The directory pathname for the output.
--overwrite (-ow) | If present, overwrite the output directory before running the job.
--method (-xm) method | The execution method to use: sequential or mapreduce. The default is mapreduce.
--chunkSize (-chunk) chunkSize | The chunk size in megabytes. The default is 64 MB.
--fileFilterClass (-filter) fileFilterClass | The name of the class to use for file parsing. The default is org.apache.mahout.text.PrefixAdditionFilter.
--keyPrefix (-prefix) keyPrefix | The prefix to be prepended to the key of the sequence file.
--charset (-c) charset | The name of the character encoding of the input files. The default is UTF-8.
--help (-h) | Prints the help menu to the command console.
--tempDir tempDir | If specified, tells Mahout to use this as a temporary folder.
--startPhase startPhase | Defines the first phase that needs to be run.
--endPhase endPhase | Defines the last phase that needs to be run.

To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunk-0 file, you can type in the following command:

hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more

In our case, the result is as follows:

/67399
From: xxx
Subject: Re: Imake-TeX: looking for beta testers
Organization: CS Department, Dortmund University, Germany
Lines: 59
Distribution: world
NNTP-Posting-Host: tommy.informatik.uni-dortmund.de

In article <xxxxx>, yyy writes:
|> As I announced at the X Technical Conference in January, I would like
|> to
|> make Imake-TeX, the Imake support for using the TeX typesetting system,
|> publically available. Currently Imake-TeX is in beta test here at the
|> computer science department of Dortmund University, and I am looking
|> for
|> some more beta testers, preferably with different TeX and Imake
|> installations.

The Hadoop command is pretty simple, and the syntax is as follows:

hadoop fs -text <input file>

In the preceding syntax, <input file> is the sequence file whose content you want to see. Our sequence files have been created but, so far, no analysis of the words and the text itself has taken place. The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weighted vectors associated with the original documents. So now, we need to transform the raw text into vectors of weights and frequencies. To do this, we type in the following command:

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

The following command parameters are described briefly:

  • The -lnorm parameter instructs the vector to use the L_2 norm as a distance
  • The -nv parameter is an optional parameter that outputs the vector as namedVector
  • The -wt parameter instructs which weight function needs to be used

This step ends the data-preparation process. We now have the weight vector files created and ready to be used by the Naïve Bayes algorithm. We will come back to this last step later, as it is where the algorithm can be tuned to improve the performance of the Naïve Bayes classifier.

How to do it…

Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier.

To avoid this, you need to divide the vector files into two sets, using what is commonly called the 80-20 split. This is a good data-mining approach: whenever an algorithm has to be trained on a dataset, you should divide the whole bunch of data into two sets, one for training and one for testing your algorithm.

A good split has been shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing data should be the remaining 20 percent.

To split data, we use the following command:

./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

As a result of this command, we will have two new folders containing the training and testing vectors. Now, it is time to train our Naïve Bayes algorithm on the training set of vectors, and the command used is pretty simple:

./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow

Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows:

./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

The following screenshot shows the output of the preceding command:

How it works...

We have given certain commands and seen their outcome, but so far we have done this without understanding why we did it and, above all, why we chose certain parameters. The whole sequence could look meaningless, even to an experienced coder.

Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps:

  • Data preparation
  • Data training
  • Data testing

In general, these are the three procedures that should be followed when mining data. The data-preparation step involves all the operations needed to put the dataset into the format required by the data-mining procedure. In this case, the original format was a bunch of files containing text, and we transformed them into the sequence file format. The main purpose of this is to have a format that can be handled by the MapReduce algorithm. This phase is needed in general, as in most cases the input format is not ready to be used as it is. Sometimes, we also need to merge some data if they are divided into different sources, or use Sqoop to extract data from different data sources.

Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data-mining task and use some of it to train our model. In our case, we are trying to classify whether a document belongs to a certain category based on the frequency of some terms in it. This leads to a classifier that, given another document, can state whether this document falls under a previously found category. The output is a function that is able to determine this association.

Next, we need to evaluate this function, because a classification that looks good in the learning phase may not perform as well on a different document. This three-phased approach is essential in all classification tasks. The main difference lies in the type of classifier used in the training and testing phases. In this case, we use Naïve Bayes, but other classifiers can be used as well. In the Mahout framework, the available classifiers are Naïve Bayes, Decision Forest, and Logistic Regression.

As we have seen, the data preparation consists basically of creating two series of files that will be used for training and testing purposes. The step that transforms the raw text files into the Hadoop sequence format is pretty easy, so we won't spend too long on it. But the next step is the most important one of the data preparation. Let us recall it:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

This computational step basically grabs the whole text from the chunk-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways:

  • The -i parameter is used to declare the input folder where all the sequence files are stored
  • The -o parameter is used to create the output folder where the vector containing the weights is stored
  • The -nv parameter tells Mahout that the output format should be in the namedVector format
  • The -wt parameter tells which frequency function to use for evaluating the weight of every term with respect to a category
  • The -lnorm parameter instructs Mahout to normalize the weights using the L_2 distance
  • The -ow parameter overwrites the previously generated output results
  • The -m parameter gives the minimum log-likelihood ratio

The whole purpose of this computational step is to transform the sequence files containing the documents' raw text into sequence files containing vectors that count the frequency of the terms. Obviously, there are different functions for counting the frequency of a term within the whole set of documents; in Mahout, the possible values for the -wt parameter are tf and tfidf. The tf value is the simpler one and counts the raw term frequency: the frequency of the term Wi inside the set of documents is the ratio of the total occurrences of the word to the total number of words. The tfidf value additionally weights each term frequency with a logarithmic function like this one:

W_i = TF_i \times \log\left(\frac{N}{DF_i}\right)

In the preceding formula, Wi is the TF-IDF weight of the word indexed by i, TFi is its term frequency, N is the total number of documents, and DFi is the number of documents that contain the word indexed by i.

Notice that in this preprocessing phase we index the whole corpus of documents, so that the computed word frequencies are not affected when we split the corpus in the next phase, regardless of whether a word ends up in the training or the testing set.

The reader should therefore grasp that changing this parameter affects the final weight vectors; based on the same text, we could obtain very different outcomes.

The lnorm value basically means that, while a raw weight can be any number ranging from 0 up to an arbitrary positive value, the weights are normalized so that 1 is the maximum possible weight for a word inside the frequency range. The following screenshot shows the content of the output folder:

Various folders are created for storing the word counts, frequencies, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text and then extracting, from every text, the categories and the words.

The final vector file can be found in the tfidf-vectors folder, and to dump vector files to normal text ones, you can use the vectordump command as follows:

mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 -o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump

The dictionary and word files are sequence files containing the association between each word and the unique key created by the MapReduce algorithm. Using the following command:

hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more

one can see, for example, entries such as the following:

adrenal_gland 12912
adrenaline 12913
adrenaline.com 12914

The splitting of the dataset into training and testing sets is done using the split command-line option of Mahout. The interesting parameter in this case is randomSelectionPct, which equals 40: a random selection is used to decide whether each point belongs to the training or the testing dataset.
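To make the idea concrete, the following is a minimal conceptual sketch, in plain Java, of what such a random percentage selection does. It is not Mahout's implementation; the vector identifiers are purely illustrative, and we assume here that the selected percentage goes to the test set:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomSplitSketch {

    public static void main( String[] args ) {
        // Same role as --randomSelectionPct 40 in the split command above:
        // roughly 40 percent of the points are assumed to end up in the test set.
        int randomSelectionPct = 40;
        Random random = new Random();

        List<String> training = new ArrayList<String>();
        List<String> testing = new ArrayList<String>();

        for ( int i = 0; i < 1000; i++ ) {
            String vectorId = "vector-" + i; // stands in for one weighted vector
            if ( random.nextInt( 100 ) < randomSelectionPct ) {
                testing.add( vectorId );
            } else {
                training.add( vectorId );
            }
        }

        System.out.println( "training vectors: " + training.size() );
        System.out.println( "testing vectors: " + testing.size() );
    }
}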

Now comes the interesting part: we are ready to train using the Naïve Bayes algorithm. The output of this step is the model folder, which contains the model in the form of a binary file. This file represents the Naïve Bayes model, which holds the weight matrix, the feature and label sums, and the weight normalizer vectors generated so far.

Now that we have the model, we can test it. The outcome is shown directly on the command line in the form of a confusion matrix; the following screenshot shows the format of this output.

Finally, we test our classifier on the test vectors generated by the split instruction. The output in this case is a confusion matrix. Its format is as shown in the following screenshot:

We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances, which tell us how many sentences have been analyzed. Above this, we have the correctly and incorrectly classified instances. In our case, this means that on a test set of weighted vectors, we have nearly 90 percent correctly classified sentences against an error of about 9 percent.
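In other words, the percentage reported at the top of the output is simply the overall accuracy:

\text{accuracy} = \frac{\text{correctly classified instances}}{\text{total classified instances}}

The confusion matrix below it breaks this global number down per newsgroup.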

But if we go through the matrix row by row, we can see at the end the different newsgroups: a corresponds to alt.atheism, b to comp.graphics, and so on.

So, a first look at the detailed confusion matrix tells us that we did best in classification for the rec.sport.hockey newsgroup, with a value of 418, the highest we have. If we take a look at the corresponding row, we see that 403 out of 412, that is, about 97 percent of the sentences, were correctly placed in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.misc newsgroup, we can see that the overall performance is lower. The sentences are not concentrated in the same newsgroup; the ms-windows sentences are found and classified in other newsgroups too, so we do not have a good classification.

This is reasonable as sports terms like "hockey" are really limited to the hockey world, while sentences about Microsoft could be found both on Microsoft specific newsgroups and in other newsgroups.

We encourage you to run the testing phase again, this time on the training vectors, and see the resulting confusion matrix by giving the following command:

./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

As you can see, the input folder is the same one used for the training phase, and in this case, we have the following confusion matrix:

In this case, we are using the same set for both the training and the testing phases. The first consequence is that the correctly classified sentences rise by about 10 percent, which is even more relevant if you remember that this set of weighted vectors is four times larger than the one used in the testing phase. But probably the most important thing is that the best classification has now moved from the hockey newsgroup to the sci.electronics newsgroup.

There's more

We used exactly the same procedure as the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that, to start the whole process from scratch, you only need to change the input files in the initial folder. So, for the willing reader, we suggest downloading another raw text corpus and performing all the steps on it to see how the results change compared to the initial input text.

We would also suggest that non-native English readers look at the differences obtained by replacing the initial input set with one not written in English. Since the whole text is transformed into weight vectors, the outcome does not depend on the language itself but only on the probability of finding certain word combinations.

As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the sparse weight vectors. This can easily be done by changing, for example, the -wt tfidf parameter in the seq2sparse command line. So, for example, an alternative run of seq2sparse could be the following one:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tf

Finally, note that the Naïve Bayes classifier is not restricted to classifying words in text documents: the algorithm works on vectors of weights, so, for example, it would be easy to create and classify your own weight vectors.

Using the Naïve Bayes classifier from code

So far, we have used Mahout's Naïve Bayes classification from the command line. In this recipe, we will use the same classifier, but we will call it directly from Java code instead of the command line. We will see how to tune its parameters, including some that cannot be modified from the command line. Coding a classifier on the MapReduce framework from scratch can be a difficult task, so it is better to reuse the already coded classifiers and concentrate on fine-tuning your data-mining tasks.

Getting ready

To be ready for the coding part, we have to create a new project in NetBeans and link the correct POM file for the dependency libraries that we are going to use. So, using the NetBeans Maven functionality, create a new Maven project called chapter04. Now, you should have something similar to the following screenshot:

Now, it is time to add the dependencies needed to invoke the Naïve Bayes classifier. To do this, right-click on the Dependencies folder and choose the Add Dependency item from the pop-up menu. The following screenshot shows the form: type the word mahout in the Search text box so that the system displays the locally compiled Mahout artifacts. The JARs required are mahout-core and mahout-math.

How to do it…

Now, we have everything that we need to code our example. Carry out the following steps in order to achieve this:

  1. Open up the app.java default file. First, we need to set the parameters to be used:

    final BayesParameters params = new BayesParameters();
    params.setGramSize( 1 );
    params.set( "verbose", "true" );
    params.set( "classifierType", "bayes" );
    params.set( "defaultCat", "OTHER" );
    params.set( "encoding", "UTF-8" );
    params.set( "alpha_i", "1.0" );
    params.set( "dataSource", "hdfs" );
    params.set( "basePath", "/tmp/output" );

  2. Then, we need to train the classifier by providing the folder containing the input vectors and the folder where the model will be written:

    try {
        Path input = new Path( "/tmp/input" );
        TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

  3. Next, we need to use the Bayes algorithm to evaluate the classifier as follows:

        Algorithm algorithm = new BayesAlgorithm();
        Datastore datastore = new InMemoryBayesDatastore( params );
        ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
        classifier.initialize();
        final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
        String entry = reader.readLine();
        while( entry != null ) {
            List< String > document = new NGrams( entry,
                    Integer.parseInt( params.get( "gramSize" ) ) )
                    .generateNGramsWithoutLabel();
            ClassifierResult result = classifier.classifyDocument(
                    document.toArray( new String[ document.size() ] ),
                    params.get( "defaultCat" ) );
            entry = reader.readLine();
        }
    } catch( final IOException ex ) {
        ex.printStackTrace();
    } catch( final InvalidDatastoreException ex ) {
        ex.printStackTrace();
    }

Simple, isn't it? Once compiled, check the input files and observe the output of the code provided. We are now ready to go into the details to understand the differences between the Mahout classifiers and to highlight the possibilities they offer.
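Note that the loop in step 3 computes a ClassifierResult for every line but never prints it. Assuming the usual accessors on ClassifierResult (getLabel() and getScore() in the Mahout sources this recipe is based on), a minimal sketch of reporting each classification inside the while loop could be:

// Inside the while loop, right after classifyDocument(...):
// print the category assigned to the current line together with its score.
// getLabel() and getScore() are assumed accessors on ClassifierResult.
System.out.println( entry + " -> " + result.getLabel()
        + " (score: " + result.getScore() + ")" );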

How it works...

As we can see, we have the following actions:

  • Initializing the parameters for the trainer
  • Reading the input files
  • Training the classifier using the parameters and the input files
  • Outputting the results

First of all, we initialize the parameters object that stores every available input parameter for the classifier. We should notice the following parameter in particular:

params.set( "classifierType", "bayes" );

Until now, we have used the Naïve Bayes classifier type. Once we have initialized the parameters, we are ready to move to the core part: the TrainClassifier.trainNaiveBayes static method is invoked by passing the input path to the weight vector files, the output path, and the parameters object.

This phase builds the binary model file, which is saved into the output folder defined in the params object by the following line:

params.set( "basePath", "/tmp/output" );

So, we now have our model saved and stored.

Try creating two different classifiers trained on the same input vectors, so that you have two models ready to be tested on the same data. This is also a good idea in the development phase, before going into production, to evaluate which algorithm is the best one to use for training purposes.
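As a minimal sketch of this idea, and assuming (as the last recipe in this article shows) that switching to the complementary classifier only requires changing the classifierType parameter, you could train two models from the same input vectors into two hypothetical output folders and compare them later on the same test data:

// Two parameter sets sharing the same input, differing only in classifierType
// and in the (hypothetical) output folder used to store each model.
BayesParameters bayesParams = new BayesParameters();
bayesParams.setGramSize( 1 );
bayesParams.set( "classifierType", "bayes" );
bayesParams.set( "basePath", "/tmp/output-bayes" );

BayesParameters cbayesParams = new BayesParameters();
cbayesParams.setGramSize( 1 );
cbayesParams.set( "classifierType", "cbayes" );
cbayesParams.set( "basePath", "/tmp/output-cbayes" );

Path input = new Path( "/tmp/input" );
TrainClassifier.trainNaiveBayes( input, "/tmp/output-bayes", bayesParams );
TrainClassifier.trainNaiveBayes( input, "/tmp/output-cbayes", cbayesParams );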

As a final step, we need to read the input file, generate n-grams, and classify them according to the classifier used. This can be done using the following code:

final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
String entry = reader.readLine();
while( entry != null ) {
    List< String > document = new NGrams( entry,
            Integer.parseInt( params.get( "gramSize" ) ) )
            .generateNGramsWithoutLabel();
    ClassifierResult result = classifier.classifyDocument(
            document.toArray( new String[ document.size() ] ),
            params.get( "defaultCat" ) );
    entry = reader.readLine();
}

We need to provide a few details on how the Naïve Bayes classifier works with word groups. In a sentence, a group of n consecutive words is called an n-gram. A single word is a 1-gram; but, for example, "Barack Obama", even if composed of two 1-grams, normally appears as an associated pair, so it can be counted as a single unit.

So, in this case, the minimum gram size is set to 1 in the parameters with the following code:

params.setGramSize( 1 );

However, you can change this minimum gram size so as to avoid considering single words and instead count the frequency of at least a pair of combined words. So, setting setGramSize to 1 means that "Obama" and "Barack" are each counted as one occurrence, while with a value of 2, it is the occurrence of "Barack Obama" that is counted once in the frequency count.
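To see the effect in code, here is a small sketch that reuses the NGrams helper from the listing above; the sample sentence is purely illustrative, and the exact tokens returned depend on the Mahout version in use:

// Illustrative only: compare how the same sentence is broken up with
// gram sizes of 1 and 2, using the NGrams helper from the earlier listing.
String sentence = "Barack Obama visited Berlin";

// With a gram size of 1, each single word is expected to be a separate token.
List< String > unigrams = new NGrams( sentence, 1 ).generateNGramsWithoutLabel();

// With a gram size of 2, pairs such as "Barack Obama" are expected to be
// counted as a single unit in the frequency counts.
List< String > bigrams = new NGrams( sentence, 2 ).generateNGramsWithoutLabel();

System.out.println( "1-grams: " + unigrams );
System.out.println( "2-grams: " + bigrams );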

Now, we can test against a different dataset by reusing Mahout's testnb command-line option, giving it the generated model and the input corpus provided for testing. This will display the confusion matrix, which will allow us to better evaluate how the training went.

There's more

Text classification is probably one of the most interesting tasks in a classification algorithm. This is because teaching a machine to make sense of a document the way we humans do is never easy.

Many settings contribute to creating a good classifier. Let us review them briefly:

  • The language of the documents
  • The size of the documents
  • The way we create vectors
  • The way we divide and create the training and testing sets

Just to give some hints: if we have documents with many sentences written in different languages, the classification task is not so easy. The size of the documents is another important factor; obviously, the more you have, the better it is, but beware. We do not have a cluster large enough to test this, but as pointed out on the Mahout project site, the Naïve Bayes algorithm and the Complementary Naïve Bayes algorithm can handle millions to hundreds of millions of documents.

As we mentioned earlier, the vectors can be created in different ways, varying how the word counts are evaluated and at what granularity. For the willing reader, we point out that there are other techniques for counting words or for deciding the best frequency measure to use for counting occurrences. We encourage you to take a look at the Mahout source code to see how the analyzer class can be extended to add some of these evaluation methods.

When evaluating a classifier, we also recommend performing multiple runs of the algorithm using the code we provided. Multiple runs with different settings give the trainer different outcomes, so it becomes possible to evaluate better which parameter combination is the best one for the input dataset.

Using Complementary Naïve Bayes from the command line

We are now ready to use the Complementary Naïve Bayes Mahout classifier from the command line.

Getting ready

We are ready with the input, because we will use the same input that we used for the Naïve Bayes classifier. So, this recipe is prepared in the same way as the Naïve Bayes one: in the ${WORK_DIR}/20news-train-vectors folder, you should have the training vectors, with the weights coded as key/value pairs, ready to be used.

How to do it…

Now, we can enter the following command from a terminal console:

./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow cbayes

The output will be a model created as a binary file in the model subfolder. To test the Complementary Naïve Bayes classifier, we give the following command:

./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing cbayes

How it works…

The Complementary Naïve Bayes classifier is closely related to the Naïve Bayes classifier. Basically, the main difference between the two is how they evaluate the final weight of each term for a class. In the Complementary Naïve Bayes case, the weight of a term for a class is estimated from the TF-IDF counts of the documents belonging to all the other classes (the complement of the class), with a formula of the form:

w_{ci} = \log \frac{\sum_{j : y_j \neq c} d_{ij} + \alpha_i}{\sum_{j : y_j \neq c} \sum_{k} d_{kj} + \alpha}

Here, d_ij is the (TF-IDF) weight of term i in document j, the sums run over the documents that do not belong to class c, and alpha_i and alpha are smoothing parameters (see the paper referenced in the See also section for the full derivation).

So, you can easily change the algorithm from Naïve Bayes to Complementary Naïve Bayes just by changing a parameter.

See also

You can refer to the Tackling the Poor Assumptions of Naïve Bayes Text Classifiers paper for an evaluation of the differences between the two classifiers. It is available at http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

Coding the Complementary Naïve Bayes classifier

Now that we have seen how the Complementary Naïve Bayes classifier can be invoked from the command line, we are ready to use it from our Java code.

Getting ready

Create a new Maven project and link it to your locally compiled Mahout Maven artifacts. In this case, since we have already created the chapter04 Mahout project, we will simply add a new Java class to it. If you are reading this recipe directly, refer to the previous Getting ready section on how to set up the project.

So, fire up NetBeans, right-click on chapter04, and choose New Java Class. Complete the form as shown in the following screenshot, and click on Finish.

How to do it…

Now that our class is ready to use, you can type in the same code that we used in the previous recipe; we refer you to that code. We only need one little change: the params.set( "classifierType", "bayes" ); line must be changed to the following line:

params.set( "classifierType", "cbayes" );

This will change the trainer by telling it to use the cbayes weight formula to calculate the weights.

How it works...

As we have seen, the same parameters used from the command line are used here. In fact, we have only recoded the same behavior as the command-line utility.

Take a look at the Mahout source code to understand how the trainer changes between the Naïve Bayes case and the Complementary Naïve Bayes case. The purpose here is only to show that applying different classifiers to the same input is easy to code and can help in fine-tuning your data-mining tasks.

For a more general introduction on the Complementary Naïve Bayes classifier in Mahout, refer to the Mahout website, https://cwiki.apache.org/confluence/display/MAHOUT/Bayesian.

We only point out these little changes to show that you could use the existing Naïve Bayes classifier and change it according to the way you would like.

The main difference between Naïve Bayes and Complementary Naïve Bayes is the way in which the two algorithms calculate the weight of words. So basically, the only change from an algorithmic perspective is that of a function. However, you could adapt the function to what you would like to test.


About the Author


Piero Giacomelli

Piero Giacomelli started playing with computers back in 1986, when he received his first PC (a Commodore 64). Despite his love for computers, he graduated in Mathematics, entered the professional software industry in 1997, and started using Java.

He has been involved in a lot of software projects using Java, .NET, and PHP. He is not only a great fan of JBoss and Apache technologies, but also uses Microsoft technologies without moral issues.

He has worked in many different industrial sectors, such as aerospace, ISP, textile and plastic manufacturing, and e-health association, both as a software developer and as an IT manager. He has also been involved in many EU research-funded projects in FP7 EU programs, such as CHRONIOUS, I-DONT-FALL, FEARLESS, and CHROMED.

In recent years, he has published papers in scientific journals and has been awarded two best paper awards by the International Academy, Research and Industry Association (IARIA).

In 2012, he published HornetQ Messaging Developer's Guide with Packt Publishing, a reference book for the HornetQ framework.

He is married with two kids, and in his spare time, he regresses to his infancy ages to play with toys and his kids.
