Chapter 7. Text Classification

In this chapter, we will cover the following recipes:

  • Bag of words feature extraction

  • Training a Naive Bayes classifier

  • Training a decision tree classifier

  • Training a maximum entropy classifier

  • Training scikit-learn classifiers

  • Measuring precision and recall of a classifier

  • Calculating high information words

  • Combining classifiers with voting

  • Classifying with multiple binary classifiers

  • Training a classifier with NLTK-Trainer

Introduction


Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, a classifier can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative; the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.

Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A labeled feature set is simply a tuple that looks like (feat, label), while an unlabeled feature set is the feat by itself. A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True. As the documents may contain unknown words, and the number of possible words may be very large, words that don't occur in the text are omitted, instead of being included in the feature set with a value of False.
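To make these data structures concrete, here is a minimal, hypothetical example; the words and labels are purely illustrative:

# A feature set is a dict mapping feature names (words) to values (all True).
feat = {'great': True, 'movie': True}

# A labeled feature set pairs the feature dict with its class label.
labeled_feat = (feat, 'pos')

# Training data is simply a list of such (feat, label) tuples.
train_data = [({'great': True, 'movie': True}, 'pos'),
              ({'boring': True, 'plot': True}, 'neg')]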

Bag of words feature extraction


Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.

How to do it...

The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:

def bag_of_words(words):
  return dict([(word, True) for word in words])

We can use it with a list of words; in this case, the tokenized sentence the quick brown fox:

>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'the': True, 'fox': True}
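As a small, optional extension of the same idea (not part of the code shown above), you could exclude common stopwords before building the word presence dict. This is a minimal sketch using NLTK's English stopwords list; the function name is illustrative:

from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile='english'):
  # Build the word presence dict, skipping anything in the stopword list.
  badwords = set(stopwords.words(stopfile))
  return dict([(word, True) for word in words if word not in badwords])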

Training a Naive Bayes classifier


Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The formula is:

P(label | features) = P(label) * P(features | label) / P(features)

The following list describes the various parameters from the previous formula:

  • P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%.

  • P(features | label): This is the probability that a given feature set occurs with that label (the likelihood). This is based on which features have occurred with each label in the training data.

  • P(features): This is the prior probability of a given feature set occurring, regardless of label. Because it is the same for every label, it acts as a normalizing constant and doesn't change which label comes out as most likely.
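To give a sense of how a full training run fits together, here is a hedged sketch that builds labeled feature sets from the movie_reviews corpus and trains a NaiveBayesClassifier. The label_feats() helper and the 750/250 split are illustrative choices, not necessarily the book's exact featx code:

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import movie_reviews

def bag_of_words(words):
  return dict([(word, True) for word in words])

def label_feats(label):
  # One (featureset, label) tuple per review file in the given category.
  return [(bag_of_words(movie_reviews.words(fileid)), label)
          for fileid in movie_reviews.fileids(categories=[label])]

pos_feats = label_feats('pos')
neg_feats = label_feats('neg')

# Illustrative 75/25 split of the 1000 reviews per label.
train_feats = pos_feats[:750] + neg_feats[:750]
test_feats = pos_feats[750:] + neg_feats[750:]

nb_classifier = NaiveBayesClassifier.train(train_feats)
print(accuracy(nb_classifier, test_feats))
nb_classifier.show_most_informative_features(n=5)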

Training a decision tree classifier


The DecisionTreeClassifier class works by creating a tree structure, where each node corresponds to a feature name and the branches correspond to the feature values. Tracing down the branches, you get to the leaves of the tree, which are the classification labels.

How to do it...

Using the same train_feats and test_feats variables we created from the movie_reviews corpus in the previous recipe, we can call the DecisionTreeClassifier.train() class method to get a trained classifier. We pass binary=True because all of our features are binary: either the word is present or it's not. For other classification use cases where you have multivalued features, you will want to stick to the default binary=False.

Note

In this context, binary refers to feature values, and is not to be confused with a binary classifier. Our word features are binary because the value is either True or the word is not present. If our features could take more than two values, we would have to stick with the default binary=False.
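Here is a minimal sketch of the training call, assuming the train_feats and test_feats variables built earlier from the movie_reviews corpus. The cutoff values are just examples of the tunable keyword arguments that DecisionTreeClassifier.train() accepts to limit tree growth:

from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy

# binary=True because every feature value is True (word presence).
dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True,
  entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30)
print(accuracy(dt_classifier, test_feats))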

Training a maximum entropy classifier


The third classifier we will cover is the MaxentClassifier class, also known as a conditional exponential classifier or logistic regression classifier. The maximum entropy classifier converts labeled feature sets to vectors using an encoding. The encoded vectors are then used to calculate weights for each feature, which are combined to determine the most likely label for a feature set. For more details on the math behind this, see https://en.wikipedia.org/wiki/Maximum_entropy_classifier.

Getting ready

The MaxentClassifier class requires the NumPy package. This is because the feature encodings use NumPy arrays. You can find installation details at the following link:

http://www.scipy.org/Installing_SciPy

Note

The MaxentClassifier class algorithms can be quite memory-hungry, so you may want to quit all your other programs while training a MaxentClassifier, just to be safe.

How to do it...

We will use the same train_feats and test_feats variables from the previous recipes, and call the MaxentClassifier.train() class method to get a trained classifier.
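A minimal sketch, again assuming the same train_feats and test_feats; the algorithm choice and iteration limits below are illustrative among the options MaxentClassifier.train() supports:

from nltk.classify import MaxentClassifier
from nltk.classify.util import accuracy

# 'gis' (General Iterative Scaling) is one of the supported algorithms;
# max_iter and min_lldelta keep the demo's training time short.
me_classifier = MaxentClassifier.train(train_feats, algorithm='gis',
  trace=0, max_iter=10, min_lldelta=0.5)
print(accuracy(me_classifier, test_feats))
me_classifier.show_most_informative_features(n=5)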

Training scikit-learn classifiers


Scikit-learn is one of the best machine learning libraries available in any programming language. It contains all sorts of machine learning algorithms for many different purposes, but they all follow the same fit/predict design pattern:

  • Fit the model to the data

  • Use the model to make predictions

We won't be accessing the scikit-learn models directly in this recipe. Instead, we'll be using NLTK's SklearnClassifier class, which is a wrapper class around a scikit-learn model to make it conform to NLTK's ClassifierI interface. This means that the SklearnClassifier class can be trained and used much like the classifiers we've used in the previous recipes in this chapter.

Note

I may use the terms scikit-learn and sklearn interchangeably in this recipe.

Getting ready

To use the SklearnClassifier class, you must have scikit-learn installed. Instructions are available online at http://scikit-learn.org/stable/install.html. If you have all the dependencies installed, such as NumPy and SciPy, you are ready to go.
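Here is a hedged sketch of wrapping a scikit-learn model with SklearnClassifier, assuming the same train_feats and test_feats as before; LogisticRegression is just one example of a model you could pass in:

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from sklearn.linear_model import LogisticRegression

# SklearnClassifier converts NLTK featuresets to sklearn vectors internally,
# so the wrapped model trains and classifies like any other NLTK classifier.
sk_classifier = SklearnClassifier(LogisticRegression())
sk_classifier.train(train_feats)
print(accuracy(sk_classifier, test_feats))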

Measuring precision and recall of a classifier


In addition to accuracy, there are a number of other metrics used to evaluate classifiers. Two of the most common are precision and recall. To understand these two metrics, we must first understand false positives and false negatives. A false positive happens when a classifier assigns a label to a feature set that shouldn't have it. A false negative happens when a classifier fails to assign a label to a feature set that should have it. In a binary classifier, these errors come in pairs: a false positive for one label is simultaneously a false negative for the other label.

Here's an example: the classifier classifies a movie review as pos when it should have been neg. This counts as a false positive for the pos label, and a false negative for the neg label. If the classifier had correctly guessed neg, then it would count as a true positive for the neg label, and a true negative for the pos label.

How does this apply to precision and recall? Precision is the lack of false positives, and recall is the lack of false negatives.
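A minimal sketch of computing per-label precision and recall with NLTK's metrics functions, assuming a trained classifier (nb_classifier here) and the test_feats list from the earlier sketches:

import collections
from nltk.metrics import precision, recall

# Group the known (reference) labels and the classifier's guesses
# into sets of featureset indexes, one set per label.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_feats):
  refsets[label].add(i)
  testsets[nb_classifier.classify(feats)].add(i)

print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))
print('neg precision:', precision(refsets['neg'], testsets['neg']))
print('neg recall:', recall(refsets['neg'], testsets['neg']))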

Calculating high information words


A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier class and the MaxentClassifier class. Somewhat surprisingly, the top words are different for both classifiers. This discrepancy is due to how each classifier calculates the significance of each feature, and it's actually beneficial to have these different methods as they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting.

The low information words are words that are common to all labels. It may be counterintuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall. The reason this works is that using only high information words reduces the noise and confusion of a classifier's internal model. If all the words/features are highly informative, the classifier has a much easier time separating the labels.
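The featx helper used in the book is not shown in this excerpt, but the general approach can be sketched with NLTK's frequency distributions and the chi-squared association measure: count how often each word occurs overall and per label, score each word against each label, and keep only words above a cutoff. The names and the cutoff below are illustrative:

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# Count word occurrences overall and within each label.
for label in movie_reviews.categories():
  for word in movie_reviews.words(categories=[label]):
    word_fd[word] += 1
    label_word_fd[label][word] += 1

total_count = word_fd.N()
high_info_words = set()

for label in label_word_fd.conditions():
  label_count = label_word_fd[label].N()
  for word, freq in word_fd.items():
    score = BigramAssocMeasures.chi_sq(label_word_fd[label][word],
                                       (freq, label_count), total_count)
    if score >= 5:  # illustrative cutoff
      high_info_words.add(word)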

Combining classifiers with voting


One way to improve classification performance is to combine classifiers. The simplest way to combine multiple classifiers is to use voting, and choose whichever label gets the most votes. For this style of voting, it's best to have an odd number of classifiers so that there are no ties. This means combining at least three classifiers together. The individual classifiers should also use different algorithms; the idea is that multiple algorithms are better than one, and the combination of many can compensate for individual bias. However, combining a poorly performing classifier with better performing classifiers is generally not a good idea, because the poor performance of one classifier can bring the total accuracy down.

Getting ready

As we need to have at least three trained classifiers to combine, we are going to use a NaiveBayesClassifier class, a DecisionTreeClassifier class, and a MaxentClassifier class, all trained on the highest information words of the movie_reviews corpus.
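The recipe's own voting class is not shown in this excerpt, but here is a minimal sketch of one way to implement max-vote combination behind NLTK's ClassifierI interface; the class name and structure are illustrative:

import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class MaxVoteClassifier(ClassifierI):
  def __init__(self, *classifiers):
    self._classifiers = classifiers
    # The union of all labels the wrapped classifiers know about.
    self._labels = sorted(set(itertools.chain(
      *[c.labels() for c in classifiers])))

  def labels(self):
    return self._labels

  def classify(self, feats):
    # Each wrapped classifier casts one vote; the most common label wins.
    votes = FreqDist()
    for classifier in self._classifiers:
      votes[classifier.classify(feats)] += 1
    return votes.max()

Usage would then look like MaxVoteClassifier(nb_classifier, dt_classifier, me_classifier), after which the combined classifier can be evaluated with accuracy, precision, and recall just like any single classifier.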

Classifying with multiple binary classifiers


So far we have focused on binary classifiers, which classify with one of two possible labels. The same techniques for training a binary classifier can also be used to create a multi-class classifier, which is a classifier that can classify with one of many possible labels. But there are also cases where you need to be able to classify with multiple labels. A classifier that can return more than one label is a multi-label classifier.

A common technique for creating a multi-label classifier is to combine many binary classifiers, one for each label. You train each binary classifier so that it either returns a known label or returns something else to signal that the label does not apply. Then, you can run all the binary classifiers on your feature set to collect all the applicable labels.

Getting ready

The reuters corpus contains multi-labeled text that we can use for training and evaluation:

>>> from nltk.corpus import reuters
>>> len(reuters.categories())
90
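To illustrate the one-binary-classifier-per-label idea on the reuters corpus, here is a hedged sketch; the helper names, the word presence features, and the '!' prefix for the negative class are illustrative choices, not the book's exact code:

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import reuters

def bag_of_words(words):
  return dict([(word, True) for word in words])

# The reuters fileids are split into 'training/...' and 'test/...' documents.
train_ids = [f for f in reuters.fileids() if f.startswith('training')]

# Precompute one featureset per training document, plus its set of labels.
reuters_train_feats = [(bag_of_words(reuters.words(f)), set(reuters.categories(f)))
                       for f in train_ids]

# One binary classifier per label: each learns label vs. "not this label".
label_classifiers = {}
for label in reuters.categories():
  labeled = [(feats, label if label in labels else '!' + label)
             for feats, labels in reuters_train_feats]
  label_classifiers[label] = NaiveBayesClassifier.train(labeled)

def multi_classify(feats):
  # Collect every label whose binary classifier says it applies.
  return set(label for label, c in label_classifiers.items()
             if c.classify(feats) == label)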

Training a classifier with NLTK-Trainer


In this recipe, we'll cover the train_classifier.py script from NLTK-Trainer, which lets you train NLTK classifiers from the command line. NLTK-Trainer was previously introduced at the end of Chapter 4, Part-of-speech Tagging, and again at the end of Chapter 5, Extracting Chunks.

Note

You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.

How to do it...

As with train_tagger.py and train_chunker.py, the only required argument for train_classifier.py is the name of a corpus. The corpus must have a categories() method, because text classification is all about learning to classify categories. Here's an example of running train_classifier.py on the movie_reviews corpus:

$ python train_classifier.py movie_reviews
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0...
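The script also accepts options for choosing the algorithm, the training unit, and the train/test split. From memory of the online documentation, these include flags such as --classifier, --instances, and --fraction, but treat the exact names as assumptions and check python train_classifier.py --help for the authoritative list. A hypothetical invocation might look like this:

$ python train_classifier.py movie_reviews --classifier NaiveBayes --instances paras --fraction 0.75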