Chapter 7. Text Classification

In this chapter, we will cover the following recipes:

  • Bag of words feature extraction

  • Training a Naive Bayes classifier

  • Training a decision tree classifier

  • Training a maximum entropy classifier

  • Training scikit-learn classifiers

  • Measuring precision and recall of a classifier

  • Calculating high information words

  • Combining classifiers with voting

  • Classifying with multiple binary classifiers

  • Training a classifier with NLTK-Trainer

Introduction


Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, a classifier can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative; the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.

Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A labeled feature set is simply a tuple that looks like (feat, label), while an unlabeled feature set is the feat by itself. A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True. As the documents may contain unknown words, and the number of possible words may be very large, words that don't occur in the text are omitted, instead of being included in the feature set with a value of False.
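To make these data structures concrete, here is a minimal, hypothetical example; the words and labels are purely illustrative:

# A feature set is a dict mapping feature names (words) to values (all True).
feat = {'great': True, 'movie': True}

# A labeled feature set pairs the feature dict with its class label.
labeled_feat = (feat, 'pos')

# Training data is simply a list of such (feat, label) tuples.
train_data = [({'great': True, 'movie': True}, 'pos'),
              ({'boring': True, 'plot': True}, 'neg')]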

Bag of words feature extraction


Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.

How to do it...

The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:

def bag_of_words(words):
  return dict([(word, True) for word in words])

We can use it with a list of words; in this case, the tokenized sentence the quick brown fox:

>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'the': True, 'fox': True}
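As a small, optional extension of the same idea (not part of the code shown above), you could exclude common stopwords before building the word presence dict. This is a minimal sketch using NLTK's English stopwords list; the function name is illustrative:

from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile='english'):
  # Build the word presence dict, skipping anything in the stopword list.
  badwords = set(stopwords.words(stopfile))
  return dict([(word, True) for word in words if word not in badwords])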

Training a Naive Bayes classifier


Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The formula is:

P(label | features) = P(label) * P(features | label) / P(features)

The following list describes the various parameters from the previous formula:

  • P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%.

  • P(features | label): This is the probability that a given feature set occurs with that label (the likelihood). This is based on which features have occurred with each label in the training data.

  • P(features): This is the prior probability of a given feature set occurring, regardless of label. Because it is the same for every label, it acts as a normalizing constant and doesn't change which label comes out as most likely.
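To give a sense of how a full training run fits together, here is a hedged sketch that builds labeled feature sets from the movie_reviews corpus and trains a NaiveBayesClassifier. The label_feats() helper and the 750/250 split are illustrative choices, not necessarily the book's exact featx code:

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import movie_reviews

def bag_of_words(words):
  return dict([(word, True) for word in words])

def label_feats(label):
  # One (featureset, label) tuple per review file in the given category.
  return [(bag_of_words(movie_reviews.words(fileid)), label)
          for fileid in movie_reviews.fileids(categories=[label])]

pos_feats = label_feats('pos')
neg_feats = label_feats('neg')

# Illustrative 75/25 split of the 1000 reviews per label.
train_feats = pos_feats[:750] + neg_feats[:750]
test_feats = pos_feats[750:] + neg_feats[750:]

nb_classifier = NaiveBayesClassifier.train(train_feats)
print(accuracy(nb_classifier, test_feats))
nb_classifier.show_most_informative_features(n=5)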

Training a decision tree classifier


The DecisionTreeClassifier class works by creating a tree structure, where each node corresponds to a feature name and the branches correspond to the feature values. Tracing down the branches, you get to the leaves of the tree, which are the classification labels.

How to do it...

Using the same train_feats and test_feats variables we created from the movie_reviews corpus in the previous recipe, we can call the DecisionTreeClassifier.train() class method to get a trained classifier. We pass binary=True because all of our features are binary: either the word is present or it's not. For other classification use cases where you have multivalued features, you will want to stick to the default binary=False.

Note

In this context, binary refers to feature values, and is not to be confused with a binary classifier. Our word features are binary because the value is either True or the word is not present. If our features could take more than two values, we would have to stick with the default binary=False.
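Here is a minimal sketch of the training call, assuming the train_feats and test_feats variables built earlier from the movie_reviews corpus. The cutoff values are just examples of the tunable keyword arguments that DecisionTreeClassifier.train() accepts to limit tree growth:

from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy

# binary=True because every feature value is True (word presence).
dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True,
  entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30)
print(accuracy(dt_classifier, test_feats))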

Training a maximum entropy classifier


The third classifier we will cover is the MaxentClassifier class, also known as a conditional exponential classifier or logistic regression classifier. The maximum entropy classifier converts labeled feature sets to vectors using an encoding. The encoded vectors are then used to calculate weights for each feature, which are combined to determine the most likely label for a feature set. For more details on the math behind this, see https://en.wikipedia.org/wiki/Maximum_entropy_classifier.

Getting ready

The MaxentClassifier class requires the NumPy package. This is because the feature encodings use NumPy arrays. You can find installation details at the following link:

http://www.scipy.org/Installing_SciPy

Note

The MaxentClassifier class algorithms can be quite memory-hungry, so you may want to quit all your other programs while training a MaxentClassifier, just to be safe.

How to do it...

We will use the same train_feats and test_feats variables from the previous recipes, and call the MaxentClassifier.train() class method to get a trained classifier.
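A minimal sketch, again assuming the same train_feats and test_feats; the algorithm choice and iteration limits below are illustrative among the options MaxentClassifier.train() supports:

from nltk.classify import MaxentClassifier
from nltk.classify.util import accuracy

# 'gis' (General Iterative Scaling) is one of the supported algorithms;
# max_iter and min_lldelta keep the demo's training time short.
me_classifier = MaxentClassifier.train(train_feats, algorithm='gis',
  trace=0, max_iter=10, min_lldelta=0.5)
print(accuracy(me_classifier, test_feats))
me_classifier.show_most_informative_features(n=5)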

Training scikit-learn classifiers


Scikit-learn is one of the best machine learning libraries available in any programming language. It contains all sorts of machine learning algorithms for many different purposes, but they all follow the same fit/predict design pattern:

  • Fit the model to the data

  • Use the model to make predictions

We won't be accessing the scikit-learn models directly in this recipe. Instead, we'll be using NLTK's SklearnClassifier class, which is a wrapper class around a scikit-learn model to make it conform to NLTK's ClassifierI interface. This means that the SklearnClassifier class can be trained and used much like the classifiers we've used in the previous recipes in this chapter.

Note

I may use the terms scikit-learn and sklearn interchangeably in this recipe.

Getting ready

To use the SklearnClassifier class, you must have scikit-learn installed. Instructions are available online at http://scikit-learn.org/stable/install.html. If you have all the dependencies installed, such as NumPy and SciPy, you are ready to go.
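Here is a hedged sketch of wrapping a scikit-learn model with SklearnClassifier, assuming the same train_feats and test_feats as before; LogisticRegression is just one example of a model you could pass in:

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from sklearn.linear_model import LogisticRegression

# SklearnClassifier converts NLTK featuresets to sklearn vectors internally,
# so the wrapped model trains and classifies like any other NLTK classifier.
sk_classifier = SklearnClassifier(LogisticRegression())
sk_classifier.train(train_feats)
print(accuracy(sk_classifier, test_feats))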

Measuring precision and recall of a classifier


In addition to accuracy, there are a number of other metrics used to evaluate classifiers. Two of the most common are precision and recall. To understand these two metrics, we must first understand false positives and false negatives. A false positive happens when a classifier assigns a label to a feature set that shouldn't have it. A false negative happens when a classifier fails to assign a label to a feature set that should have it. In a binary classifier, these errors come in pairs: a false positive for one label is simultaneously a false negative for the other label.

Here's an example: the classifier classifies a movie review as pos when it should have been neg. This counts as a false positive for the pos label, and a false negative for the neg label. If the classifier had correctly guessed neg, then it would count as a true positive for the neg label, and a true negative for the pos label.

How does this apply to precision and recall? Precision is the lack of false positives, and recall is the lack of false negatives.
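A minimal sketch of computing per-label precision and recall with NLTK's metrics functions, assuming a trained classifier (nb_classifier here) and the test_feats list from the earlier sketches:

import collections
from nltk.metrics import precision, recall

# Group the known (reference) labels and the classifier's guesses
# into sets of featureset indexes, one set per label.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_feats):
  refsets[label].add(i)
  testsets[nb_classifier.classify(feats)].add(i)

print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))
print('neg precision:', precision(refsets['neg'], testsets['neg']))
print('neg recall:', recall(refsets['neg'], testsets['neg']))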

Calculating high information words


A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier class and the MaxentClassifier class. Somewhat surprisingly, the top words are different for both classifiers. This discrepancy is due to how each classifier calculates the significance of each feature, and it's actually beneficial to have these different methods as they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting.

The low information words are words that are common to all labels. It may be counterintuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall. The reason this works is that using only high information words reduces the noise and confusion of a classifier's internal model. If all the words/features are highly informative, the classifier has a much easier time separating the labels.
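The featx helper used in the book is not shown in this excerpt, but the general approach can be sketched with NLTK's frequency distributions and the chi-squared association measure: count how often each word occurs overall and per label, score each word against each label, and keep only words above a cutoff. The names and the cutoff below are illustrative:

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# Count word occurrences overall and within each label.
for label in movie_reviews.categories():
  for word in movie_reviews.words(categories=[label]):
    word_fd[word] += 1
    label_word_fd[label][word] += 1

total_count = word_fd.N()
high_info_words = set()

for label in label_word_fd.conditions():
  label_count = label_word_fd[label].N()
  for word, freq in word_fd.items():
    score = BigramAssocMeasures.chi_sq(label_word_fd[label][word],
                                       (freq, label_count), total_count)
    if score >= 5:  # illustrative cutoff
      high_info_words.add(word)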

Combining classifiers with voting


One way to improve classification performance is to combine classifiers. The simplest way to combine multiple classifiers is to use voting, and choose whichever label gets the most votes. For this style of voting, it's best to have an odd number of classifiers so that there are no ties. This means combining at least three classifiers together. The individual classifiers should also use different algorithms; the idea is that multiple algorithms are better than one, and the combination of many can compensate for individual bias. However, combining a poorly performing classifier with better performing classifiers is generally not a good idea, because the poor performance of one classifier can bring the total accuracy down.

Getting ready

As we need to have at least three trained classifiers to combine, we are going to use a NaiveBayesClassifier class, a DecisionTreeClassifier class, and a MaxentClassifier class, all trained on the highest information words of the movie_reviews corpus.
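The recipe's own voting class is not shown in this excerpt, but here is a minimal sketch of one way to implement max-vote combination behind NLTK's ClassifierI interface; the class name and structure are illustrative:

import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class MaxVoteClassifier(ClassifierI):
  def __init__(self, *classifiers):
    self._classifiers = classifiers
    # The union of all labels the wrapped classifiers know about.
    self._labels = sorted(set(itertools.chain(
      *[c.labels() for c in classifiers])))

  def labels(self):
    return self._labels

  def classify(self, feats):
    # Each wrapped classifier casts one vote; the most common label wins.
    votes = FreqDist()
    for classifier in self._classifiers:
      votes[classifier.classify(feats)] += 1
    return votes.max()

Usage would then look like MaxVoteClassifier(nb_classifier, dt_classifier, me_classifier), after which the combined classifier can be evaluated with accuracy, precision, and recall just like any single classifier.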

Classifying with multiple binary classifiers


So far we have focused on binary classifiers, which classify with one of two possible labels. The same techniques for training a binary classifier can also be used to create a multi-class classifier, which is a classifier that can classify with one of many possible labels. But there are also cases where you need to be able to classify with multiple labels. A classifier that can return more than one label is a multi-label classifier.

A common technique for creating a multi-label classifier is to combine many binary classifiers, one for each label. You train each binary classifier so that it either returns a known label or returns something else to signal that the label does not apply. Then, you can run all the binary classifiers on your feature set to collect all the applicable labels.

Getting ready

The reuters corpus contains multi-labeled text that we can use for training and evaluation:

>>> from nltk.corpus import reuters
>>> len(reuters.categories())
90
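To illustrate the one-binary-classifier-per-label idea on the reuters corpus, here is a hedged sketch; the helper names, the word presence features, and the '!' prefix for the negative class are illustrative choices, not the book's exact code:

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import reuters

def bag_of_words(words):
  return dict([(word, True) for word in words])

# The reuters fileids are split into 'training/...' and 'test/...' documents.
train_ids = [f for f in reuters.fileids() if f.startswith('training')]

# Precompute one featureset per training document, plus its set of labels.
reuters_train_feats = [(bag_of_words(reuters.words(f)), set(reuters.categories(f)))
                       for f in train_ids]

# One binary classifier per label: each learns label vs. "not this label".
label_classifiers = {}
for label in reuters.categories():
  labeled = [(feats, label if label in labels else '!' + label)
             for feats, labels in reuters_train_feats]
  label_classifiers[label] = NaiveBayesClassifier.train(labeled)

def multi_classify(feats):
  # Collect every label whose binary classifier says it applies.
  return set(label for label, c in label_classifiers.items()
             if c.classify(feats) == label)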

Training a classifier with NLTK-Trainer


In this recipe, we'll cover the train_classifier.py script from NLTK-Trainer, which lets you train NLTK classifiers from the command line. NLTK-Trainer was previously introduced at the end of Chapter 4, Part-of-speech Tagging, and again at the end of Chapter 5, Extracting Chunks.

Note

You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.

How to do it...

As with train_tagger.py and train_chunker.py, the only required argument for train_classifier.py is the name of a corpus. The corpus must have a categories() method, because text classification is all about learning to classify categories. Here's an example of running train_classifier.py on the movie_reviews corpus:

$ python train_classifier.py movie_reviews
loading movie_reviews
2 labels: ['neg', 'pos']
using bag of words feature extraction
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0...
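The script also accepts options for choosing the algorithm, the training unit, and the train/test split. From memory of the online documentation, these include flags such as --classifier, --instances, and --fraction, but treat the exact names as assumptions and check python train_classifier.py --help for the authoritative list. A hypothetical invocation might look like this:

$ python train_classifier.py movie_reviews --classifier NaiveBayes --instances paras --fraction 0.75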