Python Text Processing with NLTK 2.0 Cookbook

eBook: $23.99
Formats: PDF, PacktLib, ePub and Mobi
save 15%!
Print + free eBook + free PacktLib access to the book: $63.98    Print cover: $39.99
save 44%!
Free Shipping!
UK, US, Europe and selected countries in Asia.
  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Book Details

Language: English
Paperback: 272 pages [235mm x 191mm]
Release Date: November 2010
ISBN: 1849513600
ISBN 13: 9781849513609
Author(s): Jacob Perkins
Topics and Technologies: All Books, Application Development, Cookbooks, Open Source

Table of Contents

Chapter 1: Tokenizing Text and WordNet Basics
Chapter 2: Replacing and Correcting Words
Chapter 3: Creating Custom Corpora
Chapter 4: Part-of-Speech Tagging
Chapter 5: Extracting Chunks
Chapter 6: Transforming Chunks and Trees
Chapter 7: Text Classification
Chapter 8: Distributed Processing and Handling Large Datasets
Chapter 9: Parsing Specific Data
Appendix: Penn Treebank Part-of-Speech Tags
  • Chapter 1: Tokenizing Text and WordNet Basics
    • Introduction
    • Tokenizing text into sentences
    • Tokenizing sentences into words
    • Tokenizing sentences using regular expressions
    • Filtering stopwords in a tokenized sentence
    • Looking up synsets for a word in WordNet
    • Looking up lemmas and synonyms in WordNet
    • Calculating WordNet synset similarity
    • Discovering word collocations
  • Chapter 2: Replacing and Correcting Words
    • Introduction
    • Stemming words
    • Lemmatizing words with WordNet
    • Translating text with Babelfish
    • Replacing words matching regular expressions
    • Removing repeating characters
    • Spelling correction with Enchant
    • Replacing synonyms
    • Replacing negations with antonyms
  • Chapter 3: Creating Custom Corpora
    • Introduction
    • Setting up a custom corpus
    • Creating a word list corpus
    • Creating a part-of-speech tagged word corpus
    • Creating a chunked phrase corpus
    • Creating a categorized text corpus
    • Creating a categorized chunk corpus reader
    • Lazy corpus loading
    • Creating a custom corpus view
    • Creating a MongoDB backed corpus reader
    • Corpus editing with file locking
  • Chapter 4: Part-of-Speech Tagging
    • Introduction
    • Default tagging
    • Training a unigram part-of-speech tagger
    • Combining taggers with backoff tagging
    • Training and combining Ngram taggers
    • Creating a model of likely word tags
    • Tagging with regular expressions
    • Affix tagging
    • Training a Brill tagger
    • Training the TnT tagger
    • Using WordNet for tagging
    • Tagging proper names
    • Classifier based tagging
  • Chapter 5: Extracting Chunks
    • Introduction
    • Chunking and chinking with regular expressions
    • Merging and splitting chunks with regular expressions
    • Expanding and removing chunks with regular expressions
    • Partial parsing with regular expressions
    • Training a tagger-based chunker
    • Classification-based chunking
    • Extracting named entities
    • Extracting proper noun chunks
    • Extracting location chunks
    • Training a named entity chunker
  • Chapter 6: Transforming Chunks and Trees
    • Introduction
    • Filtering insignificant words
    • Correcting verb forms
    • Swapping verb phrases
    • Swapping noun cardinals
    • Swapping infinitive phrases
    • Singularizing plural nouns
    • Chaining chunk transformations
    • Converting a chunk tree to text
    • Flattening a deep tree
    • Creating a shallow tree
    • Converting tree nodes
  • Chapter 7: Text Classification
    • Introduction
    • Bag of Words feature extraction
    • Training a naive Bayes classifier
    • Training a decision tree classifier
    • Training a maximum entropy classifier
    • Measuring precision and recall of a classifier
    • Calculating high information words
    • Combining classifiers with voting
    • Classifying with multiple binary classifiers
  • Chapter 8: Distributed Processing and Handling Large Datasets
    • Introduction
    • Distributed tagging with execnet
    • Distributed chunking with execnet
    • Parallel list processing with execnet
    • Storing a frequency distribution in Redis
    • Storing a conditional frequency distribution in Redis
    • Storing an ordered dictionary in Redis
    • Distributed word scoring with Redis and execnet
  • Chapter 9: Parsing Specific Data
    • Introduction
    • Parsing dates and times with Dateutil
    • Time zone lookup and conversion
    • Tagging temporal expressions with Timex
    • Extracting URLs from HTML with lxml
    • Cleaning and stripping HTML
    • Converting HTML entities with BeautifulSoup
    • Detecting and converting character encodings

Jacob Perkins

Jacob Perkins is the author of Packt's Python Text Processing with NLTK 2.0 Cookbook, and a contributor to the Bad Data Handbook. He is the CTO and cofounder of Weotta, a natural-language-based search engine for local entertainment. He also created a site that demos NLTK functionality and provides natural language processing APIs, and he writes about natural language processing and Python programming on his blog. You can follow him on Twitter.


Code Downloads

Download the code and support files for this book.

Submit Errata

Please let us know if you have found any errors not already listed below by completing our errata submission form. Our editors will check them and add them to the list. Thank you.


6 errata submitted; last submission: 17 Feb 2014

Errata type: Others | Page number: 29

The WordNetLemmatizer is a thin wrapper around the WordNet corpus, and uses the morphy() function of the WordNetCorpusReader to find a lemma. If no lemma is found, or the word itself is a lemma, the word is returned as is. Unlike with stemming, knowing the part of speech of the word is important. As demonstrated previously, "cooking" does not return a different lemma unless you specify that the part of speech (pos) is a verb. This is because the default part of speech is a noun, and as a noun, "cooking" is its own lemma. "Cookbooks", on the other hand, is a noun, and its lemma is the singular form, "cookbook".
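The behavior described can be illustrated with a toy stand-in (this is not NLTK's actual implementation; the lookup table and function here are hypothetical, standing in for the morphy() lookup):

```python
# Toy stand-in for the pos-sensitive lemma lookup described above.
# NLTK's WordNetLemmatizer delegates the real work to
# WordNetCorpusReader.morphy(); this table is purely illustrative.
LEMMAS = {
    ('cooking', 'v'): 'cook',       # as a verb, 'cooking' has the lemma 'cook'
    ('cookbooks', 'n'): 'cookbook', # plural noun -> singular lemma
}

def lemmatize(word, pos='n'):  # the default part of speech is noun, as in NLTK
    # If no lemma is found, or the word is its own lemma, return it unchanged.
    return LEMMAS.get((word, pos), word)

lemmatize('cooking')           # 'cooking' - as a noun it is its own lemma
lemmatize('cooking', pos='v')  # 'cook'
lemmatize('cookbooks')         # 'cookbook'
```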


Errata type: Others | Page number: 35

The replacement string is then used to keep all the matched groups, while discarding the backreference to the second group. So the word "looooove" gets split into its matched groups and then recombined as "loooove", discarding the last "o". This continues until only one "o" remains, when repeat_regexp no longer matches the string, and no more characters are removed.
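The mechanics can be sketched with the recipe's regular expression as a minimal standalone function (the book's full RepeatReplacer also consults WordNet before removing characters, so that valid words such as "goose" are left intact):

```python
import re

# A doubled character is matched as group 2 plus its backreference (\2),
# with the surrounding text captured in groups 1 and 3.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
repl = r"\1\2\3"  # keep the groups, drop the backreferenced duplicate

def remove_repeats(word):
    repl_word = repeat_regexp.sub(repl, word)
    if repl_word == word:  # no doubled characters left to remove
        return word
    return remove_repeats(repl_word)

remove_repeats('looooove')  # 'love'
```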



Errata type: Code | Page number: 40

In the first line of code, "wordReplacer" should be "WordReplacer".

Errata type: Code | Page number: 36

Replace the last line of the code snippet:

self.max_dist = 2

with:

self.max_dist = max_dist
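In context, the fix makes max_dist come from the constructor argument rather than being hardcoded. A minimal sketch of the corrected constructor (the full recipe's replace() method uses Enchant and nltk.metrics.edit_distance, omitted here):

```python
class SpellingReplacer(object):
    # Sketch of the corrected constructor: max_dist is taken from the
    # argument instead of being overwritten with the literal 2.
    def __init__(self, dict_name='en', max_dist=2):
        self.dict_name = dict_name  # Enchant dictionary name in the recipe
        self.max_dist = max_dist    # maximum edit distance for suggestions
```

With the misprinted line, passing max_dist to the constructor had no effect; with the fix, SpellingReplacer(max_dist=3) behaves as expected.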

Errata type: Code | Page number: 104


The following line should be added before the return statement in choose_tag(): if not fd: return None
The code should be:

class WordNetTagger(SequentialBackoffTagger):
    def __init__(self, *args, **kwargs):
        SequentialBackoffTagger.__init__(self, *args, **kwargs)
        self.wordnet_tag_map = {
            'n': 'NN',
            's': 'JJ',
            'a': 'JJ',
            'r': 'RB',
            'v': 'VB'
        }

    def choose_tag(self, tokens, index, history):
        word = tokens[index]
        fd = FreqDist()

        for synset in wordnet.synsets(word):
            fd.inc(synset.pos)

        if not fd: return None
        return self.wordnet_tag_map.get(fd.max())

Errata type: Code | Page number: 216

NLTK 2.0.4 changed the internal implementation of ConditionalFreqDist, which
requires the RedisConditionalHashFreqDist to be updated:

class RedisConditionalHashFreqDist(ConditionalFreqDist):
    def __init__(self, r, name, cond_samples=None):
        self._r = r
        self._name = name
        ConditionalFreqDist.__init__(self, cond_samples)
        # initialize self._fdists for all matching keys
        for key in self._r.keys(encode_key('%s:*' % name)):
            condition = key.split(':')[1]
            self[condition]  # calls self.__getitem__(condition)

    def __contains__(self, condition):
        return super(RedisConditionalHashFreqDist, self).__contains__(condition)

    def __getitem__(self, condition):
        if condition not in self:
            key = '%s:%s' % (self._name, condition)
            self[condition] = RedisHashFreqDist(self._r, key)

        return super(RedisConditionalHashFreqDist, self).__getitem__(condition)

    def clear(self):
        for key, fdist in self.items():
            del self[key]

Sample chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.

Frequently bought together

Python Text Processing with NLTK 2.0 Cookbook + Linux Shell Scripting Cookbook = 50% off the second eBook
Price for both: $34.95

Buy both these recommended eBooks together and get 50% off the cheapest eBook.

What you will learn from this book

  • Learn text categorization and topic identification
  • Learn stemming and lemmatization, and how to go beyond the usual spell checker
  • Replace negations with antonyms in your text
  • Learn to tokenize text into lists of sentences and words, and gain an insight into WordNet
  • Transform and manipulate chunks and trees
  • Learn advanced features of corpus readers and create your own custom corpora
  • Tag different parts of speech by creating, training, and using a part-of-speech tagger
  • Improve accuracy by combining multiple part-of-speech taggers
  • Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence
  • Produce an alternative canonical form without changing the meaning by normalizing parsed chunks
  • Learn how search engines use Natural Language Processing to process text
  • Make your site more discoverable by learning how to automatically replace words with more searched equivalents
  • Parse dates, times, and HTML
  • Train and manipulate different types of classifiers

In Detail

Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short so you can dive right into the science of text processing with a practical, hands-on approach.

Start off by learning to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.
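As a taste of how lightweight these recipes are, the feature extraction step can be as simple as the book's bag-of-words recipe, which maps each word to True (NLTK's classifiers consume such featureset dicts):

```python
def bag_of_words(words):
    # Mark every word in the list as a present feature; the resulting
    # dict is the featureset format NLTK classifiers train on.
    return dict((word, True) for word in words)

bag_of_words(['the', 'quick', 'brown', 'fox'])
# {'the': True, 'quick': True, 'brown': True, 'fox': True}
```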

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.


The learn-by-doing approach of this book will enable you to dive right into the heart of text processing from the very first page. Each recipe is carefully designed to satisfy your appetite for Natural Language Processing. Packed with numerous illustrative examples and code samples, it will make the task of using the NLTK for Natural Language Processing easy and straightforward.

Who this book is for

This book is for Python programmers who want to quickly get to grips with using the NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced with the NLTK will also find it useful. Students of linguistics will find it invaluable.
