Python Text Processing with NLTK 2.0 Cookbook

By Jacob Perkins
About this book

Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short, and you dive right into the science of text processing with a practical, hands-on approach.

Get started by learning how to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of Stemming and Lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

Publication date: November 2010
Publisher: Packt
Pages: 272
ISBN: 9781849513609

 

Chapter 1. Tokenizing Text and WordNet Basics

In this chapter, we will cover:

  • Tokenizing text into sentences

  • Tokenizing sentences into words

  • Tokenizing sentences using regular expressions

  • Filtering stopwords in a tokenized sentence

  • Looking up synsets for a word in WordNet

  • Looking up lemmas and synonyms in WordNet

  • Calculating WordNet synset similarity

  • Discovering word collocations

 

Introduction


NLTK is the Natural Language Toolkit, a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has been adopted in the industry for research and development due to its usefulness and breadth of coverage.

This chapter will cover the basics of tokenizing text and using WordNet. Tokenization is a method of breaking up a piece of text into smaller pieces, such as sentences and words, and is an essential first step for the recipes in later chapters.

WordNet is a dictionary designed for programmatic access by natural language processing systems. NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. We'll be using WordNet again in later chapters, so it's important to familiarize yourself with the basics first.

 

Tokenizing text into sentences


Tokenization is the process of splitting a string into a list of pieces, or tokens. We'll start by splitting a paragraph into a list of sentences.

Getting ready

Installation instructions for NLTK are available at http://www.nltk.org/download and the latest version as of this writing is 2.0b9. NLTK requires Python 2.4 or higher, but is not compatible with Python 3.0. The recommended Python version is 2.6.

Once you've installed NLTK, you'll also need to install the data by following the instructions at http://www.nltk.org/data. We recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/english.pickle.
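
If you prefer to install the data from inside Python rather than from the website, NLTK also ships with an interactive downloader. Here is a minimal sketch (the exact interface depends on your platform and NLTK version):

>>> import nltk
>>> nltk.download()  # opens the NLTK data downloader; installing everything gets all corpora and models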

Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://www.nltk.org/getting-started. On Mac and Linux/Unix, you can open a terminal and type python.

How to do it...

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

>>> para = "Hello World. It's good to see you. Thanks for buying this book."

Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.

How it works...

sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained, and works well for many European languages, so it knows which punctuation and characters mark the end of one sentence and the beginning of the next.

There's more...

The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the PunktSentenceTokenizer once, and call its tokenize() method instead.

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

Other languages

If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer. Here's an example for Spanish:

>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')

See also

In the next recipe, we'll learn how to split sentences into individual words. After that, we'll cover how to use regular expressions for tokenizing text.

 

Tokenizing sentences into words


In this recipe, we'll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of all text processing.

How to do it...

Basic word tokenization is very simple: use the word_tokenize() function:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']

How it works...

word_tokenize() is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer. It's equivalent to the following:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']

It works by separating words using spaces and punctuation, and as you can see, it does not discard the punctuation, allowing you to decide what to do with it.
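
If you'd rather discard the punctuation tokens afterwards, one simple option is to filter them against Python's string.punctuation (a small sketch of our own, not something the tokenizer does for you):

>>> import string
>>> tokens = tokenizer.tokenize('Hello World.')
>>> [t for t in tokens if t not in string.punctuation]
['Hello', 'World']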

There's more...

Ignoring the obviously named WhitespaceTokenizer and SpaceTokenizer, there are two other word tokenizers worth looking at: PunktWordTokenizer and WordPunctTokenizer. These differ from the TreebankWordTokenizer in how they handle punctuation and contractions, but they all inherit from TokenizerI.

Contractions

TreebankWordTokenizer uses conventions found in the Penn Treebank corpus, which we'll be using for training in Chapter 4, Part-of-Speech Tagging and Chapter 5, Extracting Chunks. One of these conventions is to separate contractions. For example:

>>> word_tokenize("can't")
['ca', "n't"]

If you find this convention unacceptable, then read on for alternatives, and see the next recipe for tokenizing with regular expressions.

PunktWordTokenizer

An alternative word tokenizer is the PunktWordTokenizer. It splits on punctuation, but keeps it with the word instead of creating separate tokens.

>>> from nltk.tokenize import PunktWordTokenizer
>>> tokenizer = PunktWordTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'t", 'is', 'a', 'contraction.']

WordPunctTokenizer

Another alternative word tokenizer is WordPunctTokenizer. It splits all punctuation into separate tokens.

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']

See also

For more control over word tokenization, you'll want to read the next recipe to learn how to use regular expressions and the RegexpTokenizer for tokenization.

 

Tokenizing sentences using regular expressions


Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, we only recommend using them if the word tokenizers covered in the previous recipe are unacceptable.

Getting ready

First you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:

  • Match on the tokens

  • Match on the separators, or gaps

We'll start with an example of the first, matching alphanumeric tokens plus single quotes so that we don't split up contractions.

How to do it...

We'll create an instance of the RegexpTokenizer, giving it a regular expression string to use for matching tokens.

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

There's also a simple helper function you can use in case you don't want to instantiate the class.

>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", "[\w']+")
["Can't", 'is', 'a', 'contraction']

Now we finally have something that can treat contractions as whole words, instead of splitting them into tokens.

How it works...

The RegexpTokenizer works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but the RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we'll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they're reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you have the ability to provide your own tokenizer instance if the default tokenizer is unsuitable.
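
To make this concrete, here is a brief sketch (ours, not the book's) of handing your RegexpTokenizer to a PlaintextCorpusReader via its word_tokenizer keyword argument; the corpus path below is just a placeholder for a directory of plain text files, and custom corpora are covered properly in Chapter 3:

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> from nltk.tokenize import RegexpTokenizer
>>> # '/path/to/corpus' is a hypothetical directory containing .txt files
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt',
...     word_tokenizer=RegexpTokenizer(r"[\w']+"))
>>> words = reader.words()  # words are now tokenized with our pattern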

There's more...

RegexpTokenizer can also work by matching the gaps, instead of the tokens. Instead of using re.findall(), the RegexpTokenizer will use re.split(). This is how the BlanklineTokenizer in nltk.tokenize is implemented.
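
For example, BlanklineTokenizer treats runs of blank lines as the gaps, so it effectively splits text into paragraphs (a quick illustration of our own):

>>> from nltk.tokenize import BlanklineTokenizer
>>> BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
['First paragraph.', 'Second paragraph.']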

Simple whitespace tokenizer

Here's a simple example of using the RegexpTokenizer to tokenize on whitespace:

>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
 ["Can't", 'is', 'a', 'contraction.']

Notice that punctuation still remains in the tokens.

See also

For simpler word tokenization, see the previous recipe.

 

Filtering stopwords in a tokenized sentence


Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. Most search engines will filter stopwords out of search queries and documents in order to save space in their index.

Getting ready

NLTK comes with a stopwords corpus that contains word lists for many languages. Be sure to unzip the datafile so NLTK can find these word lists in nltk_data/corpora/stopwords/.

How to do it...

We're going to create a set of all English stopwords, then use it to filter stopwords from a sentence.

>>> from nltk.corpus import stopwords
>>> english_stops = set(stopwords.words('english'))
>>> words = ["Can't", 'is', 'a', 'contraction']
>>> [word for word in words if word not in english_stops]
["Can't", 'contraction']

How it works...

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. As such, it has a words() method that can take a single argument for the file ID, which in this case is 'english', referring to a file containing a list of English stopwords. You could also call stopwords.words() with no argument to get a list of all stopwords in every language available.
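
As a quick illustration of the English list (our own check, not from the book), you can intersect it with a few candidate words; common function words are present, while content words are not:

>>> sorted(english_stops & set(['a', 'is', 'the', 'Python']))
['a', 'is', 'the']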

There's more...

You can see the list of all English stopwords using stopwords.words('english') or by examining the word list file at nltk_data/corpora/stopwords/english. There are also stopword lists for many other languages. You can see the complete list of languages using the fileids() method:

>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

Any of these fileids can be used as an argument to the words() method to get a list of stopwords for that language.
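
If you'll be filtering stopwords in more than one language, a small helper keeps things tidy (a sketch of our own; the function name is not part of NLTK):

>>> def remove_stopwords(words, language='english'):
...     stops = set(stopwords.words(language))
...     return [word for word in words if word not in stops]
>>> remove_stopwords(["Can't", 'is', 'a', 'contraction'])
["Can't", 'contraction']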

See also

If you'd like to create your own stopwords corpus, see the Creating a word list corpus recipe in Chapter 3, Creating Custom Corpora, to learn how to use the WordListCorpusReader. We'll also be using stopwords in the Discovering word collocations recipe, later in this chapter.

 

Looking up synsets for a word in WordNet


WordNet is a lexical database for the English language. In other words, it's a dictionary designed specifically for natural language processing.

NLTK comes with a simple interface for looking up words in WordNet. What you get is a list of synset instances, which are groupings of synonymous words that express the same concept. Many words have only one synset, but some have several. We'll now explore a single synset, and in the next recipe, we'll look at several in more detail.

Getting ready

Be sure you've unzipped the wordnet corpus in nltk_data/corpora/wordnet. This will allow the WordNetCorpusReader to access it.

How to do it...

Now we're going to look up the synset for cookbook, and explore some of the properties and methods of a synset.

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('cookbook')[0]
>>> syn.name
'cookbook.n.01'
>>> syn.definition
'a book of recipes and cooking directions'

How it works...

You can look up any word in WordNet using wordnet.synsets(word) to get a list of synsets. The list may be empty if the word is not found. The list may also have quite a few elements, as some words can have many possible meanings and therefore many synsets.
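
For example, a made-up string that isn't in WordNet simply returns an empty list (a trivial check of our own):

>>> wordnet.synsets('asdfasdf')
[]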

There's more...

Each synset in the list has a number of attributes you can use to learn more about it. The name attribute will give you a unique name for the synset, which you can use to get the synset directly.

>>> wordnet.synset('cookbook.n.01')
Synset('cookbook.n.01')

The definition attribute should be self-explanatory. Some synsets also have an examples attribute, which contains a list of phrases that use the word in context.

>>> wordnet.synsets('cooking')[0].examples
['cooking can be a great art', 'people are needed who have experience in cookery', 'he left the preparation of meals to his wife']

Hypernyms

Synsets are organized in a kind of inheritance tree. More abstract terms are known as hypernyms and more specific terms are hyponyms. This tree can be traced all the way up to a root hypernym.

Hypernyms provide a way to categorize and group words based on their similarity to each other. The synset similarity recipe details the functions used to calculate similarity based on the distance between two words in the hypernym tree.

>>> syn.hypernyms()
[Synset('reference_book.n.01')]
>>> syn.hypernyms()[0].hyponyms()
[Synset('encyclopedia.n.01'), Synset('directory.n.01'), Synset('source_book.n.01'), Synset('handbook.n.01'), Synset('instruction_book.n.01'), Synset('cookbook.n.01'), Synset('annual.n.02'), Synset('atlas.n.02'), Synset('wordbook.n.01')]
>>> syn.root_hypernyms()
[Synset('entity.n.01')]

As you can see, reference book is a hypernym of cookbook, but cookbook is only one of many hyponyms of reference book. All these types of books have the same root hypernym, entity, one of the most abstract terms in the English language. You can trace the entire path from entity down to cookbook using the hypernym_paths() method.

>>> syn.hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('creation.n.02'), Synset('product.n.02'), Synset('work.n.02'), Synset('publication.n.01'), Synset('book.n.01'), Synset('reference_book.n.01'), Synset('cookbook.n.01')]]

This method returns a list of lists, where each list starts at the root hypernym and ends with the original Synset. Most of the time you'll only get one nested list of synsets.
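
Since the method returns a list of lists, you can count both the paths and the synsets in each path; counting the synsets printed above gives the same numbers (an informal check with the same WordNet data):

>>> len(syn.hypernym_paths())
1
>>> len(syn.hypernym_paths()[0])
12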

Part-of-speech (POS)

You can also look up a simplified part-of-speech tag.

>>> syn.pos
'n'

There are four common POS found in WordNet.

Part-of-speech    Tag
Noun              n
Adjective         a
Adverb            r
Verb              v

These POS tags can be used for looking up specific synsets for a word. For example, the word great can be used as a noun or an adjective. In WordNet, great has one noun synset and six adjective synsets.

>>> len(wordnet.synsets('great'))
7
>>> len(wordnet.synsets('great', pos='n'))
1
>>> len(wordnet.synsets('great', pos='a'))
6

These POS tags will be referenced more in the Using WordNet for Tagging recipe of Chapter 4, Part-of-Speech Tagging.

See also

In the next two recipes, we'll explore lemmas and how to calculate synset similarity. In Chapter 2, Replacing and Correcting Words, we'll use WordNet for lemmatization, synonym replacement, and then explore the use of antonyms.

 

Looking up lemmas and synonyms in WordNet


Building on the previous recipe, we can also look up lemmas in WordNet to find synonyms of a word. A lemma (in linguistics) is the canonical form, or morphological form, of a word.

How to do it...

In the following block of code, we'll find that there are two lemmas for the cookbook synset by using the lemmas attribute:

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('cookbook')[0]
>>> lemmas = syn.lemmas
>>> len(lemmas)
2
>>> lemmas[0].name
'cookbook'
>>> lemmas[1].name
'cookery_book'
>>> lemmas[0].synset == lemmas[1].synset
True

How it works...

As you can see, cookery_book and cookbook are two distinct lemmas in the same synset. In fact, a lemma can only belong to a single synset. In this way, a synset represents a group of lemmas that all have the same meaning, while a lemma represents a distinct word form.

There's more...

Since lemmas in a synset all have the same meaning, they can be treated as synonyms. So if you wanted to get all synonyms for a synset, you could do:

>>> [lemma.name for lemma in syn.lemmas]
['cookbook', 'cookery_book']

All possible synonyms

As mentioned before, many words have multiple synsets because the word can have different meanings depending on the context. But let's say you didn't care about the context, and wanted to get all possible synonyms for a word.

>>> synonyms = []
>>> for syn in wordnet.synsets('book'):
...     for lemma in syn.lemmas:
...         synonyms.append(lemma.name)
>>> len(synonyms)
38

As you can see, there appear to be 38 possible synonyms for the word book. But in fact, some are verb forms, and many are just different usages of book. If we instead take the set of synonyms, there are fewer unique words.

>>> len(set(synonyms))
25

Antonyms

Some lemmas also have antonyms. The word good, for example, has 27 synsets, five of which have lemmas with antonyms.

>>> gn2 = wordnet.synset('good.n.02')
>>> gn2.definition
'moral excellence or admirableness'
>>> evil = gn2.lemmas[0].antonyms()[0]
>>> evil.name
'evil'
>>> evil.synset.definition
'the quality of being morally wrong in principle or practice'
>>> ga1 = wordnet.synset('good.a.01')
>>> ga1.definition
'having desirable or positive qualities especially those suitable for a thing specified'
>>> bad = ga1.lemmas[0].antonyms()[0]
>>> bad.name
'bad'
>>> bad.synset.definition
'having undesirable or negative qualities'

The antonyms() method returns a list of lemmas. In the first case here, we see that the second synset for good as a noun is defined as moral excellence, and its first antonym is evil, defined as morally wrong. In the second case, when good is used as an adjective to describe positive qualities, the first antonym is bad, which describes negative qualities.

See also

In the next recipe, we'll learn how to calculate synset similarity. Then in Chapter 2, Replacing and Correcting Words, we'll revisit lemmas for lemmatization, synonym replacement, and antonym replacement.

 

Calculating WordNet synset similarity


Synsets are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the synsets it contains. The closer two synsets are in the tree, the more similar they are.

How to do it...

If you were to look at all the hyponyms of reference book (which is the hypernym of cookbook), you'd see that one of them is instruction_book. This seems intuitively very similar to cookbook, so let's see what WordNet similarity has to say about it.

>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.91666666666666663

So they are over 91% similar!

How it works...

wup_similarity is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two synsets and their common hypernym.

>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> ib.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2

So cookbook and instruction book must be very similar, because they are only one step away from the same hypernym, reference book, and therefore only two steps away from each other.

There's more...

Let's look at two dissimilar words to see what kind of score we get. We'll compare dog with cookbook, two seemingly very different words.

>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093

Wow, dog and cookbook are apparently 38% similar! This is because they share common hypernyms farther up the tree.

>>> dog.common_hypernyms(cb)
[Synset('object.n.01'), Synset('whole.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

Comparing verbs

The previous comparisons were all between nouns, but the same can be done for verbs as well.

>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.synset('bake.v.02')
>>> cook.wup_similarity(bake)
0.75

The previous synsets were obviously handpicked for demonstration, and the reason is that the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns can be traced up to object, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate similarity. For example, if you were to use the synset for bake.v.01 here, instead of bake.v.02, the return value would be None. This is because the root hypernyms of the two synsets are different, with no overlapping paths. For this reason, you also cannot calculate similarity between words with different parts of speech.
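
You can check that behaviour directly. Because None produces no output in the interactive console, it's clearer to test for it explicitly; this reflects the NLTK 2.0 behaviour described above, and newer versions of NLTK may return a score instead:

>>> cook.wup_similarity(wordnet.synset('bake.v.01')) is None
True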

Path and LCH similarity

Two other similarity comparisons are the path similarity and Leacock Chodorow (LCH) similarity.

>>> cb.path_similarity(ib)
0.33333333333333331
>>> cb.path_similarity(dog)
0.071428571428571425
>>> cb.lch_similarity(ib)
2.5389738710582761
>>> cb.lch_similarity(dog)
0.99852883011112725

As you can see, the number ranges are very different for these scoring methods, which is why we prefer the wup_similarity() method.

See also

The recipe on Looking up synsets for a word in WordNet, discussed earlier in this chapter, has more details about hypernyms and the hypernym tree.

 

Discovering word collocations


Collocations are two or more words that tend to appear frequently together, such as "United States". Of course, there are many other words that can come after "United", for example "United Kingdom", "United Airlines", and so on. As with many aspects of natural language processing, context is very important, and for collocations, context is everything!

In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text. For fun, we'll start with the script for Monty Python and the Holy Grail.

Getting ready

The script for Monty Python and the Holy Grail is found in the webtext corpus, so be sure that it's unzipped in nltk_data/corpora/webtext/.

How to do it...

We're going to create a list of all lowercased words in the text, and then produce a BigramCollocationFinder, which we can use to find bigrams (pairs of words). These bigrams are found using association measurement functions in the nltk.metrics package.

>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]

Well that's not very useful! Let's refine it a bit by adding a word filter to remove punctuation and stopwords.

>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[('black', 'knight'), ('clop', 'clop'), ('head', 'knight'), ('mumble', 'mumble')]

Much better—we can clearly see four of the most common bigrams in Monty Python and the Holy Grail. If you'd like to see more than four, simply increase the number to whatever you want, and the collocation finder will do its best.

How it works...

The BigramCollocationFinder constructs two frequency distributions: one for each word, and another for bigrams. A frequency distribution, or FreqDist in NLTK, is basically an enhanced dictionary where the keys are what's being counted and the values are the counts. Any filtering functions that are applied reduce the size of these two FreqDists by eliminating any words that don't pass the filter. By using a filtering function to eliminate all words that are one or two characters long, and all English stopwords, we can get a much cleaner result. After filtering, the collocation finder is ready to accept a generic scoring function for finding collocations. Additional scoring functions are covered in the Scoring functions section further in this chapter.
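
Because the finder accepts any compatible scoring function, you can also re-score the same filtered finder with a different measure, or tighten its filters further. Here's a small experiment of our own (outputs omitted, since the top pairs may differ from the four shown above):

>>> bcf.apply_freq_filter(3)           # drop bigrams seen fewer than 3 times
>>> by_freq = bcf.nbest(BigramAssocMeasures.raw_freq, 4)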

There's more...

In addition to BigramCollocationFinder, there's also TrigramCollocationFinder, for finding triples instead of pairs. This time, we'll look for trigrams in Australian singles ads.

>>> from nltk.collocations import TrigramCollocationFinder
>>> from nltk.metrics import TrigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('singles.txt')]
>>> tcf = TrigramCollocationFinder.from_words(words)
>>> tcf.apply_word_filter(filter_stops)
>>> tcf.apply_freq_filter(3)
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)
[('long', 'term', 'relationship')]

Now, we don't know whether people are looking for a long-term relationship or not, but clearly it's an important topic. In addition to the stopword filter, we also applied a frequency filter, which removed any trigrams that occurred fewer than three times. This is why only one result was returned when we asked for four: there was only one result that occurred three or more times.

Scoring functions

There are many more scoring functions available besides likelihood_ratio(), but other than raw_freq(), you may need a bit of a statistics background to understand how they work. Consult the NLTK API documentation for NgramAssocMeasures in the nltk.metrics package to see all the possible scoring functions.

Scoring ngrams

In addition to the nbest() method, there are two other ways to get ngrams (a generic term for describing bigrams and trigrams) from a collocation finder.

  1. above_score(score_fn, min_score) can be used to get all ngrams with scores that are at least min_score. The min_score that you choose will depend heavily on the score_fn you use.

  2. score_ngrams(score_fn) will return a list with tuple pairs of (ngram, score). This can be used to inform your choice for min_score in the previous step; both methods are sketched briefly after this list.
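
Continuing with the bcf finder from earlier, here is a brief sketch of both methods (our own example; outputs are omitted because the exact scores depend on the corpus and on the filters you've applied):

>>> scored = bcf.score_ngrams(BigramAssocMeasures.likelihood_ratio)
>>> min_score = scored[9][1]   # score of the 10th-best bigram
>>> top = list(bcf.above_score(BigramAssocMeasures.likelihood_ratio, min_score))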

See also

The nltk.metrics module will be used again in Chapter 7, Text Classification.

About the Author
  • Jacob Perkins

    Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go.

    He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com.

    To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.
