Chapter 6. Transforming Chunks and Trees

In this chapter, we will cover the following recipes:

  • Filtering insignificant words from a sentence

  • Correcting verb forms

  • Swapping verb phrases

  • Swapping noun cardinals

  • Swapping infinitive phrases

  • Singularizing plural nouns

  • Chaining chunk transformations

  • Converting a chunk tree to text

  • Flattening a deep tree

  • Creating a shallow tree

  • Converting tree labels

Introduction


Now that you know how to get chunks/phrases from a sentence, what do you do with them? This chapter will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees. The functions detailed in these recipes modify data, as opposed to learning from it. This means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this chapter, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this chapter is what you can do with a chunk, not where it came from.

Filtering insignificant words from a sentence


Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase the movie was terrible, the most significant words are movie and terrible, while the and was are almost useless. You could get the same meaning if you took them out, that is, movie terrible or terrible movie. Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words and keep the significant ones by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT
Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix. Refer to Appendix A, Penn Treebank...
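The filter_insignificant() function in transforms.py (used again later in the Chaining chunk transformations recipe) implements this idea. A minimal sketch of how it might look, assuming a tag_suffixes keyword argument listing the suffixes to filter out:

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
  good = []

  for word, tag in chunk:
    ok = True

    # drop the word if its tag ends with any insignificant suffix
    for suffix in tag_suffixes:
      if tag.endswith(suffix):
        ok = False
        break

    if ok:
      good.append((word, tag))

  return good

With this sketch, the movie review example behaves as expected:

>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]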

Correcting verb forms


It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?" The verb is should only be used with singular nouns, while are is for plural nouns, such as children. We can correct these mistakes by creating verb correction mappings that are applied depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular and another for singular to plural:

plural_verb_forms = {
  ('is', 'VBZ'): ('are', 'VBP'),
  ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
  ('are', 'VBP'): ('is', 'VBZ'),
  ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics of mapping is to are, was to were, and vice versa.

How to do it...

In transforms.py is a function called correct_verbs...
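A sketch of how correct_verbs() might work, together with the first_chunk_index() and tag_startswith() helpers it relies on (an illustrative implementation consistent with how these helpers are used in the rest of this chapter, not a verbatim listing):

def first_chunk_index(chunk, pred, start=0, step=1):
  l = len(chunk)
  end = l if step > 0 else -1

  # scan the chunk in the given direction for the first matching element
  for i in range(start, end, step):
    if pred(chunk[i]):
      return i

  return None

def tag_startswith(prefix):
  def f(wt):
    return wt[1].startswith(prefix)
  return f

def correct_verbs(chunk):
  vbidx = first_chunk_index(chunk, tag_startswith('VB'))
  # if no verb is found, do nothing
  if vbidx is None:
    return chunk

  verb, vbtag = chunk[vbidx]
  nnpred = tag_startswith('NN')
  # look for the nearest noun to the right of the verb,
  # falling back to the nearest noun on the left
  nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)

  if nnidx is None:
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)

  # if no noun is found at all, do nothing
  if nnidx is None:
    return chunk

  noun, nntag = chunk[nnidx]
  # plural noun tags end with S; pick the matching correction mapping
  if nntag.endswith('S'):
    chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
  else:
    chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

  return chunk

Under this sketch, the is our children learning? example corrects as expected:

>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]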

Swapping verb phrases


Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, the book was great can be transformed into the great book. This kind of normalization can also help with frequency analysis, by counting two apparently different phrases as the same phrase.

How to do it...

In transforms.py is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
  # the pivot verb must not be a gerund (VBG), and its tag must be longer
  # than plain 'VB', so bare infinitives don't match
  def vbpred(wt):
    word, tag = wt
    return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2

  vbidx = first_chunk_index(chunk, vbpred)

  if vbidx is None:
    return chunk

  # everything after the verb, followed by everything before it;
  # the verb itself is dropped
  return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase the book was great:

>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

Swapping noun cardinals


In a chunk, a cardinal word, tagged as CD, refers to a number, such as 10. These cardinals often occur before or after a noun. For normalization purposes, it can be useful to always put the cardinal before the noun.

How to do it...

The swap_noun_cardinal() function is defined in transforms.py. It swaps any cardinal that occurs immediately after a noun with the noun so that the cardinal occurs immediately before the noun. It uses a helper function, tag_equals(), which is similar to tag_startswith(), but in this case, the function it returns does an equality comparison with the given tag:

def tag_equals(tag):
  def f(wt):
    return wt[1] == tag
  return f

Now we can define swap_noun_cardinal():

def swap_noun_cardinal(chunk):
  cdidx = first_chunk_index(chunk, tag_equals('CD'))
  # cdidx must be > 0 and there must be a noun immediately before it
  if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
    return chunk

  # swap the noun and the cardinal in place
  noun, nntag = chunk[cdidx-1]
  chunk[cdidx-1] = chunk[cdidx]
  chunk[cdidx] = (noun, nntag)
  return chunk
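For example, swapping a date word and the number that follows it (an illustrative example; the tags are what a treebank-style tagger would assign):

>>> swap_noun_cardinal([('Dec.', 'NNP'), ('10', 'CD')])
[('10', 'CD'), ('Dec.', 'NNP')]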

Swapping infinitive phrases


An infinitive phrase has the form A of B, such as book of recipes. These can often be transformed into a new form while retaining the same meaning, such as recipes book.

How to do it...

An infinitive phrase can be found by looking for a word tagged with IN. The swap_infinitive_phrase() function, defined in transforms.py, will return a chunk that swaps the portion of the phrase after the IN word with the portion before the IN word:

def swap_infinitive_phrase(chunk):
  def inpred(wt):
    word, tag = wt
    return tag == 'IN' and word != 'like'

  inidx = first_chunk_index(chunk, inpred)

  if inidx is None:
    return chunk

  nnidx = first_chunk_index(chunk, tag_startswith('NN'), start=inidx, step=-1) or 0
  return chunk[:nnidx] + chunk[inidx+1:] + chunk[nnidx:inidx]

The function can now be used to transform book of recipes into recipes book:

>>> from transforms import swap_infinitive_phrase
>>> swap_infinitive_phrase([('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')])
[('recipes', 'NNS'), ('book', 'NN')]

Singularizing plural nouns


As we saw in the previous recipe, the transformation process can result in phrases such as recipes book. This is an NNS followed by an NN, when a more proper version of the phrase would be recipe book, which is an NN followed by another NN. We can do another transform to correct these improper plural nouns.

How to do it...

The transforms.py script defines a function called singularize_plural_noun() which will depluralize a plural noun (tagged with NNS) that is followed by another noun:

def singularize_plural_noun(chunk):
  nnsidx = first_chunk_index(chunk, tag_equals('NNS'))

  # only act when the plural noun is immediately followed by another noun
  if nnsidx is not None and nnsidx+1 < len(chunk) and chunk[nnsidx+1][1][:2] == 'NN':
    noun, nnstag = chunk[nnsidx]
    # naive depluralization: strip the trailing 's' from the word
    # and the trailing 'S' from the tag (NNS -> NN)
    chunk[nnsidx] = (noun.rstrip('s'), nnstag.rstrip('S'))

  return chunk

Using it on recipes book, we get the more correct form, recipe book:

>>> singularize_plural_noun([('recipes', 'NNS'), ('book', 'NN')])
[('recipe', 'NN'), ('book', 'NN')]

How it works...

We start...

Chaining chunk transformations


The transform functions defined in the previous recipes can be chained together to normalize chunks. The resulting chunks are often shorter with no loss of meaning.

How to do it...

In transforms.py is the function transform_chunk(). It takes a single chunk and an optional list of transform functions. It calls each transform function on the chunk, one at a time, and returns the final chunk:

def transform_chunk(chunk, chain=[filter_insignificant, swap_verb_phrase, swap_infinitive_phrase, singularize_plural_noun], trace=0):
  for f in chain:
    chunk = f(chunk)

    if trace:
      print(f.__name__, ':', chunk)

  return chunk

Using it on the phrase the book of recipes is delicious, we get delicious recipe book:

>>> from transforms import transform_chunk
>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')])
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
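Passing trace=1 prints the chunk after each transform function, which shows how the chain arrives at its result. Assuming the sketches from the earlier recipes, the output would look like this:

>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')], trace=1)
filter_insignificant : [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
swap_verb_phrase : [('delicious', 'JJ'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')]
swap_infinitive_phrase : [('delicious', 'JJ'), ('recipes', 'NNS'), ('book', 'NN')]
singularize_plural_noun : [('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]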

How it works...

The...

Converting a chunk tree to text


At some point, you may want to convert a Tree or subtree back to a sentence or chunk string. This is mostly straightforward, except when it comes to properly outputting punctuation.

How to do it...

We'll use the first tree of the treebank_chunk corpus as our example. The obvious first step is to join all the words in the tree with a space:

>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> ' '.join([w for w, t in tree.leaves()])
'Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .'

As you can see, the punctuation isn't quite right: the commas and period are treated as individual words, and so get surrounding spaces as well. We can fix this with regular expression substitution, which is implemented in the chunk_tree_to_sent() function found in transforms.py:

import re
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
  s = concat.join([w for w, t in tree.leaves()])
  # delete the space before each captured punctuation character
  return re.sub(punct_re, r'\g<1>', s)
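Applying it to the same tree now produces properly spaced punctuation (output assuming the completion of the function shown above):

>>> from transforms import chunk_tree_to_sent
>>> chunk_tree_to_sent(tree)
'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'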

Flattening a deep tree


Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.

Getting ready

We're going to use the first parsed sentence of the treebank corpus as our example. Drawing this tree (for example, with its draw() method) shows just how deeply the phrases are nested.

You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.

How to do it...

In transforms.py is a function named flatten_deeptree(). It takes a single Tree and will return a new Tree that keeps only the lowest-level trees. It uses a helper function, flatten_childtrees(), to do most of the work:

from nltk.tree import Tree
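A sketch of these two functions, consistent with the description above (keep height-3 subtrees as chunks, recurse into anything deeper, and pull bare word/tag pairs out of anything shallower), might look like this:

def flatten_childtrees(trees):
  children = []

  for t in trees:
    if t.height() < 3:
      # height < 3 means only (word, tag) leaves: keep the pairs
      children.extend(t.pos())
    elif t.height() == 3:
      # height == 3 is a lowest-level phrase: keep it as a chunk
      children.append(Tree(t.label(), t.pos()))
    else:
      # anything deeper is flattened recursively
      children.extend(flatten_childtrees([c for c in t]))

  return children

def flatten_deeptree(tree):
  return Tree(tree.label(), flatten_childtrees([c for c in tree]))

Every subtree of the result has a height of 3, which makes the tree usable for chunker training (abbreviated output, assuming the sketch above):

>>> from transforms import flatten_deeptree
>>> from nltk.corpus import treebank
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), ...])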

Creating a shallow tree


In the previous recipe, we flattened a deep Tree by only keeping the lowest level subtrees. In this recipe, we'll keep only the highest level subtrees instead.

How to do it...

We'll be using the first parsed sentence from the treebank corpus as our example. Recall from the previous recipe how deeply nested this Tree is.

The shallow_tree() function defined in transforms.py eliminates all the nested subtrees, keeping only the top subtree labels:

from nltk.tree import Tree

def shallow_tree(tree):
  children = []

  for t in tree:
    if t.height() < 3:
      # height < 3 means only (word, tag) leaves: keep the pairs
      children.extend(t.pos())
    else:
      # collapse each top-level subtree down to its (word, tag) pairs
      children.append(Tree(t.label(), t.pos()))

  return Tree(tree.label(), children)

Using it on the first parsed sentence in treebank results in a Tree with only two subtrees:

>>> from transforms import shallow_tree
>>> from nltk.corpus import treebank
>>> shallow_tree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP-SBJ', [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), (...
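Since every remaining subtree is now just a phrase label over (word, tag) pairs, the whole tree has a height of 3, which you can verify (assuming the treebank import above):

>>> shallow_tree(treebank.parsed_sents()[0]).height()
3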

Converting tree labels


As you've seen in previous recipes, parse trees often have a variety of Tree label types that are not present in chunk trees. If you want to use parse trees to train a chunker, then you'll probably want to reduce this variety by converting some of these tree labels to more common label types.

Getting ready

First, we have to decide which Tree labels need to be converted. Looking again at the first parsed Tree from the treebank corpus, you can see two alternative NP subtree labels: NP-SBJ and NP-TMP. Let's convert both of those to NP. The mapping will be as follows:

Original Label    New Label
NP-SBJ            NP
NP-TMP            NP

How to do it...

In transforms.py is the function convert_tree_labels(). It takes two arguments: the Tree to convert and a label conversion mapping. It returns a new Tree with all matching labels replaced based on the values in the mapping:

from nltk.tree import Tree

def convert_tree_labels(tree, mapping):
  children = []

  for t in tree:
    if isinstance(t, Tree):
      # convert labels recursively in each subtree
      children.append(convert_tree_labels(t, mapping))
    else:
      children.append(t)

  # replace the tree's own label if it appears in the mapping
  label = mapping.get(tree.label(), tree.label())
  return Tree(label, children)
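A usage sketch, assuming the completion above:

>>> from transforms import convert_tree_labels
>>> from nltk.corpus import treebank
>>> mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
>>> converted = convert_tree_labels(treebank.parsed_sents()[0], mapping)
>>> 'NP-SBJ' in [t.label() for t in converted.subtrees()]
False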