Python Text Processing with NLTK 2: Transforming Chunks and Trees

Jacob Perkins

December 2010


Python Text Processing with NLTK 2.0 Cookbook

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible



This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees.

The functions detailed in these recipes modify data, as opposed to learning from it. That means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You could get the same meaning if you took them out, such as "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word      Tag
a         DT
an        DT
the       DT
and       CC
or        CC
but       CC
all       PDT
half      PDT
that      WDT
which     WDT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix.

How to do it...

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. By default, it filters out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []

    for word, tag in chunk:
        ok = True

        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break

        if ok:
            good.append((word, tag))

    return good

Now we can use it on the part-of-speech tagged version of "the terrible movie".

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]

As you can see, the word "the" is eliminated from the chunk.

How it works...

filter_insignificant() iterates over the tagged words in the chunk. For each word, it checks whether the tag ends with any of the tag_suffixes. If it does, the tagged word is skipped. Otherwise, the tagged word is appended to a new good chunk, which is returned.

There's more...

The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Following is an example of this function:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.
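As a rough sketch of that combination (Python 3, with a tiny hand-picked stopword set standing in for NLTK's much larger stopwords corpus; the names filter_stopwords and STOPWORDS are mine, not the book's):

```python
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    # keep only tagged words whose tag does not end with any given suffix
    return [(word, tag) for word, tag in chunk
            if not any(tag.endswith(s) for s in tag_suffixes)]

# a tiny stand-in stopword set; NLTK's stopwords corpus is much larger
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'was', 'is'}

def filter_stopwords(chunk):
    # drop stopwords regardless of tag
    return [(word, tag) for word, tag in chunk if word.lower() not in STOPWORDS]

chunk = [('the', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('terrible', 'JJ')]
# tag filtering alone misses 'was' (VBD does not end with DT or CC);
# stopword filtering catches it
print(filter_stopwords(filter_insignificant(chunk)))
# [('movie', 'NN'), ('terrible', 'JJ')]
```

Here the two filters are complementary: the tag filter removes whole word classes, while the stopword filter removes specific words the tag filter lets through.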

Correcting verb forms

It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?". The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are used depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular, and another for singular to plural.

plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.

How to do it...

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you'll get a corrected chunk back. It uses a helper function first_chunk_index() to search the chunk for the position of the first tagged word for which pred returns True.

def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1

    for i in range(start, end, step):
        if pred(chunk[i]):
            return i

    return None

def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, lambda (word, tag): tag.startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk

    verb, vbtag = chunk[vbidx]
    nnpred = lambda (word, tag): tag.startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)
    # if no noun found, do nothing
    if nnidx is None:
        return chunk

    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

    return chunk

When we call it on a part-of-speech tagged "is our children learning" chunk, we get back the correct form, "are our children learning".

>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]

We can also try this with a singular noun and an incorrect plural verb.

>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]

In this case, "were" becomes "was" because "child" is a singular noun.

How it works...

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned unchanged. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb for the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural.

Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you'll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.
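Since tuple-unpacking lambdas like lambda (word, tag) are Python 2 only, here is a Python 3-compatible rendition of the utility, with the predicate taking the (word, tag) pair as a single argument, plus a backwards-search example:

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    # scan tagged words from start, moving by step, until pred matches
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

chunk = [('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
# forward search: index of the first noun
print(first_chunk_index(chunk, lambda wt: wt[1].startswith('NN')))
# 1
# backward search from the end: index of the first determiner, right to left
print(first_chunk_index(chunk, lambda wt: wt[1] == 'DT',
                        start=len(chunk) - 1, step=-1))
# 0
```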

Swapping verb phrases

Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, "the book was great" can be transformed into "the great book".

How to do it...

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
    # find location of verb
    vbpred = lambda (word, tag): tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = first_chunk_index(chunk, vbpred)

    if vbidx is None:
        return chunk

    return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase "the book was great".

>>> from transforms import swap_verb_phrase
>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'),
('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

The result is "great the book". This phrase clearly isn't grammatically correct, so read on to learn how to fix it.

How it works...

Using first_chunk_index() from the previous recipe, we start by finding the first matching verb that is not a gerund (a word ending in "ing", tagged with VBG). Once we've found the verb, we return the chunk with the right side placed before the left side, with the verb itself removed.

The reason we don't want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here's an example where you can see how not pivoting around a gerund is a good thing:

>>> swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')])
[('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]

If we had pivoted around the gerund, the result would be "book is fantastic this", and we'd lose the gerund "gripping".

There's more...

Filtering insignificant words makes the final result more readable. By filtering either before or after swap_verb_phrase(), we get "fantastic gripping book" instead of "fantastic this gripping book".

>>> from transforms import swap_verb_phrase, filter_insignificant
>>> swap_verb_phrase(filter_insignificant([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]
>>> filter_insignificant(swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]

Either way, we get a shorter grammatical chunk with no loss of meaning.


Swapping noun cardinals

In a chunk, a cardinal word—tagged as CD—refers to a number, such as "10". These cardinals often occur before or after a noun. For normalization purposes, it can be useful to always put the cardinal before the noun.

How to do it...

The function swap_noun_cardinal() is defined in transforms.py. It swaps any cardinal that occurs immediately after a noun with the noun, so that the cardinal occurs immediately before the noun.

def swap_noun_cardinal(chunk):
    cdidx = first_chunk_index(chunk, lambda (word, tag): tag == 'CD')
    # cdidx must be > 0 and there must be a noun immediately before it
    if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
        return chunk

    noun, nntag = chunk[cdidx-1]
    chunk[cdidx-1] = chunk[cdidx]
    chunk[cdidx] = noun, nntag
    return chunk

Let's try it on a date, such as "Dec 10", and another common phrase "the top 10".

>>> from transforms import swap_noun_cardinal
>>> swap_noun_cardinal([('Dec.', 'NNP'), ('10', 'CD')])
[('10', 'CD'), ('Dec.', 'NNP')]
>>> swap_noun_cardinal([('the', 'DT'), ('top', 'NN'), ('10', 'CD')])
[('the', 'DT'), ('10', 'CD'), ('top', 'NN')]

The result is that the numbers are now in front of the noun, creating "10 Dec" and "the 10 top".

How it works...

We start by looking for a CD tag in the chunk. If no CD is found, or if the CD is at the beginning of the chunk, then the chunk is returned as is. There must also be a noun immediately before the CD. If we do find a CD with a noun preceding it, then we swap the noun and cardinal in place.

Swapping infinitive phrases

An infinitive phrase has the form "A of B", such as "book of recipes". These can often be transformed into a new form while retaining the same meaning, such as "recipes book".

How to do it...

An infinitive phrase can be found by looking for a word tagged with IN. The function swap_infinitive_phrase(), defined in transforms.py, will return a chunk that swaps the portion of the phrase after the IN word with the portion before the IN word.

def swap_infinitive_phrase(chunk):
    inpred = lambda (word, tag): tag == 'IN' and word != 'like'
    inidx = first_chunk_index(chunk, inpred)

    if inidx is None:
        return chunk

    nnpred = lambda (word, tag): tag.startswith('NN')
    nnidx = first_chunk_index(chunk, nnpred, start=inidx, step=-1) or 0

    return chunk[:nnidx] + chunk[inidx+1:] + chunk[nnidx:inidx]

The function can now be used to transform "book of recipes" into "recipes book".

>>> from transforms import swap_infinitive_phrase
>>> swap_infinitive_phrase([('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')])
[('recipes', 'NNS'), ('book', 'NN')]

How it works...

This function is similar to the swap_verb_phrase() function described in the Swapping verb phrases recipe. The inpred lambda is passed to first_chunk_index() to look for a word whose tag is IN. Next, nnpred is used to find the first noun that occurs before the IN word, so we can insert the portion of the chunk after the IN word between the noun and the beginning of the chunk. A more complicated example should demonstrate this:

>>> swap_infinitive_phrase([('delicious', 'JJ'), ('book', 'NN'),
('of', 'IN'), ('recipes', 'NNS')])
[('delicious', 'JJ'), ('recipes', 'NNS'), ('book', 'NN')]

We don't want the result to be "recipes delicious book". Instead, we want to insert "recipes" before the noun "book", but after the adjective "delicious". Hence, the need to find the nnidx occurring before the inidx.

There's more...

You'll notice that the inpred lambda checks to make sure the word is not "like". That's because "like" phrases must be treated differently, as transforming them the same way will result in an ungrammatical phrase. For example, "tastes like chicken" should not be transformed into "chicken tastes":

>>> swap_infinitive_phrase([('tastes', 'VBZ'), ('like', 'IN'),
('chicken', 'NN')])
[('tastes', 'VBZ'), ('like', 'IN'), ('chicken', 'NN')]

Singularizing plural nouns

As we saw in the previous recipe, the transformation process can result in phrases such as "recipes book". This is a NNS followed by an NN, when a more proper version of the phrase would be "recipe book", which is an NN followed by another NN. We can do another transform to correct these improper plural nouns.

How to do it...

transforms.py defines a function called singularize_plural_noun(), which will de-pluralize a plural noun (tagged with NNS) that is followed by another noun.

def singularize_plural_noun(chunk):
    nnspred = lambda (word, tag): tag == 'NNS'
    nnsidx = first_chunk_index(chunk, nnspred)

    if nnsidx is not None and nnsidx+1 < len(chunk) and chunk[nnsidx+1][1][:2] == 'NN':
        noun, nnstag = chunk[nnsidx]
        chunk[nnsidx] = (noun.rstrip('s'), nnstag.rstrip('S'))

    return chunk

Using it on "recipes book", we get the more correct form, "recipe book".

>>> from transforms import singularize_plural_noun
>>> singularize_plural_noun([('recipes', 'NNS'), ('book', 'NN')])
[('recipe', 'NN'), ('book', 'NN')]

How it works...

We start by looking for a plural noun with the tag NNS. If found, and if the next word is a noun (determined by making sure the tag starts with NN), then we de-pluralize the plural noun by removing an "s" from the right side of both the tag and the word.

The tag is assumed to be capitalized, so an uppercase "S" is removed from the right side of the tag, while a lowercase "s" is removed from the right side of the word.
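One caveat worth noting (my observation, not the book's): rstrip('s') removes every trailing "s", so it can mangle words that legitimately end in "s". A small sketch, where safer_singularize is a hypothetical alternative, not a function from the book:

```python
def naive_singularize(word):
    # the recipe's approach: strip trailing 's' characters
    return word.rstrip('s')

def safer_singularize(word):
    # remove at most one trailing 's', and leave words ending in 'ss' alone
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

print(naive_singularize('recipes'))   # 'recipe' -- fine
print(naive_singularize('glass'))     # 'gla'    -- not a plural at all
print(safer_singularize('glass'))     # 'glass'  -- left intact
print(safer_singularize('recipes'))   # 'recipe'
```

Even the safer version is only a heuristic; proper singularization of irregular plurals ("children", "geese") needs a lemmatizer.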

Chaining chunk transformations

The transform functions defined in the previous recipes can be chained together to normalize chunks. The resulting chunks are often shorter with no loss of meaning.

How to do it...

In transforms.py is the function transform_chunk(). It takes a single chunk and an optional list of transform functions. It calls each transform function on the chunk, one at a time, and returns the final chunk.

def transform_chunk(chunk, chain=[filter_insignificant, swap_verb_phrase, swap_infinitive_phrase, singularize_plural_noun], trace=0):
    for f in chain:
        chunk = f(chunk)

        if trace:
            print f.__name__, ':', chunk

    return chunk

Using it on the phrase "the book of recipes is delicious", we get "delicious recipe book":

>>> from transforms import transform_chunk
>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')])
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]

How it works...

The transform_chunk() function defaults to chaining the following functions in order:

  • filter_insignificant()
  • swap_verb_phrase()
  • swap_infinitive_phrase()
  • singularize_plural_noun()

Each function transforms the chunk that results from the previous function, starting with the original chunk.

The order in which you apply transform functions can be significant. Experiment with your own data to determine which transforms are best, and in which order they should be applied.
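The chaining itself is just a left fold over the list of functions. A hedged Python 3 sketch using functools.reduce, with toy transforms (drop_dt and take_two are made-up stand-ins for the real ones) to show that order matters:

```python
from functools import reduce

def transform_chunk(chunk, chain):
    # apply each transform to the output of the previous one
    return reduce(lambda c, f: f(c), chain, chunk)

# toy transforms, purely for illustration
drop_dt = lambda chunk: [(w, t) for w, t in chunk if t != 'DT']
take_two = lambda chunk: chunk[:2]

chunk = [('the', 'DT'), ('book', 'NN'), ('was', 'VBD')]
print(transform_chunk(chunk, [drop_dt, take_two]))
# [('book', 'NN'), ('was', 'VBD')]
print(transform_chunk(chunk, [take_two, drop_dt]))
# [('book', 'NN')]
```

Swapping the order changes the result, which is exactly why you should experiment with the transform order on your own data.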

There's more...

You can pass trace=1 into transform_chunk() to get an output at each step.

>>> from transforms import transform_chunk
>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')], trace=1)
filter_insignificant : [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
swap_verb_phrase : [('delicious', 'JJ'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')]
swap_infinitive_phrase : [('delicious', 'JJ'), ('recipes', 'NNS'), ('book', 'NN')]
singularize_plural_noun : [('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]

This shows you the result of each transform function, which is then passed in to the next transform function until a final chunk is returned.


Converting a chunk tree to text

At some point, you may want to convert a Tree or sub-tree back to a sentence or chunk string. This is mostly straightforward, except when it comes to properly outputting punctuation.

How to do it...

We'll use the first Tree of the treebank_chunk as our example. The obvious first step is to join all the words in the tree with a space.

>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> ' '.join([w for w, t in tree.leaves()])
'Pierre Vinken , 61 years old , will join the board as a nonexecutive
director Nov. 29 .'

As you can see, the punctuation isn't quite right. The commas and period are treated as individual words, and so get the surrounding spaces as well. We can fix this using regular expression substitution. This is implemented in the chunk_tree_to_sent() function found in transforms.py:

import re
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
    s = concat.join([w for w, t in tree.leaves()])
    return re.sub(punct_re, r'\g<1>', s)

Using this function results in a much cleaner sentence, with no space before each punctuation mark:

>>> from transforms import chunk_tree_to_sent
>>> chunk_tree_to_sent(tree)
'Pierre Vinken, 61 years old, will join the board as a nonexecutive
director Nov. 29.'

How it works...

To correct the extra spaces in front of the punctuation, we create a regular expression punct_re that will match a space followed by any of the known punctuation characters. We have to escape both '.' and '?' with a '\' since they are special characters. The punctuation is surrounded by parentheses so we can use the matched group for substitution.

Once we have our regular expression, we define chunk_tree_to_sent(), whose first step is to join the words by a concatenation character that defaults to a space. Then we can call re.sub() to replace all the punctuation matches with just the punctuation group. This eliminates the space in front of the punctuation characters, resulting in a more correct string.
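The substitution can be tried on any token list, independent of NLTK; a small Python 3 sketch (join_and_fix is a hypothetical name, and the token list is invented for illustration):

```python
import re

# a space followed by a comma, period, semicolon, or question mark
punct_re = re.compile(r'\s([,\.;\?])')

def join_and_fix(words, concat=' '):
    # join tokens, then delete the space a tokenizer left before punctuation
    return punct_re.sub(r'\g<1>', concat.join(words))

words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'retire', '.']
print(join_and_fix(words))
# 'Pierre Vinken, 61 years old, will retire.'
```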

There's more...

We can simplify this function a little by using nltk.tag.untag() to get words from the tree's leaves, instead of using our own list comprehension.

import nltk.tag, re
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
    s = concat.join(nltk.tag.untag(tree.leaves()))
    return re.sub(punct_re, r'\g<1>', s)

Flattening a deep tree

Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.

Getting ready

We're going to use the first parsed sentence of the treebank corpus as our example; it is a deeply nested tree of phrases within phrases.

You may notice that the part-of-speech tags are part of the tree structure, instead of being included with the word. This will be handled next using the Tree.pos() method, which was designed specifically for combining words with pre-terminal Tree nodes such as part-of-speech tags.

How to do it...

In transforms.py there is a function named flatten_deeptree(). It takes a single Tree and will return a new Tree that keeps only the lowest level trees. It uses a helper function flatten_childtrees() to do most of the work.

from nltk.tree import Tree

def flatten_childtrees(trees):
    children = []

    for t in trees:
        if t.height() < 3:
            # a height < 3 tree is a part-of-speech tree: extend with its (word, tag) pairs
            children.extend(t.pos())
        elif t.height() == 3:
            children.append(Tree(t.node, t.pos()))
        else:
            children.extend(flatten_childtrees([c for c in t]))

    return children

def flatten_deeptree(tree):
    return Tree(tree.node, flatten_childtrees([c for c in tree]))

We can use it on the first parsed sentence of the treebank corpus to get a flatter tree:

>>> from nltk.corpus import treebank
>>> from transforms import flatten_deeptree
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',',
','), Tree('NP', [('61', 'CD'), ('years', 'NNS')]), ('old', 'JJ'),
(',', ','), ('will', 'MD'), ('join', 'VB'), Tree('NP', [('the',
'DT'), ('board', 'NN')]), ('as', 'IN'), Tree('NP', [('a', 'DT'),
('nonexecutive', 'JJ'), ('director', 'NN')]), Tree('NP-TMP', [('Nov.',
'NNP'), ('29', 'CD')]), ('.', '.')])

The result is a much flatter Tree that only includes NP phrases. Words that are not part of an NP phrase are kept as separate tagged words.

This Tree is quite similar to the first chunk Tree from the treebank_chunk corpus. The main difference is that in the flattened Tree, the rightmost NP is separated into two sub-trees, one of them named NP-TMP.

How it works...

The solution is composed of two functions: flatten_deeptree() returns a new Tree from the given tree by calling flatten_childtrees() on each of the given tree's children.

flatten_childtrees() is a recursive function that drills down into the Tree until it finds child trees whose height() is equal to or less than three. A Tree whose height() is less than three looks like this:

>>> from nltk.tree import Tree
>>> Tree('NNP', ['Pierre']).height()
2

These short trees are converted into lists of tuples using the pos() function.

>>> Tree('NNP', ['Pierre']).pos()
[('Pierre', 'NNP')]

Trees whose height() is equal to three are the lowest level trees that we're interested in keeping. These trees look like this:

>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).height()
3

When we call pos() on that tree, we get:

>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).pos()
[('Pierre', 'NNP'), ('Vinken', 'NNP')]

The recursive nature of flatten_childtrees() eliminates all trees whose height is greater than three.
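To see the height arithmetic without NLTK installed, here is a minimal stand-in Tree class with similar height() and pos() semantics (an illustration under that assumption, not NLTK's actual implementation):

```python
class Tree:
    # minimal mimic of nltk.tree.Tree, just enough for this illustration
    def __init__(self, node, children):
        self.node, self.children = node, children

    def height(self):
        # a bare leaf counts as height 1, so a preterminal like
        # Tree('NNP', ['Pierre']) has height 2
        return 1 + max(c.height() if isinstance(c, Tree) else 1
                       for c in self.children)

    def pos(self):
        # flatten to (word, tag) pairs; the tag is the word's immediate parent node
        pairs = []
        for c in self.children:
            if isinstance(c, Tree):
                pairs.extend(c.pos())
            else:
                pairs.append((c, self.node))
        return pairs

print(Tree('NNP', ['Pierre']).height())
# 2
np = Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])])
print(np.height())
# 3
print(np.pos())
# [('Pierre', 'NNP'), ('Vinken', 'NNP')]
```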

There's more...

Flattening a deep Tree allows us to call nltk.chunk.util.tree2conlltags() on the flattened Tree, a necessary step to train a chunker. If you try to call this function before flattening the Tree, you get a ValueError exception.

>>> from nltk.chunk.util import tree2conlltags
>>> tree2conlltags(treebank.parsed_sents()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/chunk/util.py", line 417, in tree2conlltags
    raise ValueError, "Tree is too deeply nested to be printed in CoNLL format"
ValueError: Tree is too deeply nested to be printed in CoNLL format

However, after flattening there's no problem:

>>> tree2conlltags(flatten_deeptree(treebank.parsed_sents()[0]))
[('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'), (',', ',',
'O'), ('61', 'CD', 'B-NP'), ('years', 'NNS', 'I-NP'), ('old', 'JJ',
'O'), (',', ',', 'O'), ('will', 'MD', 'O'), ('join', 'VB', 'O'),
('the', 'DT', 'B-NP'), ('board', 'NN', 'I-NP'), ('as', 'IN', 'O'),
('a', 'DT', 'B-NP'), ('nonexecutive', 'JJ', 'I-NP'), ('director',
'NN', 'I-NP'), ('Nov.', 'NNP', 'B-NP-TMP'), ('29', 'CD', 'I-NP-TMP'),
('.', '.', 'O')]

Being able to flatten trees opens up the possibility of training a chunker on corpora consisting of deep parse trees.

CESS-ESP and CESS-CAT treebank

The cess_esp and cess_cat corpora have parsed sentences, but no chunked sentences. In other words, they have deep trees that must be flattened in order to train a chunker. In fact, the trees are so deep that a diagram can't be shown, but the flattening can be demonstrated by showing the height() of the tree before and after flattening.

>>> from nltk.corpus import cess_esp
>>> cess_esp.parsed_sents()[0].height()
>>> flatten_deeptree(cess_esp.parsed_sents()[0]).height()
3

Creating a shallow tree

In the previous recipe, we flattened a deep Tree by only keeping the lowest level sub-trees. In this recipe, we'll keep only the highest level sub-trees instead.

How to do it...

We'll be using the first parsed sentence from the treebank corpus as our example. Recall from the previous recipe that this Tree is deeply nested.

The shallow_tree() function defined in transforms.py eliminates all the nested sub-trees, keeping only the top tree nodes.

from nltk.tree import Tree

def shallow_tree(tree):
    children = []

    for t in tree:
        if t.height() < 3:
            children.extend(t.pos())
        else:
            children.append(Tree(t.node, t.pos()))

    return Tree(tree.node, children)

Using it on the first parsed sentence in treebank results in a Tree with only two sub-trees.

>>> from transforms import shallow_tree
>>> shallow_tree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP-SBJ', [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',',
','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ',')]),
Tree('VP', [('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board',
'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director',
'NN'), ('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

We can see the difference programmatically by comparing the heights of the two trees:

>>> treebank.parsed_sents()[0].height()
>>> shallow_tree(treebank.parsed_sents()[0]).height()
3

As in the previous recipe, the height of the new tree is three so it can be used for training a chunker.

How it works...

The shallow_tree() function iterates over each of the top-level sub-trees in order to create new child trees. If the height() of a sub-tree is less than three, then that sub-tree is replaced by a list of its part-of-speech tagged children. All other sub-trees are replaced by a new Tree whose children are the part-of-speech tagged leaves. This eliminates all nested sub-trees while retaining the top-level sub-trees.

This function is an alternative to flatten_deeptree() from the previous recipe, for when you want to keep the higher level tree nodes and ignore the lower level nodes.

Converting tree nodes

As you've seen in previous recipes, parse trees often have a variety of Tree node types that are not present in chunk trees. If you want to use the parse trees to train a chunker, then you'll probably want to reduce this variety by converting some of these tree nodes to more common node types.

Getting ready

First, we have to decide what Tree nodes need to be converted. Looking at that first Tree again, you can immediately see that there are two alternative NP sub-trees: NP-SBJ and NP-TMP. Let's convert both of those to NP. The mapping will be as follows:

Original Node    New Node
NP-SBJ           NP
NP-TMP           NP

How to do it...

In transforms.py there is a function convert_tree_nodes(). It takes two arguments: the Tree to convert, and a node conversion mapping. It returns a new Tree with all matching nodes replaced based on the values in the mapping.

from nltk.tree import Tree

def convert_tree_nodes(tree, mapping):
    children = []

    for t in tree:
        if isinstance(t, Tree):
            children.append(convert_tree_nodes(t, mapping))
        else:
            children.append(t)

    node = mapping.get(tree.node, tree.node)
    return Tree(node, children)

Using the mapping table shown earlier, we can pass it in as a dict to convert_tree_nodes() and convert the first parsed sentence from treebank.

>>> from transforms import convert_tree_nodes
>>> mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
>>> convert_tree_nodes(treebank.parsed_sents()[0], mapping)
Tree('S', [Tree('NP', [Tree('NP', [Tree('NNP', ['Pierre']),
Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP',
[Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]),
Tree(',', [','])]), Tree('VP', [Tree('MD', ['will']), Tree('VP',
[Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN',
['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP',
[Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN',
['director'])])]), Tree('NP', [Tree('NNP', ['Nov.']), Tree('CD',
['29'])])])]), Tree('.', ['.'])])

In the resulting output, you can see that the NP-* sub-trees have been replaced with NP sub-trees.

How it works...

convert_tree_nodes() recursively converts every child sub-tree using the mapping. The Tree is then rebuilt with the converted node and children until the entire Tree has been converted.

The result is a brand new Tree instance with new sub-trees whose nodes have been converted.
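The replacement hinges on dict.get(key, default), which returns the node itself whenever it is absent from the mapping; a tiny sketch (convert_node is a hypothetical helper, not from the book):

```python
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}

def convert_node(node, mapping):
    # unmapped nodes fall through unchanged thanks to dict.get's default
    return mapping.get(node, node)

for node in ('NP-SBJ', 'NP-TMP', 'VP', 'S'):
    print(node, '->', convert_node(node, mapping))
# NP-SBJ -> NP
# NP-TMP -> NP
# VP -> VP
# S -> S
```

This is why only the NP-* nodes change while VP, S, and every other node survive untouched.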


This article showed you how to do various transforms on both chunks and trees. The functions detailed in these recipes modify data, as opposed to learning from it.
