Chapter 6. Transforming Chunks and Trees

In this chapter, we will cover the following recipes:

  • Filtering insignificant words from a sentence

  • Correcting verb forms

  • Swapping verb phrases

  • Swapping noun cardinals

  • Swapping infinitive phrases

  • Singularizing plural nouns

  • Chaining chunk transformations

  • Converting a chunk tree to text

  • Flattening a deep tree

  • Creating a shallow tree

  • Converting tree labels

Introduction


Now that you know how to get chunks/phrases from a sentence, what do you do with them? This chapter will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees. The functions detailed in these recipes modify data, as opposed to learning from it. This means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this chapter, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this chapter is what you can do with a chunk, not where it came from.

Filtering insignificant words from a sentence


Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase the movie was terrible, the most significant words are movie and terrible, while the and was are almost useless. You could get the same meaning if you took them out, that is, movie terrible or terrible movie. Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words and keep the significant ones by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT
Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix. Refer to Appendix A, Penn Treebank...
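The filter_insignificant() function in transforms.py (used again later in the Chaining chunk transformations recipe) implements this idea. A minimal sketch of how it might look, assuming a tag_suffixes keyword argument listing the suffixes to filter out:

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
  good = []

  for word, tag in chunk:
    ok = True

    # drop the word if its tag ends with any insignificant suffix
    for suffix in tag_suffixes:
      if tag.endswith(suffix):
        ok = False
        break

    if ok:
      good.append((word, tag))

  return good

With this sketch, the movie review example behaves as expected:

>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]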

Correcting verb forms


It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?" The verb is should only be used with singular nouns, while are is for plural nouns, such as children. We can correct these mistakes by creating verb correction mappings that are applied depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular and another for singular to plural:

plural_verb_forms = {
  ('is', 'VBZ'): ('are', 'VBP'),
  ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
  ('are', 'VBP'): ('is', 'VBZ'),
  ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics of mapping is to are, was to were, and vice versa.

How to do it...

In transforms.py is a function called correct_verbs...
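A sketch of how correct_verbs() might work, together with the first_chunk_index() and tag_startswith() helpers it relies on (an illustrative implementation consistent with how these helpers are used in the rest of this chapter, not a verbatim listing):

def first_chunk_index(chunk, pred, start=0, step=1):
  l = len(chunk)
  end = l if step > 0 else -1

  # scan the chunk in the given direction for the first matching element
  for i in range(start, end, step):
    if pred(chunk[i]):
      return i

  return None

def tag_startswith(prefix):
  def f(wt):
    return wt[1].startswith(prefix)
  return f

def correct_verbs(chunk):
  vbidx = first_chunk_index(chunk, tag_startswith('VB'))
  # if no verb is found, do nothing
  if vbidx is None:
    return chunk

  verb, vbtag = chunk[vbidx]
  nnpred = tag_startswith('NN')
  # look for the nearest noun to the right of the verb,
  # falling back to the nearest noun on the left
  nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)

  if nnidx is None:
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)

  # if no noun is found at all, do nothing
  if nnidx is None:
    return chunk

  noun, nntag = chunk[nnidx]
  # plural noun tags end with S; pick the matching correction mapping
  if nntag.endswith('S'):
    chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
  else:
    chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

  return chunk

Under this sketch, the is our children learning? example corrects as expected:

>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]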

Swapping verb phrases


Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, the book was great can be transformed into the great book. This kind of normalization can also help with frequency analysis, by counting two apparently different phrases as the same phrase.

How to do it...

In transforms.py is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
  # the pivot verb must not be a gerund (VBG), and its tag must be longer
  # than plain 'VB', so bare infinitives don't match
  def vbpred(wt):
    word, tag = wt
    return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2

  vbidx = first_chunk_index(chunk, vbpred)

  if vbidx is None:
    return chunk

  # everything after the verb, followed by everything before it;
  # the verb itself is dropped
  return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase the book was great:

>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

Swapping noun cardinals


In a chunk, a cardinal word, tagged as CD, refers to a number, such as 10. These cardinals often occur before or after a noun. For normalization purposes, it can be useful to always put the cardinal before the noun.

How to do it...

The swap_noun_cardinal() function is defined in transforms.py. It swaps any cardinal that occurs immediately after a noun with the noun so that the cardinal occurs immediately before the noun. It uses a helper function, tag_equals(), which is similar to tag_startswith(), but in this case, the function it returns does an equality comparison with the given tag:

def tag_equals(tag):
  def f(wt):
    return wt[1] == tag
  return f

Now we can define swap_noun_cardinal():

def swap_noun_cardinal(chunk):
  cdidx = first_chunk_index(chunk, tag_equals('CD'))
  # cdidx must be > 0 and there must be a noun immediately before it
  if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
    return chunk

  # swap the noun and the cardinal in place
  noun, nntag = chunk[cdidx-1]
  chunk[cdidx-1] = chunk[cdidx]
  chunk[cdidx] = (noun, nntag)
  return chunk
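For example, swapping a date word and the number that follows it (an illustrative example; the tags are what a treebank-style tagger would assign):

>>> swap_noun_cardinal([('Dec.', 'NNP'), ('10', 'CD')])
[('10', 'CD'), ('Dec.', 'NNP')]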

Swapping infinitive phrases


An infinitive phrase has the form A of B, such as book of recipes. These can often be transformed into a new form while retaining the same meaning, such as recipes book.

How to do it...

An infinitive phrase can be found by looking for a word tagged with IN. The swap_infinitive_phrase() function, defined in transforms.py, will return a chunk that swaps the portion of the phrase after the IN word with the portion before the IN word:

def swap_infinitive_phrase(chunk):
  def inpred(wt):
    word, tag = wt
    return tag == 'IN' and word != 'like'

  inidx = first_chunk_index(chunk, inpred)

  if inidx is None:
    return chunk

  nnidx = first_chunk_index(chunk, tag_startswith('NN'), start=inidx, step=-1) or 0
  return chunk[:nnidx] + chunk[inidx+1:] + chunk[nnidx:inidx]

The function can now be used to transform book of recipes into recipes book:

>>> from transforms import swap_infinitive_phrase
>>> swap_infinitive_phrase([('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')])
[('recipes', 'NNS'), ('book', 'NN')]

Singularizing plural nouns


As we saw in the previous recipe, the transformation process can result in phrases such as recipes book. This is an NNS followed by an NN, when a more proper version of the phrase would be recipe book, which is an NN followed by another NN. We can do another transform to correct these improper plural nouns.

How to do it...

The transforms.py script defines a function called singularize_plural_noun() which will depluralize a plural noun (tagged with NNS) that is followed by another noun:

def singularize_plural_noun(chunk):
  nnsidx = first_chunk_index(chunk, tag_equals('NNS'))

  # only act when the plural noun is immediately followed by another noun
  if nnsidx is not None and nnsidx+1 < len(chunk) and chunk[nnsidx+1][1][:2] == 'NN':
    noun, nnstag = chunk[nnsidx]
    # naive depluralization: strip the trailing 's' from the word
    # and the trailing 'S' from the tag (NNS -> NN)
    chunk[nnsidx] = (noun.rstrip('s'), nnstag.rstrip('S'))

  return chunk

Using it on recipes book, we get the more correct form, recipe book:

>>> singularize_plural_noun([('recipes', 'NNS'), ('book', 'NN')])
[('recipe', 'NN'), ('book', 'NN')]

How it works...

We start...

Chaining chunk transformations


The transform functions defined in the previous recipes can be chained together to normalize chunks. The resulting chunks are often shorter with no loss of meaning.

How to do it...

In transforms.py is the function transform_chunk(). It takes a single chunk and an optional list of transform functions. It calls each transform function on the chunk, one at a time, and returns the final chunk:

def transform_chunk(chunk, chain=[filter_insignificant, swap_verb_phrase, swap_infinitive_phrase, singularize_plural_noun], trace=0):
  for f in chain:
    chunk = f(chunk)

    if trace:
      print(f.__name__, ':', chunk)

  return chunk

Using it on the phrase the book of recipes is delicious, we get delicious recipe book:

>>> from transforms import transform_chunk
>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')])
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
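Passing trace=1 prints the chunk after each transform function, which shows how the chain arrives at its result. Assuming the sketches from the earlier recipes, the output would look like this:

>>> transform_chunk([('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')], trace=1)
filter_insignificant : [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
swap_verb_phrase : [('delicious', 'JJ'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')]
swap_infinitive_phrase : [('delicious', 'JJ'), ('recipes', 'NNS'), ('book', 'NN')]
singularize_plural_noun : [('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]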

How it works...

The...

Converting a chunk tree to text


At some point, you may want to convert a Tree or subtree back to a sentence or chunk string. This is mostly straightforward, except when it comes to properly outputting punctuation.

How to do it...

We'll use the first tree of the treebank_chunk corpus as our example. The obvious first step is to join all the words in the tree with a space:

>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> ' '.join([w for w, t in tree.leaves()])
'Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .'

As you can see, the punctuation isn't quite right: the commas and period are treated as individual words, and so get surrounding spaces as well. We can fix this with regular expression substitution, which is implemented in the chunk_tree_to_sent() function found in transforms.py:

import re
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
  s = concat.join([w for w, t in tree.leaves()])
  # delete the space before each captured punctuation character
  return re.sub(punct_re, r'\g<1>', s)
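Applying it to the same tree now produces properly spaced punctuation (output assuming the completion of the function shown above):

>>> from transforms import chunk_tree_to_sent
>>> chunk_tree_to_sent(tree)
'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'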

Flattening a deep tree


Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.

Getting ready

We're going to use the first parsed sentence of the treebank corpus as our example. Drawing this tree (for example, with its draw() method) shows just how deeply the phrases are nested.

You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.

How to do it...

In transforms.py is a function named flatten_deeptree(). It takes a single Tree and will return a new Tree that keeps only the lowest-level trees. It uses a helper function, flatten_childtrees(), to do most of the work:

from nltk.tree import Tree
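A sketch of these two functions, consistent with the description above (keep height-3 subtrees as chunks, recurse into anything deeper, and pull bare word/tag pairs out of anything shallower), might look like this:

def flatten_childtrees(trees):
  children = []

  for t in trees:
    if t.height() < 3:
      # height < 3 means only (word, tag) leaves: keep the pairs
      children.extend(t.pos())
    elif t.height() == 3:
      # height == 3 is a lowest-level phrase: keep it as a chunk
      children.append(Tree(t.label(), t.pos()))
    else:
      # anything deeper is flattened recursively
      children.extend(flatten_childtrees([c for c in t]))

  return children

def flatten_deeptree(tree):
  return Tree(tree.label(), flatten_childtrees([c for c in tree]))

Every subtree of the result has a height of 3, which makes the tree usable for chunker training (abbreviated output, assuming the sketch above):

>>> from transforms import flatten_deeptree
>>> from nltk.corpus import treebank
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), ...])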

Creating a shallow tree


In the previous recipe, we flattened a deep Tree by only keeping the lowest level subtrees. In this recipe, we'll keep only the highest level subtrees instead.

How to do it...

We'll be using the first parsed sentence from the treebank corpus as our example. Recall from the previous recipe how deeply nested this Tree is.

The shallow_tree() function defined in transforms.py eliminates all the nested subtrees, keeping only the top subtree labels:

from nltk.tree import Tree

def shallow_tree(tree):
  children = []

  for t in tree:
    if t.height() < 3:
      # height < 3 means only (word, tag) leaves: keep the pairs
      children.extend(t.pos())
    else:
      # collapse each top-level subtree down to its (word, tag) pairs
      children.append(Tree(t.label(), t.pos()))

  return Tree(tree.label(), children)

Using it on the first parsed sentence in treebank results in a Tree with only two subtrees:

>>> from transforms import shallow_tree
>>> from nltk.corpus import treebank
>>> shallow_tree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP-SBJ', [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), (...
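Since every remaining subtree is now just a phrase label over (word, tag) pairs, the whole tree has a height of 3, which you can verify (assuming the treebank import above):

>>> shallow_tree(treebank.parsed_sents()[0]).height()
3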

Converting tree labels


As you've seen in previous recipes, parse trees often have a variety of Tree label types that are not present in chunk trees. If you want to use parse trees to train a chunker, then you'll probably want to reduce this variety by converting some of these tree labels to more common label types.

Getting ready

First, we have to decide which Tree labels need to be converted. Looking again at the first parsed Tree from the treebank corpus, you can see two alternative NP subtree labels: NP-SBJ and NP-TMP. Let's convert both of those to NP. The mapping will be as follows:

Original Label    New Label
NP-SBJ            NP
NP-TMP            NP

How to do it...

In transforms.py is the function convert_tree_labels(). It takes two arguments: the Tree to convert and a label conversion mapping. It returns a new Tree with all matching labels replaced based on the values in the mapping:

from nltk.tree import Tree

def convert_tree_labels(tree, mapping):
  children = []

  for t in tree:
    if isinstance(t, Tree):
      # convert labels recursively in each subtree
      children.append(convert_tree_labels(t, mapping))
    else:
      children.append(t)

  # replace the tree's own label if it appears in the mapping
  label = mapping.get(tree.label(), tree.label())
  return Tree(label, children)
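A usage sketch, assuming the completion above:

>>> from transforms import convert_tree_labels
>>> from nltk.corpus import treebank
>>> mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
>>> converted = convert_tree_labels(treebank.parsed_sents()[0], mapping)
>>> 'NP-SBJ' in [t.label() for t in converted.subtrees()]
False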