Chapter 5. Extracting Chunks

In this chapter, we will cover the following recipes:

  • Chunking and chinking with regular expressions

  • Merging and splitting chunks with regular expressions

  • Expanding and removing chunks with regular expressions

  • Partial parsing with regular expressions

  • Training a tagger-based chunker

  • Classification-based chunking

  • Extracting named entities

  • Extracting proper noun chunks

  • Extracting location chunks

  • Training a named entity chunker

  • Training a chunker with NLTK-Trainer

Introduction


Chunk extraction, or partial parsing, is the process of extracting short phrases from a part-of-speech tagged sentence. This is different from full parsing in that we're interested in standalone chunks, or phrases, instead of full parse trees (for more on parse trees, see https://en.wikipedia.org/wiki/Parse_tree). The idea is that meaningful phrases can be extracted from a sentence by looking for particular patterns of part-of-speech tags.

As in Chapter 4, Part-of-speech Tagging, we'll be using the Penn Treebank corpus for basic training and testing chunk extraction. We'll also be using the CoNLL2000 corpus as it has a simpler and more flexible format that supports multiple chunk types (for more details on the conll2000 corpus and IOB tags, see the Creating a chunked phrase corpus recipe in Chapter 3, Creating Custom Corpora).

Chunking and chinking with regular expressions


Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.

A ChunkRule class specifies what to include in a chunk, while a ChinkRule class specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.

Getting ready

We first need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. An individual tag is specified by surrounding angle brackets, such as <NN> to match a noun tag. Multiple tags can then be combined, as in <DT><NN> to match a determiner followed by a noun. Regular expression syntax can be used within the angle brackets to match individual tag patterns, so you can do <NN.*> to match any tag starting with NN...
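
To make this concrete, here is a minimal sketch (the grammar and sentence are illustrative, not the book's exact example). A chunk rule in curly braces pulls matching tag sequences into a chunk, while a chink rule, written with the braces reversed, carves matching words back out:

>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...   {<DT>?<NN.*>+}  # chunk an optional determiner followed by nouns
...   }<NNP>{         # chink any proper nouns back out
... ''')
>>> chunker.parse([('the', 'DT'), ('Packt', 'NNP'), ('book', 'NN')])
Tree('S', [Tree('NP', [('the', 'DT')]), ('Packt', 'NNP'), Tree('NP', [('book', 'NN')])])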

Merging and splitting chunks with regular expressions


In this recipe, we'll cover two more rules for chunking. A MergeRule class can merge two chunks together based on the end of the first chunk and the beginning of the second chunk. A SplitRule class will split a chunk into two chunks based on the specified split pattern.

How to do it...

A SplitRule class is specified as two opposing curly braces with a pattern on either side. To split a chunk after a noun, you would do <NN.*>}{<.*>. A MergeRule class is specified by flipping the curly braces; it will join chunks where the end of the first chunk matches the left pattern and the beginning of the next chunk matches the right pattern. To merge two chunks where the first ends with a noun and the second begins with a noun, you'd use <NN.*>{}<NN.*>.
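
Continuing the same session, here is a hedged sketch of how these rules combine (the grammar and sentence are illustrative): a greedy chunk rule first over-chunks, then split and merge rules clean up the boundaries:

>>> chunker = RegexpParser(r'''
... NP:
...   {<DT><.*>*<NN.*>}  # greedily chunk from determiner to final noun
...   <NN.*>}{<.*>       # split after any noun
...   <.*>}{<DT>         # split before any determiner
...   <NN.*>{}<NN.*>     # merge noun-noun splits back together
... ''')
>>> chunker.parse([('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')])
Tree('S', [Tree('NP', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('NP', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('NP', [('the', 'DT'), ('fish', 'NN')])])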

Note

Note that the order of rules is very important, and reordering can affect the results. The RegexpParser class applies the rules one at a time from top to bottom...

Expanding and removing chunks with regular expressions


There are three RegexpChunkRule subclasses that are not supported by RegexpChunkRule.fromstring() or RegexpParser, and therefore must be created manually if you want to use them. These rules are as follows:

  • ExpandLeftRule: Add unchunked (chink) words to the left of a chunk

  • ExpandRightRule: Add unchunked (chink) words to the right of a chunk

  • UnChunkRule: Unchunk any matching chunk

How to do it...

ExpandLeftRule and ExpandRightRule both take two patterns along with a description as arguments. For ExpandLeftRule, the first pattern is the chink we want to add to the beginning of the chunk, while the right pattern will match the beginning of the chunk we want to expand. With ExpandRightRule, the left pattern should match the end of the chunk we want to expand, and the right pattern matches the chink we want to add to the end of the chunk. The idea is similar to the MergeRule class, but in this case, we're merging chink words instead of other...
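
Because these rules can't be written in a grammar string, they are constructed directly and passed to a RegexpChunkParser. The following is a small sketch (the tag patterns and descriptions here are illustrative):

>>> from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule, UnChunkRule, RegexpChunkParser
>>> ur = ChunkRule('<NN>', 'single noun')
>>> el = ExpandLeftRule('<DT>', '<NN>', 'expand left into determiner')
>>> er = ExpandRightRule('<NN>', '<NNS>', 'expand right into plural noun')
>>> chunker = RegexpChunkParser([ur, el, er])
>>> chunker.parse([('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])
Tree('S', [Tree('NP', [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])])
>>> un = UnChunkRule('<DT><NN.*>*', 'unchunk everything')
>>> RegexpChunkParser([ur, el, er, un]).parse([('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])
Tree('S', [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])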

Partial parsing with regular expressions


So far, we've only been parsing noun phrases. But RegexpParser supports grammars with multiple phrase types, such as verb phrases and prepositional phrases. We can put the rules we've learned to use and define a grammar that can be evaluated against the conll2000 corpus, which has NP, VP, and PP phrases.

How to do it...

Now, we will define a grammar to parse three phrase types. For noun phrases, we have a ChunkRule class that looks for an optional determiner followed by one or more nouns. We then have a MergeRule class for adding an adjective to the front of a noun chunk. For prepositional phrases, we simply chunk any IN word, such as in or on. For verb phrases, we chunk an optional modal word (such as should) followed by a verb.
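
Reconstructed from this description, the grammar would look roughly like the following sketch (rule details may differ from the book's exact grammar). It can then be scored against the conll2000 corpus, whose chunked_sents() provide the gold standard:

>>> chunker = RegexpParser(r'''
... NP:
...   {<DT>?<NN.*>+}  # chunk optional determiner with nouns
...   <JJ>{}<NN.*>    # merge adjective with noun chunk
... PP:
...   {<IN>}          # chunk preposition
... VP:
...   {<MD>?<VB.*>}   # chunk optional modal with verb
... ''')
>>> from nltk.corpus import conll2000
>>> score = chunker.evaluate(conll2000.chunked_sents())

Calling score.accuracy() then reports the IOB tag accuracy.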

Note

Each grammar rule is followed by a # comment. This comment is passed into each rule as the description. Comments are optional, but they can be helpful notes for understanding what the rule does, and will be included in trace...

Training a tagger-based chunker


Training a chunker can be a great alternative to manually specifying regular expression chunk patterns. Instead of a painstaking process of trial and error to get exactly the right patterns, we can use existing corpus data to train chunkers, much like we did for part-of-speech tagging in the previous chapter.

How to do it...

As with part-of-speech tagging, we'll use the treebank corpus data for training. But this time, we'll use the treebank_chunk corpus, which is specifically formatted to produce chunked sentences in the form of trees. Its chunked_sents() method will be used by a TagChunker class to train a tagger-based chunker. The TagChunker class uses a helper function, conll_tag_chunks(), to extract a list of (pos, iob) tuples from a list of Trees. These (pos, iob) tuples are then used to train a tagger in the same way (word, pos) tuples were used in Chapter 4, Part-of-speech Tagging, to train part-of-speech taggers. But instead of learning part-of...
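
The full code lives in chunkers.py; the following is a condensed, self-contained sketch of the approach (the book's actual TagChunker is configurable with different tagger classes; here a unigram/bigram backoff pair is hardwired for illustration):

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger

def conll_tag_chunks(chunk_sents):
  # Flatten each chunk Tree to (word, pos, iob) triples, then keep
  # just the (pos, iob) pairs as tagger training data.
  tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
  return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

class TagChunker(ChunkParserI):
  def __init__(self, train_chunks):
    train_sents = conll_tag_chunks(train_chunks)
    # A bigram tagger backing off to a unigram tagger, treating pos
    # tags as the "words" and IOB tags as the labels to learn.
    unigram = UnigramTagger(train_sents)
    self.tagger = BigramTagger(train_sents, backoff=unigram)

  def parse(self, tagged_sent):
    if not tagged_sent:
      return None
    (words, tags) = zip(*tagged_sent)
    # Tag the pos tag sequence with IOB chunk tags, then rebuild a Tree.
    chunks = self.tagger.tag(tags)
    return conlltags2tree([(w, t, c) for (w, (t, c)) in zip(words, chunks)])

It trains and parses like any other chunker, for example TagChunker(treebank_chunk.chunked_sents()[:3000]).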

Classification-based chunking


Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. That means we can create a ClassifierChunker class that can learn from both the words and part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.

How to do it...

For the ClassifierChunker class, we don't want to discard the words from the training sentences as we did in the previous recipe. Instead, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger class, we convert the (word, pos, iob) 3-tuples from tree2conlltags() into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() function. This code can be found in chunkers.py:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
  tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
  return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
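
The rest of the file is cut off here; the ClassifierChunker class itself can be sketched as follows, with a deliberately simplified feature detector (the book's version extracts a richer feature set around each token; this stand-in is only illustrative):

def simple_feature_detector(tokens, index, history):
  # tokens is a list of (word, pos) pairs; history holds the IOB tags
  # already assigned to earlier tokens in the sentence.
  word, pos = tokens[index]
  prevpos = tokens[index - 1][1] if index > 0 else '<START>'
  previob = history[index - 1] if index > 0 else '<START>'
  nextpos = tokens[index + 1][1] if index < len(tokens) - 1 else '<END>'
  return {'word': word, 'pos': pos, 'prevpos': prevpos,
          'previob': previob, 'nextpos': nextpos}

class ClassifierChunker(ChunkParserI):
  def __init__(self, train_sents, feature_detector=simple_feature_detector, **kwargs):
    train_chunks = chunk_trees2train_chunks(train_sents)
    self.tagger = ClassifierBasedTagger(train=train_chunks,
      feature_detector=feature_detector, **kwargs)

  def parse(self, tagged_sent):
    if not tagged_sent:
      return None
    # Tag the (word, pos) pairs with IOB tags, then rebuild a chunk Tree.
    chunks = self.tagger.tag(tagged_sent)
    return conlltags2tree([(w, t, c) for ((w, t), c) in chunks])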

Extracting named entities


Named entity recognition is a specific kind of chunk extraction that uses entity tags instead of, or in addition to, chunk tags. Common entity tags include PERSON, ORGANIZATION, and LOCATION. Part-of-speech tagged sentences are parsed into chunk trees as with normal chunking, but the labels of the trees can be entity tags instead of chunk phrase tags.

How to do it...

NLTK comes with a pre-trained named entity chunker. This chunker has been trained on data from the ACE program, a National Institute of Standards and Technology (NIST)-sponsored program for Automatic Content Extraction, which you can read more about at http://www.itl.nist.gov/iad/894.01/tests/ace/. Unfortunately, this data is not included in the NLTK corpora, but the trained chunker is. This chunker can be used through the ne_chunk() function in the nltk.chunk module. The ne_chunk() function will chunk a single sentence into a Tree. The following is an example using ne_chunk() on the first tagged sentence...
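
The cut-off example presumably looks like the following (output elided; ne_chunk() returns a Tree whose subtree labels are entity tags such as PERSON, and it requires the maxent_ne_chunker and words data packages, installable with nltk.download()):

>>> from nltk.chunk import ne_chunk
>>> from nltk.corpus import treebank_chunk
>>> ne_chunk(treebank_chunk.tagged_sents()[0])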

Extracting proper noun chunks


A simple way to do named entity extraction is to chunk all proper nouns (tagged with NNP). We can tag these chunks as NAME, since the definition of a proper noun is the name of a person, place, or thing.

How to do it...

Using the RegexpParser class, we can create a very simple grammar that combines all proper nouns into a NAME chunk. Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:

>>> chunker = RegexpParser(r'''
... NAME:
...   {<NNP>+}
... ''')
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]

Although we get Nov. as a NAME chunk, this isn't a wrong result, as Nov. is the name of a month.

How it works...

The NAME chunker is a simple usage of the RegexpParser class, covered in the Chunking and chinking with regular expressions, Merging and splitting chunks with regular expressions, and Partial...

Extracting location chunks


To identify LOCATION chunks, we can make a different kind of ChunkParserI subclass that uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader class that contains the following location words:

  • Country names

  • U.S. states and abbreviations

  • Major U.S. cities

  • Canadian provinces

  • Mexican states

How to do it...

The LocationChunker class, found in chunkers.py, iterates over a tagged sentence looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The helper method iob_locations() is where the IOB LOCATION tags are produced, and the parse() method converts these IOB tags into a Tree:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
  def __init__(self):
    self.locations = set(gazetteers.words())
    self.lookahead = 0

    # Track the maximum number of words in any single location name,
    # so the parser knows how far ahead to look for multi-word locations.
    for loc in self.locations:
      nwords = loc.count(' ')

      if nwords > self.lookahead:
        self.lookahead = nwords
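
The iob_locations() method is omitted from this excerpt, but given the description above and the conlltags2tree import, the parse() method can be sketched as a thin wrapper inside the same class (assuming iob_locations() yields (word, pos, iob) triples as described):

  def parse(self, tagged_sent):
    if not tagged_sent:
      return None
    # Convert the IOB-tagged triples from iob_locations() into a Tree.
    return conlltags2tree(list(self.iob_locations(tagged_sent)))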

Training a named entity chunker


You can train your own named entity chunker using the ieer corpus, which stands for Information Extraction: Entity Recognition. It takes a bit of extra work, though, because the ieer corpus has chunk trees but no part-of-speech tags for words.

How to do it...

Using the ieertree2conlltags() and ieer_chunked_sents() functions in chunkers.py, we can create named entity chunk trees from the ieer corpus to train the ClassifierChunker class created in the Classification-based chunking recipe:

import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
  words, ents = zip(*tree.pos())
  iobs = []
  prev = None

  for ent in ents:
    if ent == tree.label():
      iobs.append('O')
      prev = None
    elif prev == ent:
      iobs.append('I-%s' % ent)
    else:
      iobs.append('B-%s' % ent)
      prev = ent

  words, tags = zip(*tag(words))
  return zip(words, tags, iobs)

def ieer_chunked_sents...
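
The definition is truncated above; assuming it simply pairs ieertree2conlltags() with conlltags2tree() over each parsed document, a plausible completion is:

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
  for doc in ieer.parsed_docs():
    # doc.text is a Tree whose subtrees are labeled with entity tags
    tagged = ieertree2conlltags(doc.text, tag)
    yield conlltags2tree(list(tagged))

The resulting trees can then be used to train the ClassifierChunker class from the Classification-based chunking recipe.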

Training a chunker with NLTK-Trainer


At the end of the previous chapter, Chapter 4, Part-of-speech Tagging, we introduced NLTK-Trainer and the train_tagger.py script. In this recipe, we will cover the script for training chunkers: train_chunker.py.

Note

You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.

How to do it...

As with train_tagger.py, the only required argument to train_chunker.py is the name of a corpus. In this case, we need a corpus that provides a chunked_sents() method, such as treebank_chunk. Here's an example of running train_chunker.py on treebank_chunk:

$ python train_chunker.py treebank_chunk
loading treebank_chunk
4009 chunks, training on 4009
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   97.0%
    Precision:      90.8%
    Recall:         93.9%
    F-Measure:      92.3%
dumping TagChunker to /Users/jacob/nltk_data/chunkers/treebank_chunk_ub.pickle

Just...
