Chapter 3. Creating Custom Corpora

In this chapter, we will cover the following recipes:

  • Setting up a custom corpus

  • Creating a wordlist corpus

  • Creating a part-of-speech tagged word corpus

  • Creating a chunked phrase corpus

  • Creating a categorized text corpus

  • Creating a categorized chunk corpus reader

  • Lazy corpus loading

  • Creating a custom corpus view

  • Creating a MongoDB-backed corpus reader

  • Corpus editing with file locking

Introduction


In this chapter, we'll cover how to use corpus readers and create custom corpora. If you want to train your own model, such as a part-of-speech tagger or text classifier, you will need to create a custom corpus to train on. Model training is covered in the subsequent chapters.

Now you'll learn how to use the existing corpus data that comes with NLTK. This information is essential for future chapters when we'll need to access the corpora as training data. You've already accessed the WordNet corpus in Chapter 1, Tokenizing Text and WordNet Basics. This chapter will introduce you to many more corpora.

We'll also cover creating custom corpus readers, which can be used when your corpus is not in a file format that NLTK already recognizes, or if your corpus is not located in files at all, but instead is located in a database such as MongoDB. It is essential to be familiar with tokenization, which was covered in Chapter 1, Tokenizing Text and WordNet Basics.

Setting up a custom corpus


A corpus is a collection of text documents, and corpora is the plural of corpus. This comes from the Latin word for body; in this case, a body of text. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.

Getting ready

You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, and Mac OS X.

How to do it...

NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpus must be within one of these paths so that NLTK can find it. To avoid conflicts with the official data package, we'll create a custom nltk_data directory in our home directory. The following Python code creates this directory and verifies that it is in the list of known paths specified by nltk.data.path:

>>> import os,...
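
A sketch of that setup, assuming your home directory is writable (os.path.expanduser resolves ~ on Windows, Linux, Unix, and Mac OS X alike):

>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
...     os.mkdir(path)
>>> os.path.exists(path)
True
>>> import nltk.data
>>> path in nltk.data.path
True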

Creating a wordlist corpus


The WordListCorpusReader class is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. In fact, you've already used it when we used the stopwords corpus in Chapter 1, Tokenizing Text and WordNet Basics, in the Filtering stopwords in a tokenized sentence and Discovering word collocations recipes.

Getting ready

We need to start by creating a wordlist file. This could be a single-column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this:

nltk
corpus
corpora
wordnet

How to do it...

Now we can instantiate a WordListCorpusReader class that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as nltk_data/corpora...
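
As a sketch, assuming the wordlist file created earlier is in the current working directory:

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']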

Creating a part-of-speech tagged word corpus


Part-of-speech tagging is the process of assigning a part-of-speech tag to each word. Most of the time, a tagger must first be trained on a training corpus. How to train and use a tagger is covered in detail in Chapter 4, Part-of-speech Tagging, but first we must know how to create and use a training corpus of part-of-speech tagged words.

Getting ready

The simplest format for a tagged corpus is of the form word/tag. The following is an excerpt from the brown corpus:

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb.

Note

Different corpora can use different tags to mean the same thing. For example, the treebank corpus uses different tags as compared to the brown corpus, even though both are English text. But both sets of tags can be converted into a universal tagset, described at the end of this recipe...
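
As an illustrative sketch, suppose the excerpt above is saved as brown.pos in the current directory. A TaggedCorpusReader can read it, and passing tagset='en-brown' at construction lets tagged_words(tagset='universal') map the corpus-specific tags (the default reader uppercases tags):

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.tagged_words(tagset='universal')
[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]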

Creating a chunked phrase corpus


A chunk is a short phrase within a sentence. If you remember sentence diagrams from grade school, they were a tree-like representation of phrases within a sentence. This is exactly what chunks are: subtrees within a sentence tree, and they will be covered in much more detail in Chapter 5, Extracting Chunks. The following is a sample sentence tree with three Noun Phrase (NP) chunks shown as subtrees:

[Tree diagram: a sentence parse tree with NP subtrees over "Earlier staff-reduction moves", "300 jobs", and "the spokesman"]

This recipe will cover how to create a corpus with sentences that contain chunks.

Getting ready

The following is an excerpt from the tagged treebank corpus. It has part-of-speech tags, as in the previous recipe, but it also uses square brackets to denote chunks. This is the same sentence as in the previous tree diagram, but in text form:

[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD...
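
A sketch of reading such text, assuming the excerpt is saved as treebank.chunk in the current directory; each bracketed group comes back as a Tree:

>>> from nltk.corpus.reader import ChunkedCorpusReader
>>> reader = ChunkedCorpusReader('.', r'.*\.chunk')
>>> reader.chunked_words()
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]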

Creating a categorized text corpus


If you have a large corpus of text, you might want to categorize it into separate sections. This can be helpful for organization, or for text classification, which is covered in Chapter 7, Text Classification. The brown corpus, for example, has a number of different categories, as shown in the following code:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

In this recipe, we'll learn how to create our own categorized text corpus.

Getting ready

The easiest way to categorize a corpus is to have one file for each category. The following are two excerpts from the movie_reviews corpus:

  • movie_pos.txt:

    the thin red line is flawed but it provokes .
  • movie_neg.txt:

    a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show...
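
One way to wire this up, sketched under the assumption that both files sit in the current directory and that categories can be derived from the filenames via cat_pattern:

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']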

Creating a categorized chunk corpus reader


NLTK provides the CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader classes, but there's no categorized corpus reader for chunked corpora. So in this recipe, we're going to make one.

Getting ready

Refer to the earlier recipe, Creating a chunked phrase corpus, for an explanation of ChunkedCorpusReader, and refer to the previous recipe for details on CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader, both of which inherit from CategorizedCorpusReader.

How to do it...

We'll create a class called CategorizedChunkedCorpusReader that inherits from both CategorizedCorpusReader and ChunkedCorpusReader. It is heavily based on the CategorizedTaggedCorpusReader class, and also provides three additional methods for getting categorized chunks. The following code is found in catchunked.py:

from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader...
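
A minimal sketch of how such a class can be completed, following the _resolve pattern used by NLTK's own categorized readers; the three categorized accessors below mirror the additional methods named earlier:

from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
    def __init__(self, *args, **kwargs):
        # CategorizedCorpusReader pulls its own keyword arguments
        # (such as cat_pattern) out of kwargs first
        CategorizedCorpusReader.__init__(self, kwargs)
        ChunkedCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        # callers may pass fileids or categories, but not both
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        return fileids

    def chunked_words(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_words(
            self, self._resolve(fileids, categories))

    def chunked_sents(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_sents(
            self, self._resolve(fileids, categories))

    def chunked_paras(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_paras(
            self, self._resolve(fileids, categories))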

Lazy corpus loading


Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks. And while you'll often want to specify a corpus reader in a common module, you don't always need to access it right away. To speed up module import time when a corpus reader is defined, NLTK provides a LazyCorpusLoader class that can transform itself into your actual corpus reader as soon as you need it. This way, you can define a corpus reader in a common module without it slowing down module loading.

How to do it...

The LazyCorpusLoader class takes two required arguments, the name of the corpus and the corpus reader class, followed by any other arguments needed to initialize the corpus reader class.

The name argument specifies the root directory name of the corpus, which must be within a corpora subdirectory of one of the paths in nltk.data.path. See the Setting up a custom corpus recipe of this chapter for more details on nltk.data.path.

For example, if you...
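
As a sketch, suppose a corpus named cookbook containing the earlier wordlist file lives at nltk_data/corpora/cookbook in your home directory; the loader stays lazy until the first real access, at which point it transforms itself into the underlying reader:

>>> from nltk.corpus.util import LazyCorpusLoader
>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['wordlist'])
>>> isinstance(reader, LazyCorpusLoader)
True
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> isinstance(reader, WordListCorpusReader)
True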

Creating a custom corpus view


A corpus view is a class wrapper around a corpus file that reads in blocks of tokens as needed. Its purpose is to provide a view into a file without reading the whole file at once (since corpus files can often be quite large). If the corpus readers included by NLTK already meet all your needs, then you do not have to know anything about corpus views. But, if you have a custom file format that needs special handling, this recipe will show you how to create and use a custom corpus view. The main corpus view class is StreamBackedCorpusView, which opens a single file as a stream, and maintains an internal cache of blocks it has read.

Blocks of tokens are read in with a block reader function. A block can be any piece of text, such as a paragraph or a line, and tokens are parts of a block, such as individual words. In the Creating a part-of-speech tagged word corpus recipe, we discussed the default para_block_reader function of the TaggedCorpusReader class, which reads...
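
As an illustrative sketch (a custom block reader, not NLTK's default), here is a view that treats each line of a file as one block of word tokens:

from nltk.corpus.reader.util import StreamBackedCorpusView

def line_block_reader(stream):
    # read one line and return its words as a single block;
    # an empty list signals that the stream is exhausted
    line = stream.readline()
    if not line:
        return []
    return line.split()

# the view reads and caches blocks lazily as they are accessed
view = StreamBackedCorpusView('wordlist', line_block_reader)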

Creating a MongoDB-backed corpus reader


All the corpus readers we've dealt with so far have been file-based. That is in part due to the design of the CorpusReader base class, and also the assumption that most corpus data will be in text files. However, sometimes you'll have a bunch of data stored in a database that you want to access and use just like a text file corpus. In this recipe, we'll cover the case where you have documents in MongoDB, and you want to use a particular field of each document as your block of text.

Getting ready

MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. The installation and setup of MongoDB is outside the scope of this book, but you can find instructions at http://docs.mongodb.org/manual/.

You'll also need to install PyMongo, a Python driver for MongoDB. You should be able to do this with either easy_install or pip, by typing sudo easy_install pymongo or sudo pip install pymongo.

The following code...
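
The core idea is a lazy sequence over a collection. A rough sketch under assumed names (the database test, collection corpus, and field text are hypothetical placeholders), using PyMongo's MongoClient API:

from pymongo import MongoClient
from nltk.util import AbstractLazySequence, LazyMap

class MongoDBLazySequence(AbstractLazySequence):
    def __init__(self, host='localhost', port=27017,
                 db='test', collection='corpus', field='text'):
        # hypothetical defaults; point these at your own database
        self.collection = MongoClient(host, port)[db][collection]
        self.field = field

    def __len__(self):
        return self.collection.count_documents({})

    def iterate_from(self, start):
        # fetch only the one field we need, lazily mapping each
        # document to that field's value
        cursor = self.collection.find({}, {self.field: 1}, skip=start)
        return iter(LazyMap(lambda d: d.get(self.field, ''), cursor))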

Corpus editing with file locking


Corpus readers and views are all read-only, but there will be times when you want to add to or edit the corpus files. However, modifying a corpus file while other processes are using it, such as through a corpus reader, can lead to dangerous undefined behavior. This is where file locking comes in handy.

Getting ready

You must install the lockfile library using sudo easy_install lockfile or sudo pip install lockfile. This library provides cross-platform file locking, and so will work on Windows, Unix/Linux, Mac OS X, and more. You can find detailed documentation on lockfile at http://packages.python.org/lockfile/.

How to do it...

Here are two file editing functions: append_line() and remove_line(). Both try to acquire an exclusive lock on the file before updating it. An exclusive lock means that these functions will wait until no other process is reading from or writing to the file. Once the lock is acquired, any other process that tries to access the file will...
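
A sketch of the pair, assuming lockfile's FileLock context manager; remove_line() filters the file through a temporary buffer, then truncates the leftover tail:

import shutil
import tempfile
import lockfile

def append_line(fname, line):
    # block until no other lock holder remains, then append
    with lockfile.FileLock(fname):
        with open(fname, 'a+') as fp:
            fp.write(line)
            fp.write('\n')

def remove_line(fname, line):
    # rewrite the file, dropping any line that matches the given text
    with lockfile.FileLock(fname):
        with tempfile.TemporaryFile('w+') as tmp, open(fname, 'r+') as fp:
            for l in fp:
                if l.strip() != line:
                    tmp.write(l)
            # copy the filtered content back, then cut off the old tail
            fp.seek(0)
            tmp.seek(0)
            shutil.copyfileobj(tmp, fp)
            fp.truncate()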
