Reader small image

You're reading from  Python 3 Text Processing with NLTK 3 Cookbook

Product typeBook
Published inAug 2014
Reading LevelBeginner
Publisher
ISBN-139781782167853
Edition1st Edition
Languages
Tools
Right arrow
Author (1)
Jacob Perkins
Jacob Perkins
author image
Jacob Perkins

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go. He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com. To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.
Read more about Jacob Perkins

Right arrow

Training a sentence tokenizer


NLTK's default sentence tokenizer is general purpose, and usually works quite well. But sometimes it is not the best choice for your text. Perhaps your text uses nonstandard punctuation, or is formatted in a unique way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.

Getting ready

For this example, we'll be using the webtext corpus, specifically the overheard.txt file, so make sure you've downloaded this corpus. The text in this file is formatted as dialog that looks like this:

White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, that sounds good.

As you can see, this isn't your standard paragraph of sentences formatting, which makes it a perfect case for training a sentence tokenizer.

How to do it...

NLTK provides a PunktSentenceTokenizer class that you can train on raw text to produce a custom sentence tokenizer. You can get raw text either by reading in a file, or from an NLTK corpus using the raw() method. Here's an example of training a sentence tokenizer on dialog text, using overheard.txt from the webtext corpus:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk.corpus import webtext
>>> text = webtext.raw('overheard.txt')
>>> sent_tokenizer = PunktSentenceTokenizer(text)

Let's compare the results to the default sentence tokenizer, as follows:

>>> sents1 = sent_tokenizer.tokenize(text)
>>> sents1[0]
'White guy: So, do you have any plans for this evening?'

>>> from nltk.tokenize import sent_tokenize
>>> sents2 = sent_tokenize(text)
>>> sents2[0]
'White guy: So, do you have any plans for this evening?'
>>> sents1[678]
'Girl: But you already have a Big Mac...'
>>> sents2[678]
'Girl: But you already have a Big Mac...\\nHobo: Oh, this is all theatrical.'

While the first sentence is the same, you can see that the tokenizers disagree on how to tokenize sentence 679 (this is the first sentence where the tokenizers diverge). The default tokenizer includes the next line of dialog, while our custom tokenizer correctly thinks that the next line is a separate sentence. This difference is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text isn't in the typical paragraph-sentence structure.

How it works...

The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because you don't have to give it any labeled training data, just raw text. You can read more about these kinds of algorithms at https://en.wikipedia.org/wiki/Unsupervised_learning. The specific technique used in this case is called sentence boundary detection and it works by counting punctuation and tokens that commonly end a sentence, such as a period or newline, then using the resulting frequencies to decide what the sentence boundaries should actually look like.

This is a simplified description of the algorithm—if you'd like more details, take a look at the source code of the nltk.tokenize.punkt.PunktTrainer class, which can be found online at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktSentenceTokenizer.

There's more...

The PunktSentenceTokenizer class learns from any string, which means you can open a text file and read its content. Here is an example of reading overheard.txt directly instead of using the raw() corpus method. This assumes that the webtext corpus is located in the standard directory at /usr/share/nltk_data/corpora. We also have to pass a specific encoding to the open() function, as follows, because the file is not in ASCII:

>>> with open('/usr/share/nltk_data/corpora/webtext/overheard.txt', encoding='ISO-8859-2') as f:
...   text = f.read()
>>> sent_tokenizer = PunktSentenceTokenizer(text)
>>> sents = sent_tokenizer.tokenize(text)
>>> sents[0]
'White guy: So, do you have any plans for this evening?'
>>> sents[678]
'Girl: But you already have a Big Mac...'

Once you have a custom sentence tokenizer, you can use it for your own corpora. Many corpus readers accept a sent_tokenizer parameter, which lets you override the default sentence tokenizer object with your own sentence tokenizer. Corpus readers are covered in more detail in Chapter 3, Creating Custom Corpora.

See also

Most of the time, the default sentence tokenizer will be sufficient. This is covered in the first recipe, Tokenizing text into sentences.

Previous PageNext Page
You have been reading a chapter from
Python 3 Text Processing with NLTK 3 Cookbook
Published in: Aug 2014Publisher: ISBN-13: 9781782167853
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Jacob Perkins

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go. He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com. To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.
Read more about Jacob Perkins