
Python 3 Text Processing with NLTK 3 Cookbook

Jacob Perkins

Over 80 practical recipes on natural language processing techniques using Python's NLTK 3.0
eBook: RRP $26.99
Print + eBook: RRP $44.99


Book Details

ISBN 13: 9781782167853
Paperback: 304 pages

About This Book

  • Break text down into its component parts for spelling correction, feature extraction, and phrase transformation
  • Learn how to do custom sentiment analysis and named entity recognition
  • Work through the natural language processing concepts with simple and easy-to-follow programming recipes

Who This Book Is For

This book is intended for Python programmers interested in learning how to do natural language processing. Maybe you’ve learned the limits of regular expressions the hard way, or you’ve realized that human language cannot be deterministically parsed like a computer language. Perhaps you have more text than you know what to do with, and need automated ways to analyze and structure that text. This cookbook will show you how to train and use statistical language models to process text in ways that are practically impossible with standard programming tools. A basic knowledge of Python and basic text processing concepts is expected. Some experience with regular expressions will also be helpful.

Table of Contents

Chapter 1: Tokenizing Text and WordNet Basics
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Training a sentence tokenizer
Filtering stopwords in a tokenized sentence
Looking up Synsets for a word in WordNet
Looking up lemmas and synonyms in WordNet
Calculating WordNet Synset similarity
Discovering word collocations
Chapter 2: Replacing and Correcting Words
Stemming words
Lemmatizing words with WordNet
Replacing words matching regular expressions
Removing repeating characters
Spelling correction with Enchant
Replacing synonyms
Replacing negations with antonyms
Chapter 3: Creating Custom Corpora
Setting up a custom corpus
Creating a wordlist corpus
Creating a part-of-speech tagged word corpus
Creating a chunked phrase corpus
Creating a categorized text corpus
Creating a categorized chunk corpus reader
Lazy corpus loading
Creating a custom corpus view
Creating a MongoDB-backed corpus reader
Corpus editing with file locking
Chapter 4: Part-of-speech Tagging
Default tagging
Training a unigram part-of-speech tagger
Combining taggers with backoff tagging
Training and combining ngram taggers
Creating a model of likely word tags
Tagging with regular expressions
Affix tagging
Training a Brill tagger
Training the TnT tagger
Using WordNet for tagging
Tagging proper names
Classifier-based tagging
Training a tagger with NLTK-Trainer
Chapter 5: Extracting Chunks
Chunking and chinking with regular expressions
Merging and splitting chunks with regular expressions
Expanding and removing chunks with regular expressions
Partial parsing with regular expressions
Training a tagger-based chunker
Classification-based chunking
Extracting named entities
Extracting proper noun chunks
Extracting location chunks
Training a named entity chunker
Training a chunker with NLTK-Trainer
Chapter 6: Transforming Chunks and Trees
Filtering insignificant words from a sentence
Correcting verb forms
Swapping verb phrases
Swapping noun cardinals
Swapping infinitive phrases
Singularizing plural nouns
Chaining chunk transformations
Converting a chunk tree to text
Flattening a deep tree
Creating a shallow tree
Converting tree labels
Chapter 7: Text Classification
Bag of words feature extraction
Training a Naive Bayes classifier
Training a decision tree classifier
Training a maximum entropy classifier
Training scikit-learn classifiers
Measuring precision and recall of a classifier
Calculating high information words
Combining classifiers with voting
Classifying with multiple binary classifiers
Training a classifier with NLTK-Trainer
Chapter 8: Distributed Processing and Handling Large Datasets
Distributed tagging with execnet
Distributed chunking with execnet
Parallel list processing with execnet
Storing a frequency distribution in Redis
Storing a conditional frequency distribution in Redis
Storing an ordered dictionary in Redis
Distributed word scoring with Redis and execnet
Chapter 9: Parsing Specific Data Types
Parsing dates and times with dateutil
Timezone lookup and conversion
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings

What You Will Learn

  • Tokenize text into sentences, and sentences into words
  • Look up words in the WordNet dictionary
  • Apply spelling correction and word replacement
  • Access the built-in text corpora and create your own custom corpus
  • Tag words with parts of speech
  • Chunk phrases and recognize named entities
  • Grammatically transform phrases and chunks
  • Classify text and perform sentiment analysis
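To give a flavor of the first bullet, here is a minimal, stdlib-only sketch of tokenizing text into sentences and sentences into words. The regex heuristics below are purely illustrative assumptions; the book's recipes use NLTK's trained tokenizers, which handle many edge cases these patterns do not.

```python
import re

def sent_tokenize(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(sentence):
    # Split out words (keeping contractions) and punctuation as separate tokens.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "NLTK makes text processing easy. Don't you agree?"
sentences = sent_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]
# tokens[1] is ["Don't", "you", "agree", "?"]
```

In NLTK itself, the equivalent calls are `nltk.sent_tokenize()` and `nltk.word_tokenize()`, covered in Chapter 1.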

In Detail

This book will show you the essential techniques of text and language processing. Starting with tokenization, stemming, and the WordNet dictionary, you'll progress to part-of-speech tagging, phrase chunking, and named entity recognition. You'll learn how various text corpora are organized, as well as how to create your own custom corpus. Then, you'll move on to text classification with a focus on sentiment analysis. And because NLP can be computationally expensive on large bodies of text, you'll try a few methods for distributed text processing. Finally, you'll be introduced to a number of other small but complementary Python libraries for text analysis, cleaning, and parsing.

This cookbook provides simple, straightforward examples so you can quickly learn text processing with Python and NLTK.
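As a taste of the text classification material (Chapter 7), the following is a small, self-contained sketch of bag-of-words feature extraction feeding a Naive Bayes classifier with add-one smoothing. It uses only the standard library; the names and the toy training data are assumptions for illustration, whereas the book's recipes use NLTK's `NaiveBayesClassifier` and real corpora.

```python
import math
from collections import Counter, defaultdict

def bag_of_words(text):
    # Simplest feature extraction: count each lowercase whitespace token.
    return Counter(text.lower().split())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.label_counts = Counter()            # documents seen per label

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(bag_of_words(text))

    def classify(self, text):
        words = bag_of_words(text)
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.label_counts.values())
        best, best_score = None, float('-inf')
        for label in self.label_counts:
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(self.label_counts[label] / total_docs)
            total = sum(self.word_counts[label].values())
            for word, n in words.items():
                p = (self.word_counts[label][word] + 1) / (total + len(vocab))
                score += n * math.log(p)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train([("great fantastic loved it", "pos"),
          ("terrible awful hated it", "neg")])
```

With this toy model, `nb.classify("fantastic and great")` returns `"pos"` and `nb.classify("awful movie")` returns `"neg"`; the book extends the same idea to real sentiment corpora and to other classifiers such as decision trees and maximum entropy.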
