Python Text Processing with NLTK 2.0 Cookbook

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

Jacob Perkins

Book Details

ISBN 13: 978-1-84951-360-9
Paperback: 272 pages

About This Book

  • Quickly get to grips with Natural Language Processing – with text analysis, text mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Who This Book Is For

This book is for Python programmers who want to quickly get to grips with using NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced with NLTK will also find it useful, and students of linguistics will find it invaluable.

Table of Contents

Chapter 1: Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Filtering stopwords in a tokenized sentence
Looking up synsets for a word in WordNet
Looking up lemmas and synonyms in WordNet
Calculating WordNet synset similarity
Discovering word collocations
Chapter 2: Replacing and Correcting Words
Introduction
Stemming words
Lemmatizing words with WordNet
Translating text with Babelfish
Replacing words matching regular expressions
Removing repeating characters
Spelling correction with Enchant
Replacing synonyms
Replacing negations with antonyms
Chapter 3: Creating Custom Corpora
Introduction
Setting up a custom corpus
Creating a word list corpus
Creating a part-of-speech tagged word corpus
Creating a chunked phrase corpus
Creating a categorized text corpus
Creating a categorized chunk corpus reader
Lazy corpus loading
Creating a custom corpus view
Creating a MongoDB backed corpus reader
Corpus editing with file locking
Chapter 4: Part-of-Speech Tagging
Introduction
Default tagging
Training a unigram part-of-speech tagger
Combining taggers with backoff tagging
Training and combining Ngram taggers
Creating a model of likely word tags
Tagging with regular expressions
Affix tagging
Training a Brill tagger
Training the TnT tagger
Using WordNet for tagging
Tagging proper names
Classifier based tagging
Chapter 5: Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Merging and splitting chunks with regular expressions
Expanding and removing chunks with regular expressions
Partial parsing with regular expressions
Training a tagger-based chunker
Classification-based chunking
Extracting named entities
Extracting proper noun chunks
Extracting location chunks
Training a named entity chunker
Chapter 6: Transforming Chunks and Trees
Introduction
Filtering insignificant words
Correcting verb forms
Swapping verb phrases
Swapping noun cardinals
Swapping infinitive phrases
Singularizing plural nouns
Chaining chunk transformations
Converting a chunk tree to text
Flattening a deep tree
Creating a shallow tree
Converting tree nodes
Chapter 7: Text Classification
Introduction
Bag of Words feature extraction
Training a naive Bayes classifier
Training a decision tree classifier
Training a maximum entropy classifier
Measuring precision and recall of a classifier
Calculating high information words
Combining classifiers with voting
Classifying with multiple binary classifiers
Chapter 8: Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Distributed chunking with execnet
Parallel list processing with execnet
Storing a frequency distribution in Redis
Storing a conditional frequency distribution in Redis
Storing an ordered dictionary in Redis
Distributed word scoring with Redis and execnet
Chapter 9: Parsing Specific Data
Introduction
Parsing dates and times with Dateutil
Time zone lookup and conversion
Tagging temporal expressions with Timex
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings

What You Will Learn

  • Learn text categorization and topic identification
  • Learn stemming and lemmatization, and how to go beyond the usual spell checker
  • Replace negations with antonyms in your text
  • Learn to tokenize text into lists of sentences and words, and gain an insight into WordNet (see the short sketch after this list)
  • Transform and manipulate chunks and trees
  • Learn advanced features of corpus readers and create your own custom corpora
  • Tag different parts of speech by creating, training, and using a part-of-speech tagger
  • Improve accuracy by combining multiple part-of-speech taggers
  • Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence
  • Produce an alternative canonical form without changing the meaning by normalizing parsed chunks
  • Learn how search engines use Natural Language Processing to process text
  • Make your site more discoverable by learning how to automatically replace words with more searched equivalents
  • Parse dates, times, and HTML
  • Train and manipulate different types of classifiers
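
As a taste of the Chapter 1 recipes, here is a minimal sketch (not taken from the book) of sentence and word tokenization plus a WordNet synset lookup. It assumes NLTK is installed and that the punkt and wordnet data packages have already been fetched with nltk.download().

    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import wordnet

    text = "NLTK ships with sentence and word tokenizers. It also bundles WordNet."

    # Split the raw text into sentences, then split each sentence into word tokens.
    for sentence in sent_tokenize(text):
        print(word_tokenize(sentence))

    # Look up the WordNet synsets (sense groupings) for a single word.
    print(wordnet.synsets("book"))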

In Detail

Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short so that you can dive right into the science of text processing with a practical, hands-on approach.

Start off by learning to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.
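
To illustrate the kind of recipe involved, the following is a minimal sketch (illustrative only, not code from the book) contrasting a rule-based stemmer with WordNet-backed lemmatization; it assumes the wordnet data package has been downloaded with nltk.download().

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # A stemmer chops suffixes heuristically, so the result may not be a real word.
    print(stemmer.stem("cooking"))    # 'cook'
    print(stemmer.stem("believes"))   # 'believ'

    # A lemmatizer maps a word to its dictionary form using WordNet,
    # optionally guided by a part-of-speech hint.
    print(lemmatizer.lemmatize("cookbooks"))         # 'cookbook'
    print(lemmatizer.lemmatize("cooking", pos="v"))  # 'cook'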

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

Authors

Jacob Perkins