Python Text Processing with NLTK 2.0 Cookbook

Jacob Perkins

  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Book Details

Language : English
Paperback : 272 pages [ 235mm x 191mm ]
Release Date : November 2010
ISBN : 1849513600
ISBN 13 : 978-1-84951-360-9
Author(s) : Jacob Perkins
Topics and Technologies : All Books, Cookbooks, Open Source


Table of Contents

Preface
Chapter 1: Tokenizing Text and WordNet Basics
Chapter 2: Replacing and Correcting Words
Chapter 3: Creating Custom Corpora
Chapter 4: Part-of-Speech Tagging
Chapter 5: Extracting Chunks
Chapter 6: Transforming Chunks and Trees
Chapter 7: Text Classification
Chapter 8: Distributed Processing and Handling Large Datasets
Chapter 9: Parsing Specific Data
Appendix: Penn Treebank Part-of-Speech Tags
Index
  • Chapter 1: Tokenizing Text and WordNet Basics
    • Introduction
    • Tokenizing text into sentences
    • Tokenizing sentences into words
    • Tokenizing sentences using regular expressions
    • Filtering stopwords in a tokenized sentence
    • Looking up synsets for a word in WordNet
    • Looking up lemmas and synonyms in WordNet
    • Calculating WordNet synset similarity
    • Discovering word collocations
  • Chapter 2: Replacing and Correcting Words
    • Introduction
    • Stemming words
    • Lemmatizing words with WordNet
    • Translating text with Babelfish
    • Replacing words matching regular expressions
    • Removing repeating characters
    • Spelling correction with Enchant
    • Replacing synonyms
    • Replacing negations with antonyms
  • Chapter 3: Creating Custom Corpora
    • Introduction
    • Setting up a custom corpus
    • Creating a word list corpus
    • Creating a part-of-speech tagged word corpus
    • Creating a chunked phrase corpus
    • Creating a categorized text corpus
    • Creating a categorized chunk corpus reader
    • Lazy corpus loading
    • Creating a custom corpus view
    • Creating a MongoDB backed corpus reader
    • Corpus editing with file locking
  • Chapter 4: Part-of-Speech Tagging
    • Introduction
    • Default tagging
    • Training a unigram part-of-speech tagger
    • Combining taggers with backoff tagging
    • Training and combining Ngram taggers
    • Creating a model of likely word tags
    • Tagging with regular expressions
    • Affix tagging
    • Training a Brill tagger
    • Training the TnT tagger
    • Using WordNet for tagging
    • Tagging proper names
    • Classifier based tagging
  • Chapter 5: Extracting Chunks
    • Introduction
    • Chunking and chinking with regular expressions
    • Merging and splitting chunks with regular expressions
    • Expanding and removing chunks with regular expressions
    • Partial parsing with regular expressions
    • Training a tagger-based chunker
    • Classification-based chunking
    • Extracting named entities
    • Extracting proper noun chunks
    • Extracting location chunks
    • Training a named entity chunker
  • Chapter 6: Transforming Chunks and Trees
    • Introduction
    • Filtering insignificant words
    • Correcting verb forms
    • Swapping verb phrases
    • Swapping noun cardinals
    • Swapping infinitive phrases
    • Singularizing plural nouns
    • Chaining chunk transformations
    • Converting a chunk tree to text
    • Flattening a deep tree
    • Creating a shallow tree
    • Converting tree nodes
  • Chapter 7: Text Classification
    • Introduction
    • Bag of Words feature extraction
    • Training a naive Bayes classifier
    • Training a decision tree classifier
    • Training a maximum entropy classifier
    • Measuring precision and recall of a classifier
    • Calculating high information words
    • Combining classifiers with voting
    • Classifying with multiple binary classifiers
  • Chapter 8: Distributed Processing and Handling Large Datasets
    • Introduction
    • Distributed tagging with execnet
    • Distributed chunking with execnet
    • Parallel list processing with execnet
    • Storing a frequency distribution in Redis
    • Storing a conditional frequency distribution in Redis
    • Storing an ordered dictionary in Redis
    • Distributed word scoring with Redis and execnet
  • Chapter 9: Parsing Specific Data
    • Introduction
    • Parsing dates and times with Dateutil
    • Time zone lookup and conversion
    • Tagging temporal expressions with Timex
    • Extracting URLs from HTML with lxml
    • Cleaning and stripping HTML
    • Converting HTML entities with BeautifulSoup
    • Detecting and converting character encodings

Jacob Perkins

Jacob Perkins has been an avid user of open source software since high school, when he first built his own computer and didn't want to pay for Windows. At one point he had 5 operating systems installed, including RedHat Linux, OpenBSD, and BeOS.

While at Washington University in St. Louis, Jacob took classes in Spanish and poetry writing, and worked on an independent study project that eventually became his Master's Project: WUGLE – a GUI for manipulating logical expressions. In his free time, he wrote the Gnome2 version of Seahorse (a GUI for encryption and key management), which has since been translated into over a dozen languages and is included in the default Gnome distribution.

After getting his MS in Computer Science, Jacob tried to start a web development studio with some friends, but since no one knew anything about web development, it didn't work out as planned. Once he'd actually learned web development, he went off and co-founded another company called Weotta, which sparked his interest in Machine Learning and Natural Language Processing.

Jacob is currently the CTO / Chief Hacker for Weotta and blogs about what he's learned along the way at http://streamhacker.com/. He is also applying this knowledge to produce text processing APIs and demos at http://text-processing.com/. This book is a synthesis of his knowledge on processing text using Python, NLTK, and more.

What you will learn from this book

  • Learn text categorization and topic identification
  • Learn stemming and lemmatization and how to go beyond the usual spell checker
  • Replace negations with antonyms in your text
  • Learn to tokenize text into lists of sentences and words, and gain an insight into WordNet
  • Transform and manipulate chunks and trees
  • Learn advanced features of corpus readers and create your own custom corpora
  • Tag different parts of speech by creating, training, and using a part-of-speech tagger
  • Improve accuracy by combining multiple part-of-speech taggers
  • Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence
  • Produce an alternative canonical form without changing the meaning by normalizing parsed chunks
  • Learn how search engines use Natural Language Processing to process text
  • Make your site more discoverable by learning how to automatically replace words with more searched equivalents
  • Parse dates, times, and HTML
  • Train and manipulate different types of classifiers
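
As a rough illustration of the first steps listed above (tokenizing text into sentences and words, then turning the words into features for a classifier), here is a minimal sketch that uses only Python's standard library. It is not the book's code and does not use NLTK: the function names, the regular expressions, and the tiny stopword list are all assumptions made for this example. NLTK's own tokenizers, such as `sent_tokenize` and `word_tokenize`, handle abbreviations and punctuation far more carefully.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK ships a much larger one).
STOPWORDS = {"a", "an", "the", "is", "it", "and", "of", "to"}

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Return runs of letters, digits, and apostrophes as lowercase tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())

def bag_of_words(words, stopwords=STOPWORDS):
    """Map each non-stopword token to True, a simple feature-dict shape."""
    return {word: True for word in words if word not in stopwords}

text = "NLTK is a toolkit. It makes tokenizing text easy!"
sentences = split_sentences(text)
features = bag_of_words(split_words(sentences[0]))
```

A naive splitter like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained sentence tokenizer for real text.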

In Detail

Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short so you can dive right into the science of text processing with a practical, hands-on approach.

Start by learning to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of Stemming and Lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.
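
To give a flavor of the word replacement idea mentioned above (swapping words for simpler or more common variants), here is a minimal sketch based on a plain dictionary lookup. The class name, method name, and sample mapping are hypothetical, chosen for illustration only; this is not presented as the book's implementation, and a real system would more likely draw its mapping from a resource such as WordNet or a curated synonym file.

```python
class WordReplacer:
    """Replace words using a fixed mapping (hypothetical example class)."""

    def __init__(self, word_map):
        # word_map: dict from a word to its preferred replacement.
        self.word_map = word_map

    def replace(self, word):
        # Fall back to the original word when no replacement is known.
        return self.word_map.get(word, word)

# Sample mapping invented for this example.
replacer = WordReplacer({"bday": "birthday", "sup": "hello"})
tokens = ["happy", "bday"]
replaced = [replacer.replace(token) for token in tokens]
```

Because unknown words pass through unchanged, a replacer like this can be applied to every token in a sentence without special-casing.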

Approach

The learn-by-doing approach of this book will enable you to dive right into the heart of text processing from the very first page. Each recipe is carefully designed to fulfill your appetite for Natural Language Processing. Packed with numerous illustrative examples and code samples, it will make the task of using the NLTK for Natural Language Processing easy and straightforward.

Who this book is for

This book is for Python programmers who want to quickly get to grips with using the NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced in the NLTK will also find it useful. Students of linguistics will find it invaluable.
