You're reading from Python 3 Text Processing with NLTK 3 Cookbook

Product type: Book
Published in: Aug 2014
Reading Level: Beginner
ISBN-13: 9781782167853
Edition: 1st Edition
Author (1)
Jacob Perkins

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go. He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com. To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.

Preface

Natural language processing is used everywhere, from search engines such as Google or Weotta, to voice interfaces such as Siri or Dragon NaturallySpeaking. Python's Natural Language Toolkit (NLTK) is a suite of libraries that has become one of the best tools for prototyping and building natural language processing systems.

Python 3 Text Processing with NLTK 3 Cookbook is your handy and illustrative guide, which will walk you through many natural language processing techniques in a step-by-step manner. It will demystify the dark arts of text mining and language processing using the comprehensive Natural Language Toolkit.

This book cuts short the preamble, ignores pedagogy, and lets you dive right into the techniques of text processing with a practical hands-on approach.

Get started by learning how to tokenize text into words and sentences, then explore the WordNet lexical dictionary. Learn the basics of stemming and lemmatization. Discover various ways to replace words and perform spelling corrections. Create your own corpora and custom corpus readers, including a MongoDB-based corpus reader. Use part-of-speech taggers to annotate words. Create and transform chunked phrase trees and named entities using partial parsing and chunk transformations. Dig into feature extraction and text classification for sentiment analysis. Learn how to process large amounts of text with distributed processing and NoSQL databases.

This book will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.

What this book covers

Chapter 1, Tokenizing Text and WordNet Basics, covers how to tokenize text into sentences and words, then look up those words in the WordNet lexical dictionary.

Chapter 2, Replacing and Correcting Words, demonstrates various word replacement and correction techniques, including stemming, lemmatization, and using the Enchant spelling dictionary.

Chapter 3, Creating Custom Corpora, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK.

Chapter 4, Part-of-speech Tagging, shows how to annotate a sentence of words with part-of-speech tags, and how to train your own custom part-of-speech tagger.

Chapter 5, Extracting Chunks, covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

Chapter 6, Transforming Chunks and Trees, demonstrates how to transform chunk phrases and parse trees in various ways.

Chapter 7, Text Classification, shows how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. It also covers multi-label classification and classifier evaluation metrics.

Chapter 8, Distributed Processing and Handling Large Datasets, discusses how to use execnet for distributed natural language processing and how to use Redis for storing large datasets.

Chapter 9, Parsing Specific Data Types, covers various Python modules that are useful for parsing specific kinds of data, such as datetimes and HTML.

Appendix A, Penn Treebank Part-of-speech Tags, shows a table of Treebank part-of-speech tags, which is a useful reference for Chapter 3, Creating Custom Corpora, and Chapter 4, Part-of-speech Tagging.

What you need for this book

You will need Python 3 and the listed Python packages. For this book, I used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this book:

  • NLTK>=3.0a4

  • pyenchant>=1.6.5

  • lockfile>=0.9.1

  • numpy>=1.8.0

  • scipy>=0.13.0

  • scikit-learn>=0.14.1

  • execnet>=1.1

  • pymongo>=2.6.3

  • redis>=2.8.0

  • lxml>=3.2.3

  • beautifulsoup4>=4.3.2

  • python-dateutil>=2.0

  • charade>=1.0.3
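As a quick check of the list above, the following minimal sketch reports which of these packages are installed and at what version. It is not part of the book's code: the helper name installed_versions is hypothetical, and it assumes Python 3.8 or later for importlib.metadata, which is newer than the Python 3.3.5 used while writing the book. It only reports versions and does not compare them against the minimums listed above.

```python
# Assumption: Python 3.8+ (importlib.metadata is not available in Python 3.3.5).
from importlib import metadata

# Distribution names copied from the requirements list above (lowercased).
REQUIRED = [
    "nltk", "pyenchant", "lockfile", "numpy", "scipy", "scikit-learn",
    "execnet", "pymongo", "redis", "lxml", "beautifulsoup4",
    "python-dateutil", "charade",
]


def installed_versions(names):
    """Map each distribution name to its installed version string, or None if absent.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print("{:<16} {}".format(name, version or "not installed"))
```

Any package reported as "not installed" can then be installed with pip as described above.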

You will also need NLTK-Trainer, which is available at the following link: https://github.com/japerk/nltk-trainer

Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. These can be downloaded at http://www.mongodb.org/ and http://redis.io/, respectively.

Who this book is for

If you are an intermediate to advanced Python programmer who wants to quickly get to grips with using NLTK for natural language processing, this is the book for you. It will help if you are somewhat familiar with basic text processing techniques, such as regular expressions. Programmers with NLTK experience may learn something new, and students of linguistics will find it invaluable.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module."

A block of code is set as follows:

>>> from nltk.tokenize import sent_tokenize
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

>>> from lxml import html
>>> doc = html.fromstring('Hello <a href="/world">world</a>')
>>> doc.make_links_absolute('http://hello')
>>> abslinks = list(doc.iterlinks())
>>> (el, attr, link, pos) = abslinks[0]
>>> link
'http://hello/world'

Any command-line input or output is written as follows:

$ python train_chunker.py treebank_chunk

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Luckily, this will produce an exception with the message 'DictVectorizer' object has no attribute 'vocabulary_'".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send us an e-mail and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Code for this book is also available at https://github.com/japerk/nltk3-cookbook. This is where you can find named modules mentioned in recipes, such as replacers.py.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us if you are having a problem with any aspect of the book, and we will do our best to address it.
