Python 3 Text Processing with NLTK 3 Cookbook

Product type: Book
Published in: Aug 2014
Reading level: Beginner
ISBN-13: 9781782167853
Edition: 1st Edition
Languages: Python
Tools: NLTK
Concepts: Data Processing
Author: Jacob Perkins

Preface

Natural language processing is used everywhere, from search engines such as Google or Weotta, to voice interfaces such as Siri or Dragon NaturallySpeaking. Python's Natural Language Toolkit (NLTK) is a suite of libraries that has become one of the best tools for prototyping and building natural language processing systems.

Python 3 Text Processing with NLTK 3 Cookbook is your handy and illustrative guide, which will walk you through many natural language processing techniques in a step-by-step manner. It will demystify the dark arts of text mining and language processing using the comprehensive Natural Language Toolkit.

This book cuts short the preamble, ignores pedagogy, and lets you dive right into the techniques of text processing with a practical hands-on approach.

Get started by learning how to tokenize text into words and sentences, then explore the WordNet lexical dictionary. Learn the basics of stemming and lemmatization. Discover various ways to replace words and perform spelling corrections. Create your own corpora and custom corpus readers, including a MongoDB-based corpus reader. Use part-of-speech taggers to annotate words. Create and transform chunked phrase trees and named entities using partial parsing and chunk transformations. Dig into feature extraction and text classification for sentiment analysis. Learn how to process large amounts of text with distributed processing and NoSQL databases.

This book will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.

What this book covers

Chapter 1, Tokenizing Text and WordNet Basics, covers how to tokenize text into sentences and words, then look up those words in the WordNet lexical dictionary.

Chapter 2, Replacing and Correcting Words, demonstrates various word replacement and correction techniques, including stemming, lemmatization, and using the Enchant spelling dictionary.

Chapter 3, Creating Custom Corpora, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK.

Chapter 4, Part-of-speech Tagging, shows how to annotate a sentence of words with part-of-speech tags, and how to train your own custom part-of-speech tagger.

Chapter 5, Extracting Chunks, covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

Chapter 6, Transforming Chunks and Trees, demonstrates how to transform chunk phrases and parse trees in various ways.

Chapter 7, Text Classification, shows how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. It also covers multi-label classification and classifier evaluation metrics.

Chapter 8, Distributed Processing and Handling Large Datasets, discusses how to use execnet for distributed natural language processing and how to use Redis for storing large datasets.

Chapter 9, Parsing Specific Data Types, covers various Python modules that are useful for parsing specific kinds of data, such as datetimes and HTML.

Appendix A, Penn Treebank Part-of-speech Tags, shows a table of Treebank part-of-speech tags, which is a useful reference for Chapter 3, Creating Custom Corpora, and Chapter 4, Part-of-speech Tagging.

What you need for this book

You will need Python 3 and the listed Python packages. For this book, I used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this book:

NLTK>=3.0a4
pyenchant>=1.6.5
lockfile>=0.9.1
numpy>=1.8.0
scipy>=0.13.0
scikit-learn>=0.14.1
execnet>=1.1
pymongo>=2.6.3
redis>=2.8.0
lxml>=3.2.3
beautifulsoup4>=4.3.2
python-dateutil>=2.0
charade>=1.0.3
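
As a rough illustration of how the list above might be checked against your own environment, the following sketch parses the requirement lines and reports which packages are installed. The helper names parse_requirements and installed_version are hypothetical and not part of the book's code, and the lookup assumes importlib.metadata, which only exists in Python 3.8 and later (newer than the Python 3.3.5 used for the book).

```python
# Requirement lines copied from the preface (name>=minimum_version).
REQUIREMENTS = """\
NLTK>=3.0a4
pyenchant>=1.6.5
lockfile>=0.9.1
numpy>=1.8.0
scipy>=0.13.0
scikit-learn>=0.14.1
execnet>=1.1
pymongo>=2.6.3
redis>=2.8.0
lxml>=3.2.3
beautifulsoup4>=4.3.2
python-dateutil>=2.0
charade>=1.0.3
"""


def parse_requirements(text):
    """Return a dict mapping package name to its minimum version string.

    Hypothetical helper: only handles the simple 'name>=version' form
    used in the list above; a bare name maps to None.
    """
    reqs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition(">=")
        reqs[name.strip()] = version.strip() if sep else None
    return reqs


def installed_version(name):
    """Return the installed version of a package, or None if unavailable.

    Assumes importlib.metadata (Python 3.8+); on older interpreters this
    simply returns None.
    """
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, minimum in parse_requirements(REQUIREMENTS).items():
        found = installed_version(name) or "not installed"
        print("{:<16} needs >= {:<8} found: {}".format(name, minimum, found))
```

The sketch only reports versions; it deliberately does not compare them, since pre-release strings such as 3.0a4 need a proper version parser. Installing the packages themselves is still done with pip as described above.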

You will also need NLTK-Trainer, which is available at the following link: https://github.com/japerk/nltk-trainer

Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. These can be downloaded at http://www.mongodb.org/ and http://redis.io/, respectively.

Who this book is for

If you are an intermediate to advanced Python programmer who wants to quickly get to grips with using NLTK for natural language processing, this is the book for you. It will help if you are somewhat familiar with basic text processing techniques, such as regular expressions. Programmers with NLTK experience may learn something new, and students of linguistics will find it invaluable.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module."

A block of code is set as follows:

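>>> from nltk.tokenize import sent_tokenize
>>> # para is not defined in this excerpt; this value is assumed from the output below
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']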

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

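>>> # doc is not defined in this excerpt; an lxml HTML document like this one is assumed
>>> from lxml import html
>>> doc = html.fromstring('Hello <a href="/world">world</a>')
>>> doc.make_links_absolute('http://hello')
>>> abslinks = list(doc.iterlinks())
>>> (el, attr, link, pos) = abslinks[0]
>>> link
'http://hello/world'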

Any command-line input or output is written as follows:

$ python train_chunker.py treebank_chunk

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Luckily, this will produce an exception with the message 'DictVectorizer' object has no attribute 'vocabulary_'".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Code for this book is also available at https://github.com/japerk/nltk3-cookbook. This is where you can find named modules mentioned in recipes, such as replacers.py.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

About the author

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go. He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com. To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.