Chapter 9. Parsing Specific Data Types

In this chapter, we will cover the following recipes:

  • Parsing dates and times with dateutil

  • Timezone lookup and conversion

  • Extracting URLs from HTML with lxml

  • Cleaning and stripping HTML

  • Converting HTML entities with BeautifulSoup

  • Detecting and converting character encodings

Introduction


This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries to accomplish this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to NLTK:

  • dateutil provides datetime parsing and timezone conversion

  • lxml and BeautifulSoup can parse, clean, and convert HTML

  • charade and UnicodeDammit can detect and convert text character encoding

These libraries can be useful for preprocessing text before passing it to an NLTK object, or for postprocessing text that NLTK has already processed and extracted. Coming up is an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once...
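
For illustration, here is a minimal sketch of that pipeline, shown as a hedged example rather than the book's own code; the markup and URL are made up, and the lxml and dateutil calls are covered in the recipes that follow:

>>> import lxml.html
>>> from dateutil import parser
>>> # hypothetical blog post markup
>>> page = '<html><body><p class="date">Thu Sep 25 10:36:28 2010</p><p>Great tacos at <a href="http://example.com/taqueria">Taqueria X</a></p></body></html>'
>>> doc = lxml.html.fromstring(page)
>>> text = doc.text_content()    # the article text
>>> links = [link for element, attribute, link, pos in doc.iterlinks()]
>>> posted = parser.parse(doc.find_class('date')[0].text)
>>> posted
datetime.datetime(2010, 9, 25, 10, 36, 28)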

Parsing dates and times with dateutil


If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up timezones. When combined, these modules make it quite easy to parse strings into timezone-aware datetime objects.

Getting ready

You can install dateutil using pip or easy_install; note that the package on PyPI is named python-dateutil, so the command is sudo pip install python-dateutil==2.0 or sudo easy_install python-dateutil==2.0. You need the 2.0 version for Python 3 compatibility. The complete documentation can be found at http://labix.org/python-dateutil.

How to do it...

Let's dive into a few parsing examples:

>>> from dateutil import parser
>>> parser.parse('Thu Sep 25 10:36:28 2010')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('Thursday, 25. September 2010 10:36AM')
datetime.datetime(2010, 9, 25, 10, 36)
>>> parser.parse('9/25/2010 10...
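
The listing is cut off here. One format worth showing, since the next recipe refers back to it, is the ISO 8601 UTC format; this extra example is my own illustration, not necessarily the book's missing line:

>>> parser.parse('2010-09-25T10:36:28Z')
datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc())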

Timezone lookup and conversion


Most datetime objects returned from the dateutil parser are naïve, meaning they don't have an explicit tzinfo, which specifies the timezone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that's because it's in the standard ISO format for UTC datetime strings. UTC is Coordinated Universal Time, and is basically the same as GMT. ISO is the International Organization for Standardization, which, among other things, specifies standard datetime formatting.

Python datetime objects can either be naïve or aware. If a datetime object has a tzinfo, then it is aware. Otherwise, the datetime is naïve. To make a naïve datetime object timezone aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to others to actually implement tzinfo creation. This is where the tz module of dateutil comes in: it provides everything you need to look up timezones from your...
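
The passage is truncated above; as a hedged sketch of the tz module's core calls (the timezone name is just an example, and the exact repr of the result varies by platform):

>>> from dateutil import tz
>>> import datetime
>>> utc = tz.tzutc()
>>> pacific = tz.gettz('US/Pacific')     # look up a timezone by name
>>> naive = datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> aware = naive.replace(tzinfo=utc)    # attach an explicit tzinfo
>>> aware.astimezone(pacific)            # convert to another timezone
datetime.datetime(2010, 9, 25, 3, 36, 28, tzinfo=tzfile('/usr/share/zoneinfo/US/Pacific'))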

Extracting URLs from HTML with lxml


A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.

Getting ready

lxml is a Python binding for the C libraries libxml2 and libxslt. This makes it a very fast XML and HTML parsing library, while still being Pythonic. But that also means you need to install the C libraries for it to work. Installation instructions are available at http://lxml.de/installation.html. But if you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml. You can also try doing pip install lxml. The latest version as of this writing is 3.3.5.

How to do it...

lxml comes with an html module designed specifically for parsing HTML. Using the fromstring() function, we can parse an HTML string and get a list of...
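
The recipe is cut off above; a minimal hedged sketch of that workflow, using made-up markup, looks like this:

>>> from lxml import html
>>> doc = html.fromstring('<html><body><a href="/first">one</a> <a href="/second">two</a></body></html>')
>>> [link for element, attribute, link, pos in doc.iterlinks()]
['/first', '/second']
>>> doc.xpath('//a/@href')   # an equivalent XPath query
['/first', '/second']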

Cleaning and stripping HTML


Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.

Getting ready

You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The result is much cleaner and easier to deal with.

How it works...

The lxml.html.clean.clean_html() function parses the HTML string into a tree and then iterates over and removes all nodes that should be removed. It...
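
The explanation is truncated above. For finer control than the clean_html() shortcut, lxml also provides a Cleaner class in the same module, whose keyword arguments select what gets stripped; the following is a hedged sketch, not the book's own example:

>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(javascript=True, style=True)   # strip script and CSS content
>>> cleaned = cleaner.clean_html('<html><head><style>p {}</style></head><body onload=loadfunc()>my text</body></html>')
>>> # cleaned now holds the markup with the style element and the
>>> # onload handler removed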

Converting HTML entities with BeautifulSoup


HTML entities are strings such as "&amp;" or "&lt;". These are encodings of normal ASCII characters that have special meanings in HTML. For example, "&lt;" is the entity for "<". You can't have a bare "<" in HTML text because it marks the beginning of a tag, hence the need to escape it with the "&lt;" entity. Likewise, "&amp;" is the entity for "&", which, as we've just seen, is the character that begins an entity code. If you need to process the text within an HTML document, you'll want to convert these entities back to their normal characters so that you can recognize and handle them appropriately.

Getting ready

You'll need to install BeautifulSoup, which you should be able to do with sudo pip install beautifulsoup4 or sudo easy_install beautifulsoup4. You can read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/.

How to do it...

BeautifulSoup is an HTML parser library...
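
The recipe is cut off above; the key behavior is that BeautifulSoup converts entities to their Unicode characters as it parses, as in this minimal hedged sketch:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('&amp;', 'html.parser').get_text()
'&'
>>> BeautifulSoup('&lt;b&gt;bold&lt;/b&gt;', 'html.parser').get_text()
'<b>bold</b>'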

Detecting and converting character encodings


A common problem in text processing is encountering text with a nonstandard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. In cases where you have non-ASCII or non-UTF-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before doing further processing.

Getting ready

You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict containing two keys: confidence and encoding. The confidence value is a probability of how confident charade is that the value for encoding is...
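
encoding.py ships with the book's downloadable code and isn't reproduced here; a hedged reconstruction of such wrappers might look like the following, where the function bodies are my assumption and only charade.detect() is the library's actual API:

import charade

def detect(data):
    # charade.detect() expects bytes and returns a dict with
    # 'encoding' and 'confidence' keys
    return charade.detect(data)

def convert(data, target='utf-8'):
    # hypothetical helper: decode bytes using the detected
    # encoding, then re-encode to a standard encoding
    encoding = detect(data)['encoding']
    return data.decode(encoding).encode(target)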
