Chapter 9. Parsing Specific Data Types

In this chapter, we will cover the following recipes:

  • Parsing dates and times with dateutil

  • Timezone lookup and conversion

  • Extracting URLs from HTML with lxml

  • Cleaning and stripping HTML

  • Converting HTML entities with BeautifulSoup

  • Detecting and converting character encodings

Introduction


This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries to accomplish this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to NLTK:

  • dateutil provides datetime parsing and timezone conversion

  • lxml and BeautifulSoup can parse, clean, and convert HTML

  • charade and UnicodeDammit can detect and convert text character encoding

These libraries can be useful for preprocessing text before passing it to an NLTK object, or for postprocessing text that NLTK has already processed and extracted. Coming up is an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once...
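
For illustration, here is a minimal sketch of that pipeline, shown as a hedged example rather than the book's own code; the markup and URL are made up, and the lxml and dateutil calls are covered in the recipes that follow:

>>> import lxml.html
>>> from dateutil import parser
>>> # hypothetical blog post markup
>>> page = '<html><body><p class="date">Thu Sep 25 10:36:28 2010</p><p>Great tacos at <a href="http://example.com/taqueria">Taqueria X</a></p></body></html>'
>>> doc = lxml.html.fromstring(page)
>>> text = doc.text_content()    # the article text
>>> links = [link for element, attribute, link, pos in doc.iterlinks()]
>>> posted = parser.parse(doc.find_class('date')[0].text)
>>> posted
datetime.datetime(2010, 9, 25, 10, 36, 28)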

Parsing dates and times with dateutil


If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up timezones. When combined, these modules make it quite easy to parse strings into timezone-aware datetime objects.

Getting ready

You can install dateutil using pip or easy_install; note that the package on PyPI is named python-dateutil, so the command is sudo pip install python-dateutil==2.0 or sudo easy_install python-dateutil==2.0. You need the 2.0 version for Python 3 compatibility. The complete documentation can be found at http://labix.org/python-dateutil.

How to do it...

Let's dive into a few parsing examples:

>>> from dateutil import parser
>>> parser.parse('Thu Sep 25 10:36:28 2010')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('Thursday, 25. September 2010 10:36AM')
datetime.datetime(2010, 9, 25, 10, 36)
>>> parser.parse('9/25/2010 10...
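
The listing is cut off here. One format worth showing, since the next recipe refers back to it, is the ISO 8601 UTC format; this extra example is my own illustration, not necessarily the book's missing line:

>>> parser.parse('2010-09-25T10:36:28Z')
datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc())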

Timezone lookup and conversion


Most datetime objects returned from the dateutil parser are naïve, meaning they don't have an explicit tzinfo, which specifies the timezone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that's because it's in the standard ISO format for UTC datetime strings. UTC is Coordinated Universal Time, and is basically the same as GMT. ISO is the International Organization for Standardization, which, among other things, specifies standard datetime formatting.

Python datetime objects can either be naïve or aware. If a datetime object has a tzinfo, then it is aware. Otherwise, the datetime is naïve. To make a naïve datetime object timezone aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to others to actually implement tzinfo creation. This is where the tz module of dateutil comes in: it provides everything you need to look up timezones from your...
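
The passage is truncated above; as a hedged sketch of the tz module's core calls (the timezone name is just an example, and the exact repr of the result varies by platform):

>>> from dateutil import tz
>>> import datetime
>>> utc = tz.tzutc()
>>> pacific = tz.gettz('US/Pacific')     # look up a timezone by name
>>> naive = datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> aware = naive.replace(tzinfo=utc)    # attach an explicit tzinfo
>>> aware.astimezone(pacific)            # convert to another timezone
datetime.datetime(2010, 9, 25, 3, 36, 28, tzinfo=tzfile('/usr/share/zoneinfo/US/Pacific'))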

Extracting URLs from HTML with lxml


A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.

Getting ready

lxml is a Python binding for the C libraries libxml2 and libxslt. This makes it a very fast XML and HTML parsing library, while still being Pythonic. But that also means you need to install the C libraries for it to work. Installation instructions are available at http://lxml.de/installation.html. But if you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml. You can also try doing pip install lxml. The latest version as of this writing is 3.3.5.

How to do it...

lxml comes with an html module designed specifically for parsing HTML. Using the fromstring() function, we can parse an HTML string and get a list of...
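
The recipe is cut off above; a minimal hedged sketch of that workflow, using made-up markup, looks like this:

>>> from lxml import html
>>> doc = html.fromstring('<html><body><a href="/first">one</a> <a href="/second">two</a></body></html>')
>>> [link for element, attribute, link, pos in doc.iterlinks()]
['/first', '/second']
>>> doc.xpath('//a/@href')   # an equivalent XPath query
['/first', '/second']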

Cleaning and stripping HTML


Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.

Getting ready

You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The result is much cleaner and easier to deal with.

How it works...

The lxml.html.clean.clean_html() function parses the HTML string into a tree and then iterates over and removes all nodes that should be removed. It...
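
The explanation is truncated above. For finer control than the clean_html() shortcut, lxml also provides a Cleaner class in the same module, whose keyword arguments select what gets stripped; the following is a hedged sketch, not the book's own example:

>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(javascript=True, style=True)   # strip script and CSS content
>>> cleaned = cleaner.clean_html('<html><head><style>p {}</style></head><body onload=loadfunc()>my text</body></html>')
>>> # cleaned now holds the markup with the style element and the
>>> # onload handler removed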

Converting HTML entities with BeautifulSoup


HTML entities are strings such as "&amp;" or "&lt;". These are encodings of normal ASCII characters that have special meanings in HTML. For example, "&lt;" is the entity for "<". You can't have a bare "<" in HTML text because it marks the beginning of a tag, hence the need to escape it with the "&lt;" entity. Likewise, "&amp;" is the entity for "&", which, as we've just seen, is the character that begins an entity code. If you need to process the text within an HTML document, you'll want to convert these entities back to their normal characters so that you can recognize and handle them appropriately.

Getting ready

You'll need to install BeautifulSoup, which you should be able to do with sudo pip install beautifulsoup4 or sudo easy_install beautifulsoup4. You can read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/.

How to do it...

BeautifulSoup is an HTML parser library...
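
The recipe is cut off above; the key behavior is that BeautifulSoup converts entities to their Unicode characters as it parses, as in this minimal hedged sketch:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('&amp;', 'html.parser').get_text()
'&'
>>> BeautifulSoup('&lt;b&gt;bold&lt;/b&gt;', 'html.parser').get_text()
'<b>bold</b>'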

Detecting and converting character encodings


A common problem in text processing is encountering text with a nonstandard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. In cases where you have non-ASCII or non-UTF-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before doing further processing.

Getting ready

You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict containing two keys: confidence and encoding. The confidence value is a probability of how confident charade is that the value for encoding is...
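
encoding.py ships with the book's downloadable code and isn't reproduced here; a hedged reconstruction of such wrappers might look like the following, where the function bodies are my assumption and only charade.detect() is the library's actual API:

import charade

def detect(data):
    # charade.detect() expects bytes and returns a dict with
    # 'encoding' and 'confidence' keys
    return charade.detect(data)

def convert(data, target='utf-8'):
    # hypothetical helper: decode bytes using the detected
    # encoding, then re-encode to a standard encoding
    encoding = detect(data)['encoding']
    return data.decode(encoding).encode(target)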
