Parsing Specific Data in Python Text Processing

Jacob Perkins

November 2010

Python Text Processing with NLTK 2.0 Cookbook

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

This article covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries for accomplishing this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to the NLTK:

  • dateutil: Provides date/time parsing and time zone conversion
  • timex: Can identify time words in text
  • lxml and BeautifulSoup: Can parse, clean, and convert HTML
  • chardet: Detects the character encoding of text

The libraries can be useful for pre-processing text before passing it to an NLTK object, or post-processing text that has been processed and extracted using NLTK. Here's an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once you have the article text, you can use chardet to ensure it's UTF-8 before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification, to create additional metadata about the article. If there's an event happening at the restaurant, you may be able to discover that by looking at the time words identified by timex. The point of this example is that real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this article can help with those additional requirements.

Parsing dates and times with Dateutil

If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up time zones. Combined, these modules make it quite easy to parse strings into time zone aware datetime objects.

Getting ready

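You can install dateutil using pip or easy_install. The package is published under the name python-dateutil, so the commands are sudo pip install python-dateutil or sudo easy_install python-dateutil. Complete documentation is available on the dateutil project site.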

How to do it...

Let's dive into a few parsing examples:

>>> from dateutil import parser
>>> parser.parse('Thu Sep 25 10:36:28 2010')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('Thursday, 25. September 2010 10:36AM')
datetime.datetime(2010, 9, 25, 10, 36)
>>> parser.parse('9/25/2010 10:36:28')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('9/25/2010')
datetime.datetime(2010, 9, 25, 0, 0)
>>> parser.parse('2010-09-25T10:36:28Z')
datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc())

As you can see, all it takes is importing the parser module and calling the parse() function with a datetime string. The parser will do its best to return a sensible datetime object, but if it cannot parse the string, it will raise a ValueError.
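
Because parse() signals failure with a ValueError, a small wrapper can be convenient when you process many candidate strings and do not want the first bad one to stop the run. The following is a minimal sketch, assuming Python 3 and a current dateutil release; parse_or_none is a hypothetical helper name, not part of dateutil.

```python
from datetime import datetime
from dateutil import parser

def parse_or_none(text):
    """Return a datetime for text, or None if dateutil cannot parse it.

    parse_or_none is a hypothetical helper, not part of dateutil.
    """
    try:
        return parser.parse(text)
    except (ValueError, OverflowError):
        # ValueError: unrecognized format; OverflowError: a number is too large
        return None

candidates = ['9/25/2010 10:36:28', 'hello world']
parsed = [parse_or_none(c) for c in candidates]
print(parsed)
```

Returning None keeps the caller in control: it can log the rejected string, retry with fuzzy parsing, or skip the record.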

How it works...

The parser does not use regular expressions. Instead, it looks for recognizable tokens and does its best to guess what those tokens refer to. The order of these tokens matters. For example, some cultures use a date format that looks like Month/Day/Year (the default order), while others use a Day/Month/Year format. To deal with this, the parse() function takes an optional keyword argument dayfirst, which defaults to False. If you set it to True, it can correctly parse dates in the latter format.

>>> parser.parse('25/9/2010', dayfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)

Another ordering issue can occur with two-digit years. For example, '10-9-25' is ambiguous. Since dateutil defaults to the Month-Day-Year format, '10-9-25' is parsed to the year 2025. But if you pass yearfirst=True into parse(), it will be parsed to the year 2010.

>>> parser.parse('10-9-25')
datetime.datetime(2025, 10, 9, 0, 0)
>>> parser.parse('10-9-25', yearfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)

There's more...

The dateutil parser can also do fuzzy parsing, which allows it to ignore extraneous characters in a datetime string. With the default value of False, parse() will raise a ValueError when it encounters unknown tokens. But if fuzzy=True, then a datetime object can usually be returned.

>>> try:
...     parser.parse('9/25/2010 at about 10:36AM')
... except ValueError:
...     'cannot parse'
'cannot parse'
>>> parser.parse('9/25/2010 at about 10:36AM', fuzzy=True)
datetime.datetime(2010, 9, 25, 10, 36)
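
When fuzzy parsing is used, it can be useful to see which parts of the string were ignored. The sketch below assumes Python 3 and a dateutil release that supports the fuzzy_with_tokens keyword (an assumption about your installed version, since it is not covered in the recipe above). With this keyword, parse() returns a tuple of the datetime and the skipped portions of the string.

```python
import datetime
from dateutil import parser

text = '9/25/2010 at about 10:36AM'

# fuzzy_with_tokens=True implies fuzzy=True and returns
# (datetime, tuple_of_skipped_substrings)
dt, skipped = parser.parse(text, fuzzy_with_tokens=True)

print(dt)
print(skipped)
```

Inspecting the skipped substrings is a cheap sanity check that the parser did not silently discard something meaningful.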

Time zone lookup and conversion

Most datetime objects returned from the dateutil parser are naive, meaning they don't have an explicit tzinfo, which specifies the time zone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that's because it's in the standard ISO format for UTC date and time strings. UTC is Coordinated Universal Time, which for practical purposes is the same as GMT. ISO is the International Organization for Standardization, which among other things, specifies standard date and time formatting (ISO 8601).

Python datetime objects can either be naive or aware. If a datetime object has a tzinfo, then it is aware. Otherwise the datetime is naive. To make a naive datetime object time zone aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to others to actually implement tzinfo creation. This is where the tz module of dateutil comes in: it provides everything you need to look up time zones from your OS time zone data.

Getting ready

dateutil should be installed using pip or easy_install. You should also make sure your operating system has time zone data. On Linux, this is usually found in /usr/share/zoneinfo, and the Ubuntu package is called tzdata. If you have a number of files and directories in /usr/share/zoneinfo, such as America/, Europe/, and so on, then you should be ready to proceed. The following examples show directory paths for Ubuntu Linux.

How to do it...

Let's start by getting a UTC tzinfo object. This can be done by calling tz.tzutc(), and you can check that the offset is 0 by calling the utcoffset() method with a UTC datetime object.

>>> from dateutil import tz
>>> tz.tzutc()
tzutc()
>>> import datetime
>>> tz.tzutc().utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0)

To get tzinfo objects for other time zones, you can pass in a time zone file path to the gettz() function.

>>> tz.gettz('US/Pacific')
>>> tz.gettz('US/Pacific').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(-1, 61200)
>>> tz.gettz('Europe/Paris')
>>> tz.gettz('Europe/Paris').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0, 7200)

You can see the UTC offsets are timedelta objects, where the first number is days, and the second number is seconds.
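
The days-plus-seconds representation can be hard to read at a glance, especially for negative offsets, which are stored as -1 day plus a positive number of seconds. A minimal sketch (Python 3, standard library only; offset_hours is a hypothetical helper name) that converts such a timedelta to signed hours:

```python
from datetime import timedelta

def offset_hours(offset):
    """Convert a UTC offset timedelta to signed hours."""
    return offset.total_seconds() / 3600

pacific = timedelta(-1, 61200)  # -1 day + 61200 seconds
paris = timedelta(0, 7200)

print(offset_hours(pacific))  # -7.0
print(offset_hours(paris))    # 2.0
```

These are the values shown above, which correspond to daylight saving time in both zones.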

If you're storing datetimes in a database, it's a good idea to store them all in UTC to eliminate any time zone ambiguity. Even if the database can recognize time zones, it's still a good practice.

To convert a non-UTC datetime object to UTC, it must be made time zone aware. If you try to convert a naive datetime to UTC, you'll get a ValueError exception. To make a naive datetime time zone aware, you simply call the replace() method with the correct tzinfo. Once a datetime object has a tzinfo, then UTC conversion can be performed by calling the astimezone() method with tz.tzutc().

>>> pst = tz.gettz('US/Pacific')
>>> dt = datetime.datetime(2010, 9, 25, 10, 36)
>>> dt.tzinfo
>>> dt.astimezone(tz.tzutc())
Traceback (most recent call last):
  File "/usr/lib/python2.6/", line 1248, in __run
    compileflags, 1) in test.globs
  File "<doctest __main__[22]>", line 1, in <module>
ValueError: astimezone() cannot be applied to a naive datetime
>>> dt.replace(tzinfo=pst)
datetime.datetime(2010, 9, 25, 10, 36, tzinfo=tzfile('/usr/share/
>>> dt.replace(tzinfo=pst).astimezone(tz.tzutc())
datetime.datetime(2010, 9, 25, 17, 36, tzinfo=tzutc())

How it works...

The tzutc and tzfile objects are both subclasses of tzinfo. As such, they know the correct UTC offset for time zone conversion (which is 0 for tzutc). A tzfile object knows how to read your operating system's zoneinfo files to get the necessary offset data. The replace() method of a datetime object does what its name implies—it replaces attributes. Once a datetime has a tzinfo, the astimezone() method will be able to convert the time using the UTC offsets, and then replace the current tzinfo with the new tzinfo.

Note that both replace() and astimezone() return new datetime objects. They do not modify the current object.
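
The replace() then astimezone() pattern can be tried without any zoneinfo files by using a fixed offset. This is a sketch in Python 3 using the standard library's datetime.timezone in place of dateutil's tzfile. The fixed UTC-7 offset is an assumption standing in for US/Pacific during daylight saving time, and unlike a tzfile it does not track DST changes.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset (illustrative stand-in for US/Pacific in September)
pdt = timezone(timedelta(hours=-7), 'PDT')

naive = datetime(2010, 9, 25, 10, 36)
aware = naive.replace(tzinfo=pdt)        # new object; naive is unchanged
in_utc = aware.astimezone(timezone.utc)  # another new object

print(naive.tzinfo)  # None
print(in_utc)        # 2010-09-25 17:36:00+00:00
```

The original naive object still has no tzinfo after both calls, which confirms that neither method modifies it in place.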

There's more...

You can pass a tzinfos keyword argument into the dateutil parser to detect otherwise unrecognized time zones.

>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)',
... fuzzy=True)
datetime.datetime(2010, 8, 4, 18, 30)
>>> tzinfos = {'CDT': tz.gettz('US/Central')}
>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)',
... fuzzy=True, tzinfos=tzinfos)
datetime.datetime(2010, 8, 4, 18, 30, tzinfo=tzfile('/usr/share/

In the first instance, we get a naive datetime since the time zone is not recognized. However, when we pass in the tzinfos mapping, we get a time zone aware datetime.

Local time zone

If you want to look up your local time zone, you can call tz.tzlocal(), which will use whatever your operating system thinks is the local time zone. In Ubuntu Linux, this is usually specified in the /etc/timezone file.

Custom offsets

You can create your own tzinfo object with a custom UTC offset using the tzoffset object. A custom offset of one hour can be created as follows:

>>> tz.tzoffset('custom', 3600)
tzoffset('custom', 3600)

You must provide a name as the first argument, and the offset time in seconds as the second argument.

Tagging temporal expressions with Timex

The NLTK project has a little-known contrib repository that contains, among other things, a module called timex.py that can tag temporal expressions. A temporal expression is just one or more time words, such as "this week" or "next month". These are ambiguous expressions that are relative to some other point in time, such as when the text was written. The timex module provides a way to annotate text so these expressions can be extracted for further analysis.

Getting ready

The timex.py module is part of the nltk_contrib package, which is separate from the current version of NLTK. This means you need to install the package yourself, or use a standalone copy of the timex.py module.

If you want to install the entire nltk_contrib package, you can check out the source and run sudo python setup.py install from within the nltk_contrib folder. If you do this, you'll need to do from nltk_contrib import timex instead of just import timex as done in the following How to do it... section.

For this recipe, you have to download the timex.py module into the same folder as the rest of the code, so that import timex does not cause an ImportError.

You'll also need to get the egenix-mx-base package installed. This is a C extension library for Python, so if you have all the correct Python development headers installed, you should be able to do sudo pip install egenix-mx-base or sudo easy_install egenix-mx-base. If you're running Ubuntu Linux, you can instead do sudo apt-get install python-egenix-mxdatetime. If none of those work, you can download the package and find installation instructions on the eGenix website.

How to do it...

Using timex is very simple: pass a string into the timex.tag() function and get back an annotated string. The annotations will be XML TIMEX tags surrounding each temporal expression.

>>> import timex
>>> timex.tag("Let's go sometime this week")
"Let's go sometime <TIMEX2>this week</TIMEX2>"
>>> timex.tag("Tomorrow I'm going to the park.")
"<TIMEX2>Tomorrow</TIMEX2> I'm going to the park."

How it works...

The implementation of timex.py is essentially over 300 lines of conditional regular expression matches. When one of the known expressions matches, it creates a RelativeDateTime object (from the mx.DateTime module). This RelativeDateTime is then converted back to a string with surrounding TIMEX tags and replaces the original matched string in the text.
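
The general idea of regular-expression-based tagging can be sketched in a few lines. This is not the timex implementation: it is a simplified illustration in Python 3 with a tiny, hypothetical set of patterns, it does not build RelativeDateTime objects, and it makes no attempt to skip text that is already tagged.

```python
import re

# A tiny, hypothetical subset of temporal patterns for illustration only
TEMPORAL = re.compile(
    r'\b(?:today|tomorrow|yesterday|(?:this|next|last) (?:week|month|year))\b',
    re.IGNORECASE)

def tag(text):
    """Wrap each matched temporal expression in TIMEX2 tags."""
    return TEMPORAL.sub(lambda m: '<TIMEX2>%s</TIMEX2>' % m.group(0), text)

print(tag("Let's go sometime this week"))
print(tag("Tomorrow I'm going to the park."))
```

Using a substitution callback keeps the original casing of the matched words, so "Tomorrow" stays capitalized inside the tags.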

There's more...

timex is smart enough not to tag expressions that have already been tagged, so it's ok to pass TIMEX tagged text into the tag() function.

>>> timex.tag("Let's go sometime <TIMEX2>this week</TIMEX2>")
"Let's go sometime <TIMEX2>this week</TIMEX2>"

Extracting URLs from HTML with lxml

A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.

Getting ready

lxml is a Python binding for the C libraries libxml2 and libxslt. This makes it a very fast XML and HTML parsing library, while still being Pythonic. However, that also means you need to install the C libraries for it to work. Installation instructions are available on the lxml website. If you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml.

How to do it...

lxml comes with an html module designed specifically for parsing HTML. Using the fromstring() function, we can parse an HTML string, then get a list of all the links. The iterlinks() method generates four-tuples of the form (element, attr, link, pos):

  • element: This is the parsed node that contains the link, such as an anchor tag. If you're just interested in the link, you can ignore this.
  • attr: This is the attribute the link came from, which is usually href.
  • link: This is the actual URL extracted from the element.
  • pos: This is the character offset at which the link starts within the attribute value (or within the element's text, for links embedded in text such as CSS). For an ordinary href attribute, pos is 0.

Following is some code to demonstrate:

>>> from lxml import html
>>> doc = html.fromstring('Hello <a href="/world">world</a>')
>>> links = list(doc.iterlinks())
>>> len(links)
1
>>> (el, attr, link, pos) = links[0]
>>> attr
'href'
>>> link
'/world'
>>> pos
0

How it works...

lxml parses the HTML into an ElementTree. This is a tree structure of parent nodes and child nodes, where each node represents an HTML tag, and contains all the corresponding attributes of that tag. Once the tree is created, it can be iterated on to find elements, such as the a or anchor tag. The core tree handling code is in the lxml.etree module, while the lxml.html module contains only HTML-specific functions for creating and iterating a tree. For complete documentation, see the lxml tutorial.
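
The tree idea can be illustrated with the standard library's xml.etree.ElementTree, whose API lxml.etree largely mirrors. This is a sketch in Python 3 that swaps in the standard library for lxml, so it requires well-formed markup (note the wrapping p tag, an addition made for that reason), whereas lxml.html is far more forgiving.

```python
import xml.etree.ElementTree as ET

# Parse a well-formed fragment into a tree of elements
root = ET.fromstring('<p>Hello <a href="/world">world</a></p>')

# iter('a') walks the tree and yields every anchor element
hrefs = [a.get('href') for a in root.iter('a')]

print(root.tag, hrefs)  # p ['/world']
```

Each element exposes its tag name, attributes, and text, which is all that link extraction needs.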

There's more...

You'll notice in the previous code that the link is relative, meaning it's not an absolute URL. We can make it absolute by calling the make_links_absolute() method with a base URL before extracting the links.

>>> doc.make_links_absolute('http://hello')
>>> abslinks = list(doc.iterlinks())
>>> (el, attr, link, pos) = abslinks[0]
>>> link
'http://hello/world'
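
Resolving a relative link against a base URL is conceptually the same operation as the standard library's urljoin(). The following sketch (Python 3; the extra example links are hypothetical) shows how relative and absolute links behave when joined with the base URL used above.

```python
from urllib.parse import urljoin

base = 'http://hello'
links = ['/world', 'page.html', 'http://other/abs']

# Relative links are resolved against base; absolute links pass through
absolute = [urljoin(base, link) for link in links]

for original, resolved in zip(links, absolute):
    print(original, '->', resolved)
```

Links that are already absolute are left unchanged, so it is safe to apply the join to every extracted link.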

Extracting links directly

If you don't want to do anything other than extract links, you can call the iterlinks() function with an HTML string.

>>> links = list(html.iterlinks('Hello <a href="/world">world</a>'))
>>> links[0][2]
'/world'

Parsing HTML from URLs or files

Instead of parsing an HTML string using the fromstring() function, you can call the parse() function with a URL or file name, for example html.parse("http://my/url") or html.parse("/path/to/file"). Note that parse() returns an ElementTree rather than a single element. Calling getroot() on the result gives you a root element comparable to what you would get by loading the URL or file into a string yourself and then calling fromstring().

Extracting links with XPaths

Instead of using the iterlinks() method, you can also get links using the xpath() method, which is a general way to extract whatever you want from HTML or XML parse trees.

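>>> doc.xpath('//a/@href')[0]
'http://hello/world'

The link is absolute here because make_links_absolute() modified doc in place earlier in the session. For more on XPath syntax, consult any XPath reference.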

Cleaning and stripping HTML

Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text. Or you may want to remove the HTML entirely, and process only the text. This recipe covers how to do both of these preprocessing actions.

Getting ready

You'll need to install lxml. See the previous recipe for installation instructions. You'll also need NLTK installed for stripping HTML.

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string.

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The result is much cleaner and easier to deal with. The full module path to the clean_html() function is used because there's also a clean_html() function in the nltk.util module, but its purpose is different. The nltk.util.clean_html() function removes all HTML tags when you just want the text.

>>> import nltk.util
>>> nltk.util.clean_html('<div><body>my text</body></div>')
'my text'

How it works...

The lxml.html.clean.clean_html() function parses the HTML string into a tree, then iterates over and removes all nodes that should be removed. It also cleans nodes of unnecessary attributes (such as embedded JavaScript) using regular expression matching and substitution.

The nltk.util.clean_html() function performs a bunch of regular expression substitutions to remove HTML tags. To be safe, it's best to strip the HTML after cleaning it to ensure the regular expressions will match.
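
If you want to strip tags without relying on regular expressions, a parser-based approach is an alternative. The sketch below is not how nltk.util.clean_html() works. It is a swapped-in technique using Python 3's standard html.parser module, and TextExtractor and strip_html are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of markup with whitespace normalized."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return ' '.join(''.join(extractor.parts).split())

print(strip_html('<div><script>x()</script><body>my text</body></div>'))
```

Because the parser tracks when it is inside a script or style element, embedded code is dropped even if the HTML has not been cleaned first.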

There's more...

The lxml.html.clean module defines a default Cleaner class that's used when you call clean_html(). You can customize the behavior of this class by creating your own instance and calling its clean_html() method. For more details on this class, see the lxml.html.clean documentation.

Converting HTML entities with BeautifulSoup

HTML entities are strings such as &amp; or &lt;. These are encodings of normal ASCII characters that have special uses in HTML. For example, &lt; is the entity for <. You can't just have < within HTML tags because it is the beginning character for an HTML tag, hence the need to escape it and define the &lt; entity. The entity code for & is &amp; because, as we've just seen, & is the beginning character for an entity code. If you need to process the text within an HTML document, then you'll want to convert these entities back to their normal characters so you can recognize them and handle them appropriately.

Getting ready

You'll need to install BeautifulSoup, which you should be able to do with sudo pip install BeautifulSoup or sudo easy_install BeautifulSoup. You can read more about BeautifulSoup on its project website.

How to do it...

BeautifulSoup is an HTML parser library that also contains an XML parser called BeautifulStoneSoup. This is what we can use for entity conversion. It's quite simple: create an instance of BeautifulStoneSoup given a string containing HTML entities and specify the keyword argument convertEntities='html'. Convert this instance to a string, and you'll get the ASCII representation of the HTML entities.

>>> from BeautifulSoup import BeautifulStoneSoup
>>> unicode(BeautifulStoneSoup('&lt;', convertEntities='html'))
u'<'
>>> unicode(BeautifulStoneSoup('&amp;', convertEntities='html'))
u'&'

It's OK to run a string through the conversion multiple times, as long as the ASCII characters are not by themselves. If your string is just a single ASCII character that corresponds to an HTML entity, that character will be lost.

>>> unicode(BeautifulStoneSoup('<', convertEntities='html'))
u''
>>> unicode(BeautifulStoneSoup('< ', convertEntities='html'))
u'< '

To make sure the character isn't lost, all that's required is to have another character in the string that is not part of an entity code.

How it works...

To convert the HTML entities, BeautifulStoneSoup looks for tokens that look like an entity and replaces them with their corresponding value in the htmlentitydefs.name2codepoint dictionary from the Python standard library. It can do this if the entity token is within an HTML tag, or when it's in a normal string.
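
The lookup itself is easy to try directly. In Python 3 the htmlentitydefs module was renamed html.entities, and the standard library also provides html.unescape() for converting every entity in a string. The following sketch uses those standard-library tools rather than BeautifulStoneSoup.

```python
import html
from html.entities import name2codepoint

# Direct lookup: entity name -> Unicode code point -> character
print(chr(name2codepoint['lt']))   # <
print(chr(name2codepoint['amp']))  # &

# html.unescape converts every entity in a string in one pass
print(html.unescape('1 &lt; 2 &amp;&amp; 3 &gt; 2'))  # 1 < 2 && 3 > 2
```

Unlike the single-character case described above, html.unescape() leaves a lone < or & untouched.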

There's more...

BeautifulSoup is an excellent HTML and XML parser in its own right, and can be a great alternative to lxml. It's particularly good at handling malformed HTML. You can read more about how to use it in the BeautifulSoup documentation.

Extracting URLs with BeautifulSoup

Here's an example of using BeautifulSoup to extract URLs, like we did in the Extracting URLs from HTML with lxml recipe. You first create the soup with an HTML string, call the findAll() method with 'a' to get all anchor tags, and pull out the 'href' attribute to get the URLs.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('Hello <a href="/world">world</a>')
>>> [a['href'] for a in soup.findAll('a')]
[u'/world']

Detecting and converting character encodings

A common occurrence with text processing is finding text that has a non-standard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. In cases when you have non-ASCII or non-UTF-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before further processing it.

Getting ready

You'll need to install the chardet module, using sudo pip install chardet or sudo easy_install chardet. You can learn more about chardet from its project documentation.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the chardet module. To detect the encoding of a string, call encoding.detect(). You'll get back a dict containing two keys: confidence and encoding. confidence is chardet's estimate of the probability that the value for encoding is correct.

# -*- coding: utf-8 -*-
import chardet

def detect(s):
    # chardet expects a byte string; encode unicode input and retry
    try:
        return chardet.detect(s)
    except UnicodeDecodeError:
        return chardet.detect(s.encode('utf-8'))

def convert(s):
    encoding = detect(s)['encoding']

    if encoding == 'utf-8':
        return unicode(s)
    else:
        # decode the byte string from its detected encoding
        return unicode(s, encoding)

Here's some example code using detect() to determine character encoding:

>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect(u'abcdé')
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
>>> encoding.detect('\222\222\223\225')
{'confidence': 0.5, 'encoding': 'windows-1252'}

To convert a string to unicode, call encoding.convert(). This will decode the string from its original encoding into a unicode object, which can then be re-encoded as UTF-8.

>>> encoding.convert('ascii')
u'ascii'
>>> encoding.convert(u'abcdé')
u'abcd\xe9'
>>> encoding.convert('\222\222\223\225')
u'\u2019\u2019\u201c\u2022'

How it works...

The detect() function is a wrapper around chardet.detect() which can handle UnicodeDecodeError exceptions. In these cases, the string is encoded in UTF-8 before trying to detect the encoding.

The convert() function first calls detect() to get the encoding, then returns a unicode string with the encoding as the second argument. By passing the encoding into unicode(), the string is decoded from the original encoding, allowing it to be re-encoded into a standard encoding.
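
Python 3 separates bytes from str, so the same idea looks slightly different there. The following is a hedged Python 3 sketch of the two wrappers, not the module described above. The fallback to UTF-8 when chardet cannot guess, and the use of errors='replace', are assumptions made for illustration.

```python
import chardet

def detect(data):
    """Return chardet's guess for a bytes object (str is encoded first)."""
    if isinstance(data, str):
        data = data.encode('utf-8')
    return chardet.detect(data)

def convert(data):
    """Decode bytes to str using the detected encoding."""
    if isinstance(data, str):
        return data  # already decoded text
    # chardet reports None when it cannot guess; fall back to UTF-8 (assumption)
    encoding = detect(data)['encoding'] or 'utf-8'
    return data.decode(encoding, errors='replace')

print(convert(b'ascii'))
print(detect(b'ascii')['encoding'])
```

Passing errors='replace' trades strictness for robustness: a wrong guess produces replacement characters instead of an exception.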

There's more...

The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint to the Python interpreter, telling it which encoding to use for the strings in the code. This is helpful for when you have non-ASCII strings in your source code, and is documented in detail in PEP 263.

Converting to ASCII

If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents, or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'abcd\xe9').encode('ascii', 'ignore')
'abcde'

Specifying 'NFKD' as the first argument decomposes each character into its compatibility-equivalent base character plus any combining marks, so é becomes e followed by a combining accent. The final call to encode() with 'ignore' as the second argument then drops whatever cannot be represented in ASCII, such as the combining accent.
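
In Python 3, encode() returns bytes, so one more decode() step is needed to get text back. A minimal sketch (to_ascii is a hypothetical helper name):

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('abcd\xe9'))          # abcde
print(to_ascii('na\xefve caf\xe9'))  # naive cafe
```

Characters with no ASCII base form are removed entirely, so this conversion is lossy and best reserved for cases where pure ASCII is a hard requirement.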


This article covered parsing specific kinds of data, focusing primarily on dates, times, and HTML.
