Regular Expressions in Python 2.6 Text Processing

by Jeff McNeil | January 2011

In this article, by Jeff McNeil, author of Python 2.6 Text Processing: Beginners Guide, we'll look at the following aspects of regular expression usage:

  • Basic syntax and special characters. How do you build a regular expression, and what should you expect it to match?
  • More advanced processing. Grouping results and performing conditional matches via look-ahead and look-behind assertions. What makes an expression greedy?
  • Python's implementation. Elements such as matches versus searches, and regular expression compilation and its effect on processing.
  • What happens when we attempt to use regular expressions to process internationalized (non-ASCII) text or look at multiline data?

 


Simple string matching

Regular expressions are notoriously hard to read, especially if you're not familiar with the obscure syntax. For that reason, let's start simple and look at some easy regular expressions at the most basic level. Before we begin, remember that Python raw strings allow us to include backslashes without the need for additional escaping.

Whenever you define regular expressions, you should do so using the raw string syntax.
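
For instance, a single literal backslash has to be written as two pattern characters, and without a raw string each of those must be escaped again:

>>> import re
>>> re.match('\\\\', '\\') is not None
True
>>> re.match(r'\\', '\\') is not None
True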

Time for action – testing an HTTP URL

In this example, we'll check values as they're entered on the command line as a means to introduce the technology. We'll dive deeper into regular expressions as we move forward. We'll scan the URLs our end users supply to ensure they've entered valid data.

  1. Create a new file and name it url_regex.py.
  2. Enter the following code:

    import sys
    import re

    # Make sure we have a single URL argument.
    if len(sys.argv) != 2:
        print >>sys.stderr, "URL Required"
        sys.exit(-1)

    # Easier access.
    url = sys.argv[1]

    # Ensure we were passed a somewhat valid URL.
    # This is a superficial test.
    if re.match(r'^https?:/{2}\w.+$', url):
        print "This looks valid"
    else:
        print "This looks invalid"

  3. Now, run the example script on the command line a few times, passing different values to it.

    (text_processing)$ python url_regex.py http://www.jmcneil.net
    This looks valid
    (text_processing)$ python url_regex.py http://intranet
    This looks valid
    (text_processing)$ python url_regex.py http://www.packtpub.com
    This looks valid
    (text_processing)$ python url_regex.py https://store
    This looks valid
    (text_processing)$ python url_regex.py httpsstore
    This looks invalid
    (text_processing)$ python url_regex.py https:??store
    This looks invalid
    (text_processing)$

What just happened?

We took a look at a very simple pattern and introduced you to the plumbing needed to perform a match test. Let's walk through this little example, skipping the boilerplate code.

First of all, we imported the re module. The re module, as you probably inferred from the name, contains all of Python's regular expression support.

Any time you need to work with regular expressions, you'll need to import the re module.

Next, we read a URL from the command line and bind it to a variable, which makes for cleaner code. Directly below that, you should notice a line that reads re.match(r'^https?:/{2}\w.+$', url). This line checks whether the string referenced by url matches the ^https?:/{2}\w.+$ pattern.

If a match is found, we'll print a success message; otherwise, the end user would receive some negative feedback indicating that the input value is incorrect.

This example leaves out a lot of details regarding HTTP URL formats. If you were performing validation on user input, one place to look would be http://formencode.org/. FormEncode is an HTML form-processing and data-validation framework written by Ian Bicking.

Understanding the match function

The most basic method of testing for a match is via the re.match function, as we did in the previous example. The match function takes a regular expression pattern and a string value. For example, consider the following snippet of code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'pattern', 'pattern')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Here, we simply passed a regular expression of "pattern" and a string literal of "pattern" to the re.match function. As they were identical, the result was a match. The returned Match object indicates the match was successful. The re.match function returns None otherwise.

>>> re.match(r'pattern', 'failure')
>>>

Learning basic syntax

A regular expression is generally a collection of literal string data and special metacharacters that represents a pattern of text. The simplest regular expression is just literal text that only matches itself.

In addition to literal text, there is a series of special characters that can be used to convey additional meaning, such as repetition, sets, wildcards, and anchors. Generally, punctuation characters take on this responsibility.

Detecting repetition

When building up expressions, it's useful to be able to match certain repeating patterns without needing to duplicate values. It's also beneficial to perform conditional matches. This lets us check for content such as "match the letter a, followed by the number one at least three times, but no more than seven times."

For example, the code below does just that:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^a1{3,7}$', 'a1111111')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^a1{3,7}$', '1111111')
>>>

If the repetition operator follows a valid regular expression enclosed in parentheses, it will perform repetition on that entire expression. For example:

>>> re.match(r'^(a1){3,7}$', 'a1a1a1')
<_sre.SRE_Match object at 0x100493918>
>>> re.match(r'^(a1){3,7}$', 'a11111')
>>>

The following table details the special characters used for marking repeating values within a regular expression:

    Character    Description
    *            Matches zero or more instances of the previous entity.
    +            Matches one or more instances of the previous entity.
    ?            Matches zero or one instance of the previous entity.
    {m}          Matches exactly m instances of the previous entity.
    {m,n}        Matches at least m, and at most n, instances of the previous entity.
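
As a quick illustration of the first two operators, the star allows zero occurrences while the plus requires at least one:

>>> re.match(r'^ab*c$', 'ac') is not None
True
>>> re.match(r'^ab+c$', 'ac') is not None
False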

Specifying character sets and classes

In some circumstances, it's useful to collect groups of characters into a set such that any of the values in the set will trigger a match. It's also useful to match any character at all. The dot operator does just that.

A character set is enclosed within standard square brackets. A set defines a series of alternating (or) entities that will match a given text value. If the first character within a set is a caret (^) then a negation is performed. All characters not defined by that set would then match.

There are a couple of additional interesting set properties.

  1. For ranged values, it's possible to specify an entire selection using a hyphen. For example, '[0-6a-d]' would match all values between 0 and 6, and between a and d.
  2. Special characters listed within brackets lose their special meaning. The exceptions to this rule are the hyphen and the closing bracket.

If you need to include a closing bracket or a hyphen within a regular expression, you can either place them as the first elements in the set or escape them by preceding them with a backslash.
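
For instance, placing the closing bracket first keeps it literal:

>>> re.match(r'^[]#]+$', ']#]') is not None
True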

As an example, consider the following snippet, which matches a string containing a hexadecimal number.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^0x[a-f0-9]+$', '0xff')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f0-9]+$', '0x01')
<_sre.SRE_Match object at 0x1004816b0>
>>> re.match(r'^0x[a-f0-9]+$', '0xz')
>>>

In addition to the bracket notation, Python ships with some predefined classes. Generally, these are letter values prefixed with a backslash escape. When they appear within a set, the set includes all values for which they'll match. The \d escape matches all digit values. It would have been possible to write the above example in a slightly more compact manner.

>>> re.match(r'^0x[a-f\d]+$', '0x33')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f\d]+$', '0x3f')
<_sre.SRE_Match object at 0x1004816b0>
>>>

The following table outlines the character sets and classes available:

    Element    Description
    [abc]      A character set; matches any one of the enclosed characters.
    [^abc]     A negated set; matches any character not in the set.
    .          Matches any single character except a newline.
    \d         Matches a digit; equivalent to [0-9].
    \D         Matches any non-digit character.
    \w         Matches a word character; equivalent to [0-9a-zA-Z_].
    \W         Matches any non-word character.
    \s         Matches a whitespace character, such as a space or a tab.
    \S         Matches any non-whitespace character.

One thing that should become apparent is that the lowercase classes are matches, whereas their uppercase counterparts are their inverses.
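
A quick check shows the inversion at work; \D matches a letter, but not a digit:

>>> re.match(r'\D', 'a') is not None
True
>>> re.match(r'\D', '1') is not None
False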

Applying anchors to restrict matches

There are times where it's important that patterns match at a certain position within a string of text. Why is this important? Consider a simple number validation test. If a user enters a digit, but mistakenly includes a trailing letter, an expression checking for the existence of a digit alone will pass.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'\d', '1f')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Well, that's unexpected. The regular expression engine sees the leading '1' and considers it a match. It disregards the rest of the string as we've not instructed it to do anything else with it. To fix the problem that we have just seen, we need to apply anchors.

>>> re.match(r'^\d$', '6')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^\d$', '6f')
>>>

Now, attempting to sneak in a non-digit character results in no match. By preceding our expression with a caret (^) and terminating it with a dollar sign ($), we effectively said "between the start and the end of this string, there can only be one digit."

Anchors, among various other metacharacters, are considered zero-width matches. Basically, this means that a match doesn't advance the regular expression engine within the test string.

We're not limited to either end of a string, though. Here's a collection of the anchors provided by Python:

    Anchor    Description
    ^         Matches at the beginning of a string, and after each newline when re.MULTILINE is set.
    $         Matches at the end of a string, and before each newline when re.MULTILINE is set.
    \A        Matches only at the beginning of a string.
    \Z        Matches only at the end of a string.
    \b        Matches at a word boundary.
    \B        Matches everywhere except a word boundary.
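
For instance, \b is zero-width: the offsets reported below cover only the word itself, not the boundaries around it:

>>> m = re.search(r'\bham\b', 'green eggs and ham')
>>> m.start(), m.end()
(15, 18)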

Wrapping it up

Now that we've covered the basics of regular expression syntax, let's double back and take a look at the expression we used in our first example: ^https?:/{2}\w.+$. It's easier to understand if we break it down element by element.

Now that we've provided a bit of background, this pattern should make sense. We begin the regular expression with a caret, which matches the beginning of the string. The very next element is the literal http. As our caret matches the start of a string and must be immediately followed by http, this is equivalent to saying that our string must start with http.

Next, we include a question mark after the s in https. The question mark states that the previous entity should be matched either zero or one time. By default, the evaluation engine is looking character-by-character, so the previous entity in this case is simply "s." We do this so our test passes for both secure and non-secure addresses.

As we advance through the string, the next special term we run into is {2}, and it follows a simple forward slash. This says that the forward slash should appear exactly two times. Now, in the real world, it would probably make more sense to simply type the second slash. Using a repetition check like this not only requires more typing, but it also causes the regular expression engine to work harder.

Immediately after the repetition match, we include a \w. The \w, if you'll remember from the previous tables, expands to [0-9a-zA-Z_], or any word character. This is to ensure that our URL doesn't begin with a special character.

The dot character after the \w matches anything except a newline. Essentially, we're saying "match anything else; we don't much care what." The plus sign states that the preceding wildcard should match at least once.

Finally, we're anchoring the end of the string. However, in this example, this isn't really necessary.

Have a go hero – tidying up our URL test

There are a few intentional inconsistencies and problems with this regular expression as designed. To name a few:

  1. Properly formatted URLs should only contain a few special characters. Other values should be URL-encoded using percent escapes. This regular expression doesn't check for that.
  2. It's possible to include newline characters towards the end of the URL, which is clearly not supported by any browsers!
  3. The \w followed by the .+ implicitly sets a minimum of two characters after the protocol specification. A single letter is perfectly valid.

Using what we've covered thus far, it should be possible for you to backtrack and update our regular expression to fix these flaws. For more information on which characters are allowed, have a look at http://www.w3schools.com/tags/ref_urlencode.asp.

Advanced pattern matching

In addition to basic pattern matching, regular expressions let us handle some more advanced situations as well. It's possible to group characters for purposes of precedence and reference, perform conditional checks based on what exists later, or previously, in a string, and limit exactly how much of a match actually constitutes a match. Don't worry; we'll clarify that last phrase as we move on. Let's go!

Grouping

When crafting a regular expression string, there are generally two reasons you would wish to group expression components together: entity precedence or to enable access to matched parts later in your application.

Time for action – regular expression grouping

In this example, we'll return to our LogProcessing application. Here, we'll update our log split routines to divide lines up via a regular expression as opposed to simple string manipulation.

  1. In core.py, add an import re statement to the top of the file. This makes the regular expression engine available to us.
  2. Directly above the __init__ method definition for LogProcessor, add the following lines of code. These have been split to avoid wrapping.

    _re = re.compile(
        r'^([\d.]+) (\S+) (\S+) \[([\w/:+ ]+)\] "(.+?)" '
        r'(?P<rcode>\d{3}) (\S+) "(\S+)" "(.+)"')

  3. Now, we're going to replace the split method with one that takes advantage of the new regular expression:

    def split(self, line):
        """
        Split a logfile.

        Uses a simple regular expression to parse out the Apache
        logfile entries.
        """
        line = line.strip()
        match = re.match(self._re, line)
        if not match:
            raise ParsingError("Malformed line: " + line)
        return {
            'size': 0 if match.group(6) == '-'
                else int(match.group(6)),
            'status': match.group('rcode'),
            'file_requested': match.group(5).split()[1]
        }

  4. Running the logscan application should now produce the same output as it did when we were using a more basic, split-based approach.

    (text_processing)$ cat example3.log | logscan -c logscan.cfg

What just happened?

First of all, we imported the re module so that we have access to Python's regular expression services.

Next, at the LogProcessor class level, we defined a regular expression. This time, though, we did so via re.compile rather than as a simple string. Regular expressions that are used more than a handful of times should be "prepared" by running them through re.compile first. This eases the load placed on the system by frequently used patterns. The re.compile function returns an SRE_Pattern object that can be passed in just about anywhere you can pass in a regular expression.

We then replace our split method to take advantage of regular expressions. As you can see, we simply pass self._re in as opposed to a string-based regular expression. If we don't have a match, we raise a ParsingError, which bubbles up and generates an appropriate error message, much like we would see on an invalid split case.

Now, the end of the split method probably looks somewhat peculiar to you. Here, we've referenced our matched values via group identification mechanisms rather than by their list index into the split results. Regular expression components surrounded by parentheses create a group, which can be accessed via the group method on the Match object later down the road. It's also possible to access a previously matched group from within the same regular expression. Let's look at a somewhat smaller example.

>>> match = re.match(r'(0x[0-9a-f]+) (?P<two>\1)', '0xff 0xff')
>>> match.group(1)
'0xff'
>>> match.group(2)
'0xff'
>>> match.group('two')
'0xff'
>>> match.group('failure')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: no such group
>>>

Here, we surround two distinct regular expression components with parentheses, (0x[0-9a-f]+) and (?P<two>\1). The first expression matches a hexadecimal number. This becomes group ID 1. The second expression matches whatever was found by the first, via the use of the \1. The "backslash-one" syntax references the first match. So, this entire regular expression only matches when we repeat the same hexadecimal number twice, separated by a space. The (?P<two>...) syntax is detailed below.

As you can see, the match is referenced after the fact using the match.group method, which takes a numeric index as its argument. Using standard regular expressions, you'll need to refer to a matched group by its index number. However, if you look at the second group, we added a (?P<name>...) construct. This is a Python extension that lets us refer to groupings by name rather than by numeric group ID.

Finally, if an invalid group ID is passed in, an IndexError exception is thrown.

The following table outlines the characters used for building groups within a Python regular expression:

    Syntax           Description
    (...)            A standard group; the match is saved and accessible by numeric index.
    (?P<name>...)    A named group; the match is accessible by name as well as by numeric index.
    (?P=name)        Matches whatever the group named name matched earlier in the expression.
    (?:...)          A non-capturing group; groups for precedence, but the match is not saved.
    \number          Matches whatever the group with the given number matched earlier.
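
The non-capturing variant is handy when you need grouping for precedence without saving an extra match:

>>> re.match(r'(?:abc)+(\d+)', 'abcabc123').groups()
('123',)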

Finally, it's worth pointing out that parentheses can also be used to alter priority. For example, consider this code.

>>> re.match(r'abc{2}', 'abcc')
<_sre.SRE_Match object at 0x1004818b8>
>>> re.match(r'a(bc){2}', 'abcc')
>>> re.match(r'a(bc){2}', 'abcbc')
<_sre.SRE_Match object at 0x1004937b0>
>>>

Whereas the first example matches c exactly two times, the second and third lines require us to repeat bc twice. This changes the meaning of the regular expression from "repeat the previous character twice" to "repeat the previous match within parentheses twice." The value within the group could have been its own complex regular expression, such as a([b-c]){2}.

Have a go hero – updating our stats processor to use named groups

Spend a couple of minutes and update our statistics processor to use named groups rather than integer-based references. This makes the assignment code in the split method slightly easier to read. You do not need to create names for all of the groups; naming the ones we're actually using will do.

Using greedy versus non-greedy operators

Regular expressions generally like to match as much text as possible before giving up or yielding to the next token in a pattern string. If that behavior is unexpected and not fully understood, it can be difficult to get your regular expression correct. Let's take a look at a small code sample to illustrate the point.

Suppose that with your newfound knowledge of regular expressions, you decided to write a small script to remove the angled brackets surrounding HTML tags. You might be tempted to do it like this:

>>> match = re.match(r'(?P<tag><.+>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>Web Page</title>'
>>>

The result is probably not what you expected. The reason we got this result is that regular expressions are greedy by nature. That is, they'll attempt to match as much as possible. If you look closely, <title> is a match for the supplied regular expression, as is the entire <title>Web Page</title> string. Both start with an angled bracket, contain at least one character, and end with an angled bracket.

The fix is to insert the question mark character, or the non-greedy operator, directly after the repetition specification. So, the following code snippet fixes the problem.

>>> match = re.match(r'(?P<tag><.+?>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>'
>>>

The question mark changes our meaning from "match as much as you possibly can" to "match only the minimum required to actually match."


Assertions

In a lot of cases, it's beneficial to say, "match this only if this next thing matches." In essence, to perform a conditional match based on what might or might not appear later in a string of text.

This is possible via look-ahead and look-behind assertions. Like anchors, these elements consume no characters during the match process.

The first assertion we'll look at is the positive look-ahead. The positive look-ahead will only match at the current location if followed by what's in the assertion. For example:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match('(Python) (?=Programming)', 'Python Programming').groups()
('Python',)
>>> re.match('(Python) (?=Programming)', 'Python Snakes')
>>>

Note how there is only one group saved in the first match. This is because the positive lookahead does not consume any characters. To look at it another way, notice how the following snippet does not match at all:

>>> re.match('^(Python) (?=Programming) Language', 'Python Programming Language')
>>>

To make a match, we still need to check for the "Programming" string, even though we've specified it in the look-ahead.

>>> re.match('(Python) (?=Programming)Programming Language',
... 'Python Programming Language')
<_sre.SRE_Match object at 0x1004938a0>
>>>

A negative look-ahead assertion will only match if the pattern defined in the assertion doesn't match. Assuming we actually didn't want the programming language, we could alter our expression as follows:

>>> re.match('(Python) (?!Programming)', 'Python Snake')
>>>

Each look-ahead has a corresponding look-behind. That is, it's also possible to check the value of an input string immediately leading up to the match in question. Unlike look-ahead assertions, though, look-behind checks must be of a fixed width. This means that while we can check for abcd, we could not check for \w{0,4}. Here's a quick example of look-behinds at work:

>>> re.match('123(?<=123)456', '123456')
<_sre.SRE_Match object at 0x100481648>
>>> re.match('123(?<!abc)456', '123456')
<_sre.SRE_Match object at 0x1004816b0>
>>>

The final type of assertion we'll look at is conditional, based on whether a group already exists or not. This is a rather powerful construct, as it makes it possible to build somewhat complex logic directly into a regular expression. Note, however, that doing so is often to the detriment of readability for other programmers. This functionality is new as of Python 2.4.

>>> re.match('^(?P<bracket><)?\w+@\w+\.\w+(?(bracket)>)$', '<jeff@jmcneil.net')
>>> re.match('^(?P<bracket><)?\w+@\w+\.\w+(?(bracket)>)$', '<jeff@jmcneil.net>')
<_sre.SRE_Match object at 0x100493918>
>>> re.match('^(?P<bracket><)?\w+@\w+\.\w+(?(bracket)>)$', 'jeff@jmcneil.net')
<_sre.SRE_Match object at 0x1004938a0>
>>>

This example shows general usage. Here, if an e-mail address begins with a bracket then it must also end with a bracket.

Here is a summary table of the assertion mechanisms and a description of each:

    Assertion             Description
    (?=...)               Positive look-ahead; matches only if the given pattern matches next, consuming no text.
    (?!...)               Negative look-ahead; matches only if the given pattern does not match next.
    (?<=...)              Positive look-behind; matches only if preceded by the (fixed-width) pattern.
    (?<!...)              Negative look-behind; matches only if not preceded by the (fixed-width) pattern.
    (?(id/name)yes|no)    Conditional; matches the yes pattern if the given group matched earlier, and the optional no pattern otherwise.

Performing an 'or' operation

In some cases, you may run into a situation where a position in your input text may hold more than one possible value. To test for situations like this, you can chain regular expressions together via the '|' operator, which is generally equivalent to an 'or'.

>>> re.match('(abc|123|def|cow)', 'abc').groups()
('abc',)
>>> re.match('(abc|123|def|cow)', '123').groups()
('123',)
>>> re.match('(abc|123|def|cow)', '123cow').groups()
('123',)
>>>

Here, you'll see that we match the first possible value as evaluated from left to right. We've also included our alternation within a group. The alternatives themselves may be arbitrarily complex regular expressions.
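
Grouping the alternation matters, as the '|' operator has very low precedence; without parentheses, each alternative swallows the surrounding anchors:

>>> re.match(r'^cat|dog$', 'catfish') is not None
True
>>> re.match(r'^(cat|dog)$', 'catfish') is not None
False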

Pop Quiz – regular expressions

  1. In the HTTP LogProcessing regular expression, we used a \S instead of a \d for a few numeric fields. Why is that the case? Is there another approach? Hint: a value that is not present is indicated by a single dash (-).
  2. Can you think of a use for the (?:…) syntax?
  3. Why would you compile a regular expression versus using a string representation?

Implementing Python-specific elements

Up until now, most of the regular expression information we've covered has been Python-agnostic (with the exception of the (?P...) patterns). Now, let's take a look at some of the more Python-specific elements.

Other search functions

In addition to the re.match function we've been using, Python also makes a few other methods available to us. The big limitation on the match function is that it will only match at the beginning of a string. Here's a quick survey of the other available methods. We'll outline the following methods:

  • search
  • findall
  • finditer
  • split
  • sub

search

The search function will match anywhere within a string and is not limited to the beginning. While it is possible to construct re.match regular expressions that are equivalent to re.search in most cases, it's not always entirely practical.

>>> re.match('[0-9]{4}', 'atl-linux-8423')
>>> re.search('[0-9]{4}', 'atl-linux-8423')
<_sre.SRE_Match object at 0x1005aa988>
>>>

This example illustrates the difference using a machine name. The match function does not match, as the string doesn't begin with the pattern and the expression makes no allowance for leading non-digit characters. A search, on the other hand, scans the entire string for a match, regardless of starting point.

findall and finditer

These are two very useful and very closely related functions. The findall function will iterate through a given text buffer and return all non-overlapping matches in a list. The finditer method performs the same scan, but returns an iterator. The net result is that finditer is more memory efficient.

As a general rule, finditer is more efficient than findall as it doesn't require the construction of a new Python list object.

The following snippet of code extracts hash tags from a string and displays their offsets:

>>> for i in re.finditer(r'#\w+', 'This post is about #eggs, #ham, water #buffalo, and #newts'):
...     print '%02d-%02d: %s' % (i.start(), i.end(), i.group(0))
...
19-24: #eggs
26-30: #ham
38-46: #buffalo
52-58: #newts
>>>

Also, notice how we've used i.group(0) here. Group zero is another way of referring to the entire match.
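
For comparison, findall hands back the same matches as a simple list:

>>> re.findall(r'#\w+', 'This post is about #eggs, #ham, water #buffalo, and #newts')
['#eggs', '#ham', '#buffalo', '#newts']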

split

The split function separates the given text at each match in a regular expression.
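
For example, here we split on a comma plus any surrounding whitespace:

>>> re.split(r'\s*,\s*', 'eggs, ham,  spam')
['eggs', 'ham', 'spam']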

sub

The re.sub function is rather powerful. Given a pattern, it will search a string and replace instances that match the pattern with a replacement value. The replacement value can either be a plain string or a callable (function). If a function is used, that function is in turn called with the Match object from the corresponding match, and the matched text is replaced with the function's return value. The sub function works as follows.

>>> domains = {'oldsite.com': 'newsite.com'}
>>> def repl(m):
... return domains.get(m.group(0), m.group(0))
...
>>> re.sub(r'(\w+\.?){2,}', repl, 'newsite.com oldsite.com yoursite.com')
'newsite.com newsite.com yoursite.com'
>>>

When the given pattern matches a domain name, sub calls repl. The repl function returns the corresponding value if one is found in the dictionary; if one isn't found, we simply return the text we were passed.

This isn't an exhaustive list of all of the methods and attributes on the re module. It would be a good idea for you to read up on all of the details at http://docs.python.org/library/re.html.

Compiled expression objects

We've simply been using the re.match module-level function in most situations, as it's a quick way to execute our test expressions. This works great for small or infrequently matched patterns. However, compilation provides a richer feature set and an inherent performance boost.

A compiled regular expression object supports all of the same functionality as the module-level functions within the re module. The calling convention differs slightly, though, as re.match(pattern, string) becomes regex.match(string). You should also be aware that it's possible to pass compiled objects into all of the re module functions.

In addition, these objects support a few additional methods as they contain state not available using module-level calls.

The match, search, finditer, and findall methods also accept a start position and an end position so that the range of characters they'll attempt to match can be limited. For example, consider the following snippet of code:

>>> import re
>>> re_obj = re.compile(r'[0-9]+')
>>> address = 'Atlanta, GA 30303'
>>> re_obj.search(address)
<_sre.SRE_Match object at 0x100481648>
>>> re_obj.search(address, 0, 10)
>>>

The second attempt to match fails because we limit the search to the substring between positions 0 and 10. In this case, Atlanta, G is searched.

Dealing with performance issues

Using Python's timeit module, we can run a quick performance benchmark for both a compiled and a standard textual regular expression.

(text_processing)$ python -m timeit -s 'import re; m = re.compile("^[0-9]{2}-[abcd]{3}")' 'm.match("05-abc")'
1000000 loops, best of 3: 0.491 usec per loop
(text_processing)$ python -m timeit -s 'import re' 're.match("^[0-9]{2}-[abcd]{3}", "05-abc")'
1000000 loops, best of 3: 1.76 usec per loop
(text_processing)$

In this simple example, we matched two digits, followed by a dash, and a series of three letters in a set. As is evident from the preceding output, compilation reduces the amount of time required to process this match by more than a factor of three.

You should familiarize yourself with Python's timeit module as you work with the language. It provides a very simple method to test and evaluate segments of code for performance comparison, just as we did above. For more information, see http://docs.python.org/library/timeit.html.

Parser flags

The re module exports a number of flags that alter the way the engine processes text. It is possible to pass a series of flags into a module-level function, or as part of the call to re.compile. Multiple flags should be strung together using the bitwise-or operator (|). Of course, flags passed in during a compilation are retained across matches.
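
As a minimal sketch, here we combine re.IGNORECASE and re.MULTILINE and pass the result to a module-level function:

>>> import re
>>> flags = re.IGNORECASE | re.MULTILINE
>>> re.findall(r'^t\w+', 'One\nTwo\nThree', flags)
['Two', 'Three']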

Unicode regular expressions

If you find yourself writing applications for systems that have to work outside of the standard ASCII character set, there are certain things you should pay attention to while crafting regular expression patterns.

First and foremost, Unicode regular expressions should always be flagged as Unicode. This means that (in versions of Python prior to 3.0) they should begin with a u character. Unicode literals then match just as standard ASCII strings do. It is also possible to use a Unicode escape rather than a symbol. For example:
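
Here's a minimal sketch, using Unicode escapes for a short Cyrillic word (the object addresses will vary):

>>> import re
>>> # A Unicode pattern against Unicode data: this matches.
>>> re.match(ur'^\u0441\u043b\u043e\u0432\u043e$', u'\u0441\u043b\u043e\u0432\u043e')
<_sre.SRE_Match object at 0x1004811d0>
>>> # The same word as a UTF-8 byte string fails to match.
>>> re.match('^\xd1\x81\xd0\xbb\xd0\xbe\xd0\xb2\xd0\xbe$', u'\u0441\u043b\u043e\u0432\u043e')
>>>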

Our example string matches perfectly when the expression text is a Unicode object. However, as expected, it fails when we attempt to pass an ASCII string pattern.

Character sets work in a similar fashion:
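
For instance, a range built from Cyrillic code points behaves just like an ASCII range (a sketch, again using escapes):

>>> re.match(ur'^[\u0430-\u044f]+$', u'\u0441\u043b\u043e\u0432\u043e')
<_sre.SRE_Match object at 0x100481648>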

Matching words (\w) is slightly more complicated. Remember, by default, the \w class matches [0-9a-zA-Z_]. If we try to apply it to characters that do not fit that range, we won't match. The trick is to include the re.UNICODE flag as part of our match function. This ensures that Python honors the Unicode database.
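
A short sketch of the difference the flag makes, again with a Cyrillic word:

>>> re.match(r'^\w+$', u'\u0441\u043b\u043e\u0432\u043e')
>>> re.match(r'^\w+$', u'\u0441\u043b\u043e\u0432\u043e', re.UNICODE)
<_sre.SRE_Match object at 0x1004816b0>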

The most important thing to remember if you're testing or searching non-ASCII data is that common tests such as [a-zA-Z] for data elements such as a person's name are not necessarily valid. A good rule of thumb is to stick to the character class escapes (\w, \s) while including the re.UNICODE flag. This ensures that you'll match where you intend to.

When working through regular expressions that support non-ASCII letters, it's a good idea to test them often. A good resource for wide characters is http://www.translit.ru. You can generate UTF-8 Cyrillic data of any length or format required. You can also find complete Unicode escape charts at http://unicode.org/charts/.


The match object

Till now, we've skimmed over a very important part of Python regular expressions: the Match object. A Match object is returned each time a match is found in a string that we've searched. You've seen this in previous examples in lines such as <_sre.SRE_Match object at 0x100492be8>.

Truthfully, much of the match object has already been covered. For example, we've seen the group and the groups functions, which retrieve a matched group or a tuple of all matched groups as a result of a match operation. We've also seen usage of the start and end methods, which return offsets into a string corresponding to where a match begins and where a match ends.

Let's take a look at one more example, though, to solidify the concepts we've touched on thus far.

Processing bind zone files

One of the most common DNS server packages available on the Internet is BIND. BIND relies on a series of DNS zone files, which contain query-to-response mappings, most commonly hostname-to-IP mappings.

These zone files are simply flat text files saved in a directory. On most UNIX distributions, they're located under /var/named. However, Ubuntu in particular places them under /etc/.

In this example, we'll write a script to extract the MX (Mail Exchanger) records from a DNS zone configuration file and display them. MX records are composed of a few fields. Here's a complete example:

domain.com. 900 IN MX 5 mx1.domain.com.
domain.com. 900 IN MX 10 mx1.domain.com.

This details two MX records for the domain.com domain, each with a time-to-live of 900. The record class is IN, for Internet, and the corresponding type is MX. The number following the record type is a weight, or a preference. MX records with a lower preference are preferred. Higher preference records are only used if the lower preference records are not accessible. Finally, a server name is specified.

This sounds straightforward until we throw in a few caveats.

  • The domain may not be present. If it isn't listed, it should default to the same as the previous line.
  • The domain may be @, in which case it should default to the name of the zone. There's a bit more magic to this; more on that later.
  • The TTL may be left off. If the TTL is left off, the zone default should be used. A zone default is specified with a $TTL X line.
  • If a hostname, either the domain or the MX record value itself, doesn't end with a trailing period, we should append the name of the current zone to it.
  • The whole thing can be in uppercase, lowercase, or some random combination of the two.
  • The class may be left out, in which case it defaults to IN.

Time for action – reading DNS records

Let's implement a regular expression-based solution that addresses all of these points and displays sorted MX record values.

  1. First, let's create an example zone file. This is also available as example.zone from the Packt.com FTP site.

    $TTL 86400
    @ IN SOA ns1.linode.com. domains.siteperceptive.com. (
    2010060806
    14400
    14400
    1209600
    86400
    )
    @ NS ns1.linode.com.
    @ NS ns2.linode.com.
    @ NS ns3.linode.com.
    @ NS ns4.linode.com.
    @ NS ns5.linode.com.
    jmcneil.net. IN MX 5 alt2.aspmx.l.google.com.
    jmcneil.net. IN MX 1 aspmx.l.google.com.
    IN MX 10 aspmx2.googlemail.com.
    900 IN MX 10 aspmx3.googlemail.com.
    900 in mx 10 aspmx4.googlemail.com.
    @ 900 IN MX 10 aspmx5.googlemail.com.
    @ 900 MX 5 alt1.aspmx.l.google.com.
    @ A 127.0.0.1
    sandbox IN CNAME jmcneil.net.
    www IN CNAME jmcneil.net.
    blog IN CNAME jmcneil.net.

  2. Now, within the text_beginner package directory, create a subdirectory named dnszone and create an empty __init__.py within it.
  3. Create a file named mx_order.py in that same directory with the following contents.

    import re
    import optparse
    from collections import namedtuple

    # Two different expressions to make for
    # easier formatting.
    ttl_re = r'^(\$TTL\s+(?P<ttl>\d+).*)$'
    mx_re = r'^((?P<dom>@|[\w.]+))?\s+(?P<dttl>\d+)?.*MX\s+' \
            r'(?P<wt>\d+)\s+(?P<tgt>.+).*$'

    # This makes it easier to reference our values and
    # makes code more readable.
    MxRecord = namedtuple('MxRecord', 'wt, dom, dttl, tgt')

    # Compile it up. We'll accept either
    # one of the previous expressions.
    zone_re = re.compile('%s|%s' % (ttl_re, mx_re),
        re.MULTILINE | re.IGNORECASE)

    def zoneify(zone, record):
        """
        Format the record correctly.
        """
        if not record or record == '@':
            record = zone + '.'
        elif not record.endswith('.'):
            record = record + '.%s.' % zone
        return record

    def parse_zone(zone, text):
        """
        Parse a zone for MX records.

        Iterates through a zone file and pulls
        out relevant information.
        """
        ttl = None
        records = []
        for match in zone_re.finditer(open(text).read()):
            ngrps = match.groupdict()
            if ngrps['ttl']:
                ttl = ngrps['ttl']
            else:
                dom = zoneify(zone, ngrps['dom'])
                dttl = ngrps['dttl'] or ttl
                tgt = zoneify(zone, ngrps['tgt'])
                wt = int(ngrps['wt'])
                records.append(
                    MxRecord(wt, dom, dttl, tgt))
        return sorted(records)

    def main(arg_list=None):
        parser = optparse.OptionParser()
        parser.add_option('-z', '--zone', help="Zone Name")
        parser.add_option('-f', '--file', help="Zone File")
        opts, args = parser.parse_args()
        if not opts.zone or not opts.file:
            parser.error("zone and file required")
        results = parse_zone(opts.zone, opts.file)
        print "Mail eXchangers in preference order:"
        print
        for mx in results:
            print "%s %6s %4d %s" % \
                (mx.dom, mx.dttl, mx.wt, mx.tgt)

  4. Next, we're going to change the entry_points dictionary passed into setup() within setup.py to the following:

    entry_points = {
        'console_scripts': [
            'logscan = logscan.cmd:main',
            'mx_order = dnszone.mx_order:main'
        ]
    },

  5. Within the package directory, re-run setup.py develop so it picks up the new entry points.

    (text_processing)$ python setup.py develop

  6. Finally, let's run the application and check the output.

    (text_processing)$ mx_order -z jmcneil.net -f example.zone

What just happened?

We loaded an entire zone file into memory and processed it for mail exchanger records. If we came across a TTL, we used that as our default. If a per-record TTL was specified, we used that as it's more specific. Let's step through the code.

The very first lines, other than our import statements, are the regular expressions we'll use to process this file. In this case, we define two and then join them together around a surrounding | operator. This is to illustrate that it's entirely possible to build regular expressions dynamically.

Next, we compile the union of both singular regular expressions and bind it to an attribute named zone_re. Note that we pass two compilation flags here: re.IGNORECASE and re.MULTILINE. We're going to search in a case-insensitive manner, and we want to process an entire chunk of data at once rather than line by line.

The zoneify function handles a number of our record-naming requirements. Here, we append the zone name wherever applicable.

The parse_zone function attempts to match our regular expression against every line of the file read in. Note that because we've specified re.MULTILINE, ^ will match immediately following any newline and $ will match immediately before one. By default, these only match at the actual beginning and end of a string, respectively.

We loop through all of the results and assign a named-groups dictionary to ngrps. Here, you'll see something slightly strange. Whereas a standard Python dict will raise a KeyError if a missing key is used, the dictionary returned by groupdict maps groups that didn't participate in the match to None.
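
A small example of that behavior, with an optional group that doesn't participate in the match:

>>> m = re.match(r'(?P<a>\d+)?(?P<b>[a-z]+)', 'abc')
>>> m.groupdict()
{'a': None, 'b': 'abc'}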

If a TTL exists then we pull the value out and use that as our default TTL. Otherwise, we parse the record as if it's an MX.

Finally, we assign values to a named tuple and sort. Tuples sort first on their first element, in this case the weight, which is exactly the behavior we're after.
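
For instance:

>>> sorted([(10, 'mx2'), (1, 'mx1'), (5, 'mx3')])
[(1, 'mx1'), (5, 'mx3'), (10, 'mx2')]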

Finally, we wrap the whole thing up in our main function, which we've referenced from setup.py. This is what is called when mx_order is executed on the command line.

The regular expression we used to parse the file is somewhat long; however, we've covered every element included. At this point, you should be able to piece through it and make sense of it. However, there are a few things to note:

  • As we dynamically join the strings together, it's not readily apparent that an MX match includes two empty group matches for the TTL portion of the search. This is one reason the (?P<name>...) naming syntax is helpful: position becomes a non-issue.
  • A semicolon begins a comment, and comments are allowed at the end of a line. We did not account for that here.
  • If a TTL is not set via $TTL and does not appear in the record itself, the value from the DNS SOA record is used. We've not touched on SOA processing here.
  • For more information on BIND and the zone file format, check out http://www.isc.org. The Internet Software Consortium produces and ships the daemon and a collection of resolver tools.

Have a go hero – adding support for $ORIGIN

So, we lied a little bit when we stated that the name of the zone replaces @ and is appended to any name without a trailing dot. Strictly speaking, the value of $ORIGIN is used in both of those situations. If not set, $ORIGIN defaults to the name of the zone.

Syntactically speaking, $ORIGIN is defined exactly like a $TTL is defined. The string "$ORIGIN" appears and is followed immediately by a new DNS name.

Update the preceding code such that if an $ORIGIN name.com appears, subsequent insertions of the zone name use that rather than what we've passed on the command line.

For bonus points, update the regular expressions used, and the zoneify method to avoid using the endswith method of the string objects.

Pop Quiz – understanding the Pythonisms

  1. What is the major difference between the match method and the search method? Where might you prefer one to the other?
  2. What's the benefit to using finditer over findall?
  3. Is there a downside to using Python's named-group feature? Why might you avoid that approach?

Summary

In this article, we looked at both regular expression syntax and the Python-specific implementation details. You should have a solid grasp of Python regular expressions and understand how to implement them.

In this article, we broke a regular expression apart piece by piece in order to help you understand how the elements fit together. We built on that knowledge to parse HTML data, BIND zone files, and even internationalized characters in the Cyrillic alphabet.

Finally, we covered some Python specifics. These are non-portable additions to Python's regular expression implementation.


About the Author :


Jeff McNeil

Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late 90's Internet boom and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better part of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.
