
You're reading from Mastering Python Regular Expressions

Product type Book

Published in Feb 2014

Publisher Packt

ISBN-13 9781783283156

Pages 110 pages

Edition 1st Edition

Languages

Python

Concepts

Programming Language

Table of Contents (12) Chapters

Mastering Python Regular Expressions

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Introducing Regular Expressions

Regular Expressions with Python

Grouping

Look Around

Performance of Regular Expressions

Index

Chapter 4. Look Around

Until this point, we have learned different mechanisms for matching characters while consuming them. A character that has already been matched cannot be compared again, and the only way to match any upcoming character is by consuming it.

The exceptions to this are a number of metacharacters we have studied, the so-called zero-width assertions. These characters indicate positions rather than actual content. For instance, the caret symbol (^) represents the beginning of a line, and the dollar sign ($) represents the end of a line. They just ensure that the position in the input is correct without actually consuming or matching any character.
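As a small illustration of what zero-width means, the following sketch searches for ^ and $ on their own. The input sentence is borrowed from the examples later in this chapter; the variable names are only illustrative.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# '^' and '$' assert a position; the text they "match" is always empty.
start = re.search(r'^', text)
end = re.search(r'$', text)

print(start.span(), repr(start.group()))  # (0, 0) ''
print(end.span(), repr(end.group()))      # start and end are both len(text)
```

In both cases the match object has the same start and end offsets, which is exactly what it means for an assertion not to consume any character.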

A more powerful kind of zero-width assertion is look around, a mechanism that makes it possible to match a value that comes before (look behind) or after (look ahead) the current position. Look around assertions check the input without consuming characters; they just return a positive or negative result for the match.
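To make this concrete, here is a minimal sketch of the four look around forms that Python's re module supports. The sample sentence and the spans helper are assumptions made only for this illustration.

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "John McLane, John Smith, Jane Smith"

def spans(regex):
    """Return the (start, end) offsets of every match of regex in text."""
    return [m.span() for m in re.finditer(regex, text)]

# Positive look ahead: "John" only when followed by " McLane".
print(spans(r'John(?=\sMcLane)'))   # [(0, 4)]
# Negative look ahead: "John" only when NOT followed by " McLane".
print(spans(r'John(?!\sMcLane)'))   # [(13, 17)]
# Positive look behind: "Smith" only when preceded by "John ".
print(spans(r'(?<=John\s)Smith'))   # [(18, 23)]
# Negative look behind: "Smith" only when NOT preceded by "John ".
print(spans(r'(?<!John\s)Smith'))   # [(30, 35)]
```

Note that every reported span covers only the text outside the look around group; the asserted neighbor is checked but never included in the match.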

The look around mechanism...

Look ahead

The first type of look around mechanism that we are going to study is the look ahead mechanism. It tries to match the subexpression passed as an argument against the text ahead of the current position. The zero-width nature of the two look around operations renders them complex and difficult to understand.

As we know from the previous section, it is represented as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block: (?=regex).

Let's start tackling this by comparing the result of the two similar regular expressions. We can recall that in Chapter 1, Introducing Regular Expressions, we matched the expression /fox/ to the phrase The quick brown fox jumps over the lazy dog. Let's also apply the expression /(?=fox)/ to the same input:

>>> import re
>>> pattern = re.compile(r'fox')
>>> result = pattern.search("The quick brown fox jumps over the lazy dog")
>>> print(result.start(), result.end())
16 19

We just searched the literal fox in the input string, and just as expected we have...
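The comparison between /fox/ and /(?=fox)/ described above can be sketched as follows. This is an illustrative snippet written for Python 3, and the variable names are our own.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

plain = re.search(r'fox', text)      # consumes the three characters
ahead = re.search(r'(?=fox)', text)  # only asserts that "fox" comes next

print(plain.start(), plain.end())  # 16 19
print(ahead.start(), ahead.end())  # 16 16
print(repr(ahead.group()))         # ''
```

Both expressions find the same position, but the look ahead version reports an empty match: it stops at index 16 without consuming fox.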

Look around and substitutions

The zero-width nature of the look around operation is especially useful in substitutions. Thanks to it, we are able to perform transformations that would otherwise be extremely complex to read and write.

One typical example of look ahead and substitutions would be the conversion of a number composed of just numeric characters, such as 1234567890, into a comma-separated number, that is, 1,234,567,890.

In order to write this regular expression, we will need a strategy to follow. What we want to do is group the numbers in blocks of three that will then be substituted by the same group plus a comma character.

We can start with an almost naive approach, using the following regular expression:

>>> pattern = re.compile(r'\d{1,3}')
>>> pattern.findall("The number is: 12345567890")
['123', '455', '678', '90']

We have failed in this attempt. We are effectively grouping in blocks of three digits, but they should be taken from the right to...
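One possible way to take the blocks from the right is to let a look ahead require that every matched block is followed by complete groups of three digits and then by no further digit. The following is a minimal sketch under that assumption; the function name add_thousands_separators is hypothetical.

```python
import re

# Match 1 to 3 digits, but only when what follows is one or more complete
# blocks of three digits and then no further digit. The look ahead checks
# this without consuming the blocks, so they remain available to be matched.
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')

def add_thousands_separators(text):
    # \g<0> refers to the whole match; we re-emit it followed by a comma.
    return pattern.sub(r'\g<0>,', text)

print(add_thousands_separators("1234567890"))
# 1,234,567,890
print(add_thousands_separators("The number is: 12345567890"))
# The number is: 12,345,567,890
```

Because the look ahead is zero-width, the substitution only ever replaces the leading block, and the trailing blocks are examined again on the next match.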

Look behind

We could safely define look behind as the opposite operation to look ahead. It tries to match behind the subexpression passed as an argument. It has a zero-width nature as well, and therefore, it won't be part of the result.

It is represented as an expression preceded by a question mark, a less-than sign, and an equals sign, ?<=, inside a parenthesis block: (?<=regex).

We could, for instance, use it in an example similar to the one we used in negative look ahead to find just the surname of someone named John McLane. To accomplish this, we could write a look behind like the following:

>>> pattern = re.compile(r'(?<=John\s)McLane')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print(i.start(), i.end())
...
32 38

With the preceding look behind, we requested the regex engine to match only positions that are preceded by John and a whitespace to then consume McLane...

Look around and groups

Another beneficial use of look around constructions is inside groups. Typically, when groups are used, a very specific result has to be matched and returned inside the group. As we don't want to pollute the groups with information that is not required, look around is one convenient option among others.

Let's say that we need to extract the message from a log entry made up of a level, a date, a time, and the message itself. The format would be similar to this:

INFO 2013-09-17 12:13:44,487 authentication failed

As we learned in Chapter 3, Grouping, we can easily write an expression that will capture the failure message, like the following:

/\w+\s[\d-]+\s[\d:,]+\s(.*\sfailed)/

However, we only want to match when the failure is not an authentication failure. We can accomplish this with the addition of a negative look behind. It will look like this:

/\w+\s[\d-]+\s[\d:,]+\s(.*(?<!authentication\s)failed)/

Once we put this in Python's console, we will...
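As a hedged sketch, the two expressions above can be compared on a pair of log lines. The first line is the one shown earlier; the second line and the variable names are assumptions added for illustration.

```python
import re

lines = [
    "INFO 2013-09-17 12:13:44,487 authentication failed",
    # Hypothetical second entry, not taken from the text above.
    "INFO 2013-09-17 12:13:44,487 something else failed",
]

any_failure = re.compile(r'\w+\s[\d-]+\s[\d:,]+\s(.*\sfailed)')
other_failure = re.compile(
    r'\w+\s[\d-]+\s[\d:,]+\s(.*(?<!authentication\s)failed)')

for line in lines:
    print(any_failure.findall(line), other_failure.findall(line))
# ['authentication failed'] []
# ['something else failed'] ['something else failed']
```

The negative look behind rejects the authentication entry, while the captured group for the other entry stays free of anything we did not ask for.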

Summary

In this chapter, we learned the concept of zero-width assertions and how they can be useful for finding an exact piece of text without interfering with the content of the result.

We have also learned how to leverage the four types of look around mechanisms: positive look ahead, negative look ahead, positive look behind, and negative look behind.

We also reviewed, with special interest, the limitation that both types of look behind share: in Python's re module, they cannot contain variable-width patterns.
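The look behind limitation can be observed directly in Python's re module, which rejects a look behind whose subexpression can match strings of different lengths. The following is a minimal sketch; the patterns are illustrative.

```python
import re

# A fixed-width look behind ("John" plus exactly one whitespace) compiles.
fixed = re.compile(r'(?<=John\s)McLane')
print(fixed.search("John McLane").span())  # (5, 11)

# A variable-width look behind ("John" plus one or more whitespaces)
# is rejected by the re module when the pattern is compiled.
try:
    re.compile(r'(?<=John\s+)McLane')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)
```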

With this, we conclude our journey through the basic and advanced techniques of regular expressions. Now, we are ready to focus on performance tuning in the next chapter.