Until this point, we have learned different mechanisms for matching characters while consuming them. A character that has already been matched cannot be compared again, and the only way to match any upcoming character is by consuming it.
The exceptions to this are a number of metacharacters we have studied, the so-called zero-width assertions. These characters indicate positions rather than actual content. For instance, the caret symbol (^) represents the beginning of a line, and the dollar sign ($) represents the end of a line. They just ensure that the position in the input is correct without actually consuming or matching any character.
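The zero-width behavior of these assertions can be observed directly. The following is a minimal sketch using Python's re module; the sample string is an assumption chosen only for illustration. Both matches are empty, so each span starts and ends at the same position:

```python
import re

text = "hello"  # illustrative input, not taken from the chapter

# ^ and $ assert a position; neither consumes any character.
start = re.search(r"^", text)
end = re.search(r"$", text)

print(start.span())  # (0, 0)
print(end.span())    # (5, 5)
```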
A more powerful kind of zero-width assertion is look around, a mechanism that makes it possible to match a certain preceding (look behind) or subsequent (look ahead) value relative to the current position. Look around constructions perform the assertion without consuming characters; they just return a positive or negative result of the match.
The look around mechanism...
The first type of look around mechanism that we are going to study is the look ahead mechanism. It tries to match the subexpression passed as an argument against the text ahead of the current position. The zero-width nature of the two look around operations renders them complex and difficult to understand.
As we know from the previous section, it is represented as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block: (?=regex).
Let's start tackling this by comparing the results of two similar regular expressions. We can recall that in Chapter 1, Introducing Regular Expressions, we matched the expression /fox/ to the phrase The quick brown fox jumps over the lazy dog. Let's also apply the expression /(?=fox)/ to the same input:
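The comparison can be sketched as follows, assuming Python's re module and its search function; the variable names are ours. The plain pattern consumes the three characters of fox, while the look ahead only asserts that fox comes next and returns an empty match at that position:

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Plain literal: the match covers the characters f, o, x.
plain = re.search(r"fox", text)

# Look ahead: the match is empty and sits right before "fox".
ahead = re.search(r"(?=fox)", text)

print(plain.span())  # (16, 19)
print(ahead.span())  # (16, 16)
```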
We just searched the literal fox in the input string, and just as expected we have...
Look around and substitutions
The zero-width nature of the look around operation is especially useful in substitutions. Thanks to it, we are able to perform transformations that would otherwise be extremely complex to read and write.
One typical example of look ahead and substitutions would be the conversion of a number composed of just numeric characters, such as 1234567890, into a comma-separated number, that is, 1,234,567,890.
In order to write this regular expression, we will need a strategy to follow. What we want to do is group the numbers in blocks of three that will then be substituted by the same group plus a comma character.
We can easily start with an almost naive approach with the following highlighted regular expression:
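One plausible form of such a naive attempt is sketched below with Python's re module. The pattern \d{1,3} is an assumption for illustration; it simply takes up to three digits at a time, starting from the left:

```python
import re

# Naive grouping: up to three digits per block, scanned left to right.
naive = re.compile(r"\d{1,3}")

blocks = naive.findall("1234567890")
print(blocks)  # ['123', '456', '789', '0']
```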
We have failed in this attempt. We are effectively grouping in blocks of three numbers, but they should be taken from the right to...
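Counting the blocks from the right can be expressed with a look ahead. The following is a minimal sketch, assuming Python's re module; the pattern is one possible formulation and not necessarily the chapter's own. A block of one to three digits is matched only when it is followed by one or more complete groups of three digits and then no further digit, and the substitution appends a comma to each such block:

```python
import re

# \d{1,3}        a block of one to three digits
# (?=(?:\d{3})+  followed by one or more full groups of three digits
# (?!\d))        after which no more digits remain
pattern = re.compile(r"\d{1,3}(?=(?:\d{3})+(?!\d))")

# \g<0> refers to the whole match, to which a comma is appended.
result = pattern.sub(r"\g<0>,", "1234567890")
print(result)  # 1,234,567,890
```

Because the look ahead consumes nothing, the digits it inspects remain available to be matched as the next blocks.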
We could safely define look behind as the opposite operation to look ahead. It tries to match behind the subexpression passed as an argument. It has a zero-width nature as well, and therefore, it won't be part of the result.
It is represented as an expression preceded by a question mark, a less-than sign, and an equals sign, ?<=, inside a parenthesis block: (?<=regex).
We could, for instance, use it in an example similar to the one we used in negative look ahead to find just the surname of someone named John McLane. To accomplish this, we could write a look behind like the following:
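A minimal sketch of such a look behind, assuming Python's re module; the sample sentence is hypothetical. Only the surname is part of the match, and a surname preceded by a different first name is ignored:

```python
import re

# Hypothetical input used only for illustration.
text = "I would rather go out with John McLane than with Bruce McLane"

# Match "McLane" only where it is preceded by "John" and a whitespace.
pattern = re.compile(r"(?<=John\s)McLane")

matches = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('McLane', (32, 38))]
```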
With the preceding look behind, we requested the regex engine to match only positions that are preceded by John and a whitespace to then consume McLane...
Another beneficial use of look around constructions is inside groups. Typically, when groups are used, a very specific result has to be matched and returned inside the group. As we don't want to pollute the groups with information that is not required, among other potential options we can leverage look around as a favorable solution.
Let's say that we need to get a comma-separated value in which the first part is a name and the second is a value. The format would be similar to this:
As we learned in Chapter 3, Grouping, we can easily write an expression that will get these two values like the following:
However, we only want to match when the failure is not an authentication failure. We can accomplish this with the addition of a negative look behind. It will look like this:
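The idea can be sketched as follows, assuming Python's re module. The log lines and both patterns are hypothetical, written only to illustrate the technique: the first pattern captures any failure message in its group, while the second adds a negative look behind so that a failed preceded by authentication and a whitespace is rejected:

```python
import re

# Hypothetical log lines; the exact format is an assumption.
auth_line = "INFO 2013-09-17 12:13:44,487 authentication failed"
other_line = "INFO 2013-09-17 12:13:44,487 database connection failed"

# Without look around: every failure message lands in the group.
plain = re.compile(r"\w+\s[\d-]+\s[\d:,]+\s(.*\sfailed)")

# With a negative look behind: "failed" must not follow "authentication ".
filtered = re.compile(r"\w+\s[\d-]+\s[\d:,]+\s(.*(?<!authentication\s)failed)")

plain_auth = plain.search(auth_line)
filtered_auth = filtered.search(auth_line)
filtered_other = filtered.search(other_line)

print(plain_auth.group(1))      # authentication failed
print(filtered_auth)            # None
print(filtered_other.group(1))  # database connection failed
```

The look behind keeps the group clean: the check on authentication decides whether the match succeeds, but it adds nothing to the captured text.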
Once we put this in Python's console, we will...
In this chapter, we learned the concept of zero-width assertions and how they can be useful for finding an exact piece of text without interfering with the content of the result.
We have also learned how to leverage the four types of look around mechanisms: positive look ahead, negative look ahead, positive look behind, and negative look behind.
We also reviewed, with special interest, the limitation that both types of look behind have with variable-width assertions.
With this, we conclude the travel through the basic and advanced techniques around regular expressions. Now, we are ready to focus on performance tuning in the next chapter.