Understanding the Python regex engine

For Python developers, this concise and down-to-earth guide to regular expressions is all you need to gain vital new knowledge. From a theoretical overview to Python specifics, it explains everything in crystal clear language.

(For more resources related to this topic, see here.)

These are the most common characteristics of the algorithm:

  • It supports "lazy quantifiers" such as *?, +?, and ??.
  • It matches the first coincidence, even though there are longer ones in the string.

    >>>re.search("engineer | engineering", "engineering").group()'engineer'

    This also means that order is important.

  • The algorithm tracks only one transition at one step, which means that the engine checks one character at a time.
  • Backreferences and capturing parentheses are supported.
  • Backtracking is the ability to remember the last successful position so that it can go back and retry if needed
  • In the worst case, complexity is exponential O(Cn). We'll see this later in Backtracking.

Backtracking

Backtracking allows going back and repeating the different paths of the regular expression. It does so by remembering the last successful position, this applies to alternation and quantifiers, let’s see an example:

Backtracking

As we see in the image the regex engine tries to match one character at a time until it fails and starts again with the following path it can retry.

The regex used in the image is the perfect example of the importance of how the regex is built, in this case the expression can be rebuild as spa (in | niard), so that the regex engine doesn’t have to go back up to the start of the string in order to retry the second alternative.

This leads us to what is called catastrophic backtracking a well-known problem with backtracking that can give you several problems, ranging from slow regex to a crash with a stack overflow.

In the previous example, you can see that the behavior grows not only with the input but also with the different paths in the regex, that’s why the algorithm is exponential O(Cn), with this in mind it’s easy to understand why we can end up with a stack overflow.

The problem arises when the regex fails to match the String. Let’s benchmark a regex with technique we’ve seen previously, so we can understand the problem better.

First let’s try a simple regex:

>>> def catastrophic(n): print "Testing with %d characters" %n pat = re.compile('(a+)+c') text = "%s" %('a' * n) pat.search(text)

As you can see the text we’re trying to match it’s always going to fail due to there is no c at the end. Let’s test it with different inputs.

>>> for n in range(20, 30): test(catastrophic, n) Testing with 20 characters The function catastrophic lasted: 0.130457 Testing with 21 characters The function catastrophic lasted: 0.245125 …… The function catastrophic lasted: 14.828221 Testing with 28 characters The function catastrophic lasted: 29.830929 Testing with 29 characters The function catastrophic lasted: 61.110949

The behavior of this regex looks like quadratic. But why? what’s happening here? The problem is that (a+) starts greedy, so it tries to get as many a’s as possible and after that it fails to match the (a+)+, that is, it backtracks to the second a, and continue consuming a’s until it fails to match c, when it tries again (backtrack) the whole process starting with the second a.

Let’s see another example, in this case with an exponential behavior:

>>> def catastrophic(n): print "Testing with %d characters" %n pat = re.compile('(x+)+(b+)+c') text = 'x' * n text += 'b' * n pat.search(text) >>> for n in range(12, 18): test(catastrophic, n) Testing with 12 characters The function catastrophic lasted: 1.035162 Testing with 13 characters The function catastrophic lasted: 4.084714 Testing with 14 characters The function catastrophic lasted: 16.319145 Testing with 15 characters The function catastrophic lasted: 65.855182 Testing with 16 characters The function catastrophic lasted: 276.941307

As you can see the behavior is exponential, which can lead to a catastrophic scenarios. And finally let’s see what happen when regex has a match.

>>> def non_catastrophic(n): print "Testing with %d characters" %n pat = re.compile('(x+)+(b+)+c') text = 'x' * n text += 'b' * n text += 'c' pat.search(text) >>> for n in range(12, 18): test(non_catastrophic, n) Testing with 10 characters The function catastrophic lasted: 0.000029 …… Testing with 19 characters The function catastrophic lasted: 0.000012

Optimization recommendations

In the following sections we will find a number of recommendations that could be used to apply to improve regular expressions.

The best tool will always be the common sense, and even following these recommendations common sense will need to be used. It has to be understood when the recommendation is applicable and when not. For instance the recommendation don’t be greedy cannot be used in the 100% of the cases.

Reuse compiled patterns

To use a regular expression we have to convert it from the string representation to a compiled form as RegexObject.

This compilation takes some time. If instead of using the compile function we are using the rest of the methods to avoid the creation of the RegexObject, we should understand that the compilation is executed anyway and a number of compiled RegexObject are cached automatically.

However, when we are compiling that cache won’t back us. Every single compile execution will consume an amount of time that perhaps could be negligible for a single execution, but it’s definitely relevant if many executions are performed.

Extract common parts in alternation

Alternation is always a performance risk point in regular expressions. When using them in Python, and therefore in a sort of NFA implementation, we should extract any common part outside of the alternation.

For instance if we have /(Hello⇢World|Hello⇢Continent|Hello⇢Country,)/, we could easily extract Hello⇢ having the following expression /Hello⇢(World|Con
tinent|Country)/
. This will make our engine to just check Hello⇢ once, and not going back and recheck for each possibility.

Shortcut the alternation

Ordering in alternation is relevant; each of the different options present in the alternation will be checked one by one, from the left to the right. This can be used in favor of performance.

If we place the more likely options at the beginning of the alternation, more checks will mark the alternation as matched sooner.

For instance, we know that the more common colors of cars are white and black. If we are writing a regular expression accepting some colors, we should put white and black first, as those are the more likely to appear. This is: /(white|black|red|blue|green)/.

For the rest of the elements, if they have the very same odds of appearing, if could be favorable to put the shortest ones before the longer ones.

Use non capturing groups when appropriate

Capturing groups will consume some time per each group defined in an expression. This time is not very important but is still relevant if we are executing a regular expression many times.

Sometimes we are using groups but we might not be interested in the result. For instance when using alternation. If that is the case we can save some execution time to the engine by marking that group as non-capturing. This is: (?:person|company)..

Be specific

When the patterns we define are very specific, the engine can help us performing quick integrity checks before the actual pattern matching is executed.

For instance, if we pass to the engine the expression /\w{15}/ to be matched against the text hello, the engine could decide to check if the input string is actually at least 15 characters long instead of matching the expression.

Don’t be greedy

The quantifiers and we learnt the difference between greedy and reluctant quantifiers. We also found that the quantifiers are greedy by default.

What does this mean to performance? It means that the engine will always try to catch as many characters as possible and then reducing the scope step by step until it's done. This could potentially make the regular expression slow if the match is typically short. Keep in mind, however, this is only applicable if the match is usually short.

Summary

In this article, we understood how to see the engine working behind the scenes.

We learned some theory of the engine design and how it's easy to fall in a common pitfall—the catastrophic backtracking. Finally, we reviewed different general recommendations to improve the performance of our regular expressions.

Resources for Article:


Further resources on this subject:


Books to Consider

comments powered by Disqus
X

An Introduction to 3D Printing

Explore the future of manufacturing and design  - read our guide to 3d printing for free