Understanding the Python regex engine

by Félix López Víctor Romero | February 2014 | Open Source

In this article by Félix López and Víctor Romero, the authors of Mastering Python Regular Expressions, we are going to learn about backtracking in Python. Python's re module uses a backtracking regular expression engine. Although the well-known book Mastering Regular Expressions by Jeffrey E. F. Friedl classifies it as a Nondeterministic Finite Automaton (NFA) engine, Tim Peters (https://mail.python.org/pipermail/tutor/2006-January/044335.html) notes that it is not purely NFA.


These are the most common characteristics of the algorithm:

  • It supports "lazy quantifiers" such as *?, +?, and ??.
  • It matches the first alternative that succeeds, even though there are longer matches in the string.

    >>> re.search("engineer|engineering", "engineering").group()
    'engineer'

    This also means that order is important.

  • The algorithm tracks only one transition at each step, which means that the engine checks one character at a time.
  • Backreferences and capturing parentheses are supported.
  • Backtracking is the ability to remember the last successful position so that the engine can go back and retry if needed.
  • In the worst case, complexity is exponential, O(C^n). We'll see this later in the Backtracking section.
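Two of the characteristics above, lazy quantifiers and backreferences, can be verified with a quick snippet (our own illustration, not from the book):

```python
import re

# Greedy "+" consumes as much as possible; the lazy "+?" variant stops
# as soon as the rest of the pattern can match.
greedy = re.search(r'a+', 'aaa').group()
lazy = re.search(r'a+?', 'aaa').group()

# A backreference (\1) reuses the text captured by the first group.
doubled = re.search(r'(\w+) \1', 'hello hello world').group()

print(greedy, lazy, doubled)
```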

Backtracking

Backtracking allows the engine to go back and try the different paths of the regular expression. It does so by remembering the last successful position; this applies to both alternation and quantifiers. Let's see an example:

[Figure: the regex engine backtracking while matching the text "spaniard"]

As we can see in the image, the regex engine tries to match one character at a time until it fails, and then starts again with the next path it can try.

The regex used in the image is a perfect example of why the way a regex is built matters: in this case, the expression can be rewritten as spa(in|niard), so that the regex engine doesn't have to go back to the start of the string in order to retry the second alternative.
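A small sketch of that rewrite (the pattern names are ours), showing that the two forms accept exactly the same words:

```python
import re

# 'spain|spaniard' re-reads the "spa" prefix for the second alternative
# after the first one fails; 'spa(?:in|niard)' reads it only once.
original = re.compile(r'spain|spaniard')
rewritten = re.compile(r'spa(?:in|niard)')

for word in ('spain', 'spaniard'):
    assert original.fullmatch(word).group() == rewritten.fullmatch(word).group()
print("both patterns accept the same words")
```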

This leads us to what is called catastrophic backtracking, a well-known problem with backtracking that can cause anything from a slow regex to a crash with a stack overflow.

In the previous example, you can see that the work grows not only with the input but also with the different paths in the regex; that's why the algorithm is exponential, O(C^n). With this in mind, it's easy to understand why we can end up with a stack overflow.

The problem arises when the regex fails to match the string. Let's benchmark a regex with the technique we've seen previously, so that we can understand the problem better.

First let’s try a simple regex:

>>> def catastrophic(n):
...     print("Testing with %d characters" % n)
...     pat = re.compile('(a+)+c')
...     text = "%s" % ('a' * n)
...     pat.search(text)

As you can see, the text we're trying to match is always going to fail, since there is no c at the end. Let's test it with different inputs:

>>> for n in range(20, 30):
...     test(catastrophic, n)
Testing with 20 characters
The function catastrophic lasted: 0.130457
Testing with 21 characters
The function catastrophic lasted: 0.245125
...
The function catastrophic lasted: 14.828221
Testing with 28 characters
The function catastrophic lasted: 29.830929
Testing with 29 characters
The function catastrophic lasted: 61.110949

The behavior of this regex is exponential: the time roughly doubles with each extra character. But why? What's happening here? The problem is that (a+) starts out greedy, so it tries to consume as many a's as possible and then fails to match c. It backtracks to the second a and continues consuming a's until it fails to match c again, and then it retries (backtracks) the whole process starting with the next a, and so on, trying every possible way of splitting the a's between the repetitions of (a+)+.

Let's see another example, in this case with even steeper exponential behavior:

>>> def catastrophic(n):
...     print("Testing with %d characters" % n)
...     pat = re.compile('(x+)+(b+)+c')
...     text = 'x' * n
...     text += 'b' * n
...     pat.search(text)
>>> for n in range(12, 18):
...     test(catastrophic, n)
Testing with 12 characters
The function catastrophic lasted: 1.035162
Testing with 13 characters
The function catastrophic lasted: 4.084714
Testing with 14 characters
The function catastrophic lasted: 16.319145
Testing with 15 characters
The function catastrophic lasted: 65.855182
Testing with 16 characters
The function catastrophic lasted: 276.941307

As you can see, the behavior is exponential, which can lead to catastrophic scenarios. Finally, let's see what happens when the regex has a match:

>>> def non_catastrophic(n):
...     print("Testing with %d characters" % n)
...     pat = re.compile('(x+)+(b+)+c')
...     text = 'x' * n
...     text += 'b' * n
...     text += 'c'
...     pat.search(text)
>>> for n in range(12, 18):
...     test(non_catastrophic, n)
Testing with 10 characters
The function catastrophic lasted: 0.000029
...
Testing with 19 characters
The function catastrophic lasted: 0.000012
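One way to defuse the failing case, a rewrite of ours rather than the book's, is to drop the nested quantifiers: `(x+)+(b+)+c` recognises exactly the same strings as the linear `x+b+c`, which fails fast:

```python
import re
import time

# Nested quantifiers such as (x+)+ give the engine exponentially many
# ways to split the x's; the equivalent linear pattern has only one.
safe = re.compile(r'x+b+c')

text = 'x' * 30 + 'b' * 30   # no trailing 'c', so the match must fail
start = time.time()
result = safe.search(text)
print(result, "in %f seconds" % (time.time() - start))
```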

Optimization recommendations

In the following sections, we will find a number of recommendations that can be applied to improve regular expressions.

The best tool will always be common sense, and it will be needed even while following these recommendations: we have to understand when a recommendation is applicable and when it is not. For instance, the recommendation don't be greedy cannot be applied in 100% of cases.

Reuse compiled patterns

To use a regular expression, we have to convert it from its string representation to a compiled form, a RegexObject.

This compilation takes some time. If we use the module-level functions instead of the compile function to avoid creating a RegexObject, we should understand that the compilation is executed anyway, and a number of compiled RegexObjects are cached automatically.

However, when we compile repeatedly, that cache won't help us. Every single compile execution consumes an amount of time that is perhaps negligible for a single execution, but definitely relevant if many executions are performed.
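A quick benchmark sketch (the pattern and iteration count are arbitrary choices of ours) contrasting a reused compiled pattern with paying the cache lookup on every call:

```python
import re
import timeit

pat = re.compile(r'\d{4}-\d{2}-\d{2}')

def with_compiled():
    # Reuses the already compiled RegexObject directly.
    return pat.match('2014-02-21')

def without_compiled():
    # Compiles (or at least looks up the cached compilation) every call.
    return re.match(r'\d{4}-\d{2}-\d{2}', '2014-02-21')

print("compiled:  %f" % timeit.timeit(with_compiled, number=10000))
print("module fn: %f" % timeit.timeit(without_compiled, number=10000))
```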

Extract common parts in alternation

Alternation is always a performance risk point in regular expressions. When using it in Python, and therefore in a sort of NFA implementation, we should extract any common parts outside of the alternation.

For instance, if we have /(Hello⇢World|Hello⇢Continent|Hello⇢Country)/, we can easily extract Hello⇢, obtaining the following expression: /Hello⇢(World|Continent|Country)/. This makes the engine check Hello⇢ just once, instead of going back and rechecking it for each alternative.
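The two forms can be compared directly; a short sketch (using plain spaces in place of the ⇢ symbol):

```python
import re

repeated = re.compile(r'Hello World|Hello Continent|Hello Country')
factored = re.compile(r'Hello (?:World|Continent|Country)')

# Both accept exactly the same greetings, but the factored pattern
# reads the shared "Hello " prefix only once per attempt.
for text in ('Hello World', 'Hello Continent', 'Hello Country'):
    assert repeated.fullmatch(text).group() == factored.fullmatch(text).group()
print("equivalent")
```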

Shortcut the alternation

Ordering in alternation is relevant; each of the options present in the alternation will be checked one by one, from left to right. This can be used in favor of performance.

If we place the most likely options at the beginning of the alternation, more inputs will mark the alternation as matched sooner.

For instance, we know that the most common colors of cars are white and black. If we are writing a regular expression accepting some colors, we should put white and black first, as those are the most likely to appear. This is: /(white|black|red|blue|green)/.

For the rest of the elements, if they have the very same odds of appearing, it can be favorable to put the shorter ones before the longer ones.
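For illustration, with the likely colors first a match on white succeeds after checking a single alternative, while green still walks the whole list (an example of ours):

```python
import re

colors = re.compile(r'white|black|red|blue|green')

# 'white' is accepted after testing one alternative; 'green' only after
# the four more common colors have been tried and rejected.
print(colors.match('white').group())
print(colors.match('green').group())
```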

Use non capturing groups when appropriate

Capturing groups consume some time for each group defined in an expression. This time is not very significant, but it is still relevant if we are executing a regular expression many times.

Sometimes we use groups without being interested in their result, for instance when using alternation. If that is the case, we can save the engine some execution time by marking the group as non-capturing, that is: (?:person|company).
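The difference is easy to see in the match object; a short sketch:

```python
import re

capturing = re.compile(r'(person|company) name')
non_capturing = re.compile(r'(?:person|company) name')

m1 = capturing.search('person name')
m2 = non_capturing.search('person name')

# The capturing version stores the matched alternative; the
# non-capturing one matches the same text but records nothing.
print(m1.groups())
print(m2.groups())
```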

Be specific

When the patterns we define are very specific, the engine can help us by performing quick integrity checks before the actual pattern matching is executed.

For instance, if we pass the engine the expression /\w{15}/ to be matched against the text hello, it could decide to check whether the input string is actually at least 15 characters long instead of matching the expression.
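When the engine cannot infer such a check itself, we can apply the same idea manually; a guard sketch of our own:

```python
import re

pat = re.compile(r'\w{15}')

def match_if_plausible(text):
    # A \w{15} match needs at least 15 characters, so skip the regex
    # machinery entirely when that cheap integrity check already fails.
    if len(text) < 15:
        return None
    return pat.search(text)

print(match_if_plausible('hello'))
print(match_if_plausible('abcdefghijklmnop'))
```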

Don’t be greedy

Earlier we learned the difference between greedy and reluctant quantifiers, and we found that quantifiers are greedy by default.

What does this mean for performance? It means that the engine will always try to consume as many characters as possible and then reduce the scope step by step until it's done. This can make the regular expression slow if the match is typically short. Keep in mind, however, that this is only applicable if the match is usually short.
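The classic illustration of this trade-off is matching a tag-like fragment (our example):

```python
import re

text = '<b>bold</b> and more text'

# Greedy .* runs to the end of the string and backtracks to the last
# '>', while the reluctant .*? stops at the first '>' it can accept.
greedy = re.search(r'<.*>', text).group()
lazy = re.search(r'<.*?>', text).group()

print(greedy)
print(lazy)
```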

Summary

In this article, we saw how the engine works behind the scenes.

We learned some of the theory behind the engine's design and how easy it is to fall into a common pitfall: catastrophic backtracking. Finally, we reviewed different general recommendations to improve the performance of our regular expressions.

About the Author :


Félix López

Félix López started his career in web development before moving to software in the currency exchange market, where there were a lot of new security challenges. Later, he spent four years creating an IDE to develop games for hundreds of different mobile device OS variations, in addition to creating more than 50 games. Before joining ShuttleCloud, he spent two years working on applications with sensor networks, Arduino, ZigBee, and custom hardware. One example is an application that detects the need for streetlight utilities in major cities based on existing atmospheric brightness. His first experience with Python was seven years ago, when he used it for small scripts, web scraping, and so on. Since then, he has used Python for almost all his projects: websites, standalone applications, and so on. Nowadays, he uses Python along with RabbitMQ to integrate services.

He's currently working for ShuttleCloud, a U.S.-based startup whose technology is used by institutions such as Stanford and Harvard, and companies such as Google.

Víctor Romero

Víctor Romero currently works as a solutions architect at MuleSoft, Inc. He started his career in the dotcom era and has been a regular contributor to open source software ever since. Originally from the sunny city of Malaga, Spain, his international achievements include integrating the applications present in the cloud storage of a skyscraper in New York City, and creating networks for the Italian government in Rome.
