Chapter 5. Performance of Regular Expressions
Up to this point, we worried about learning how to leverage a feature or obtain a result without caring too much about how fast the process would be. Our only goals were correctness and readability.
In this chapter, we are going to steer towards a completely different concern—performance. However, we will find that often an improvement in performance will result in degradation of readability. When we are modifying something to make it faster, we are probably making it easier for the machine to understand, and therefore, we are probably compromising on human readability.
On December 4, 1974, Donald Knuth, the author of the famous book The Art of Computer Programming, wrote the paper Structured Programming with go-to
statements. This well-known quote is extracted from the paper:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually...
Benchmarking regular expressions with Python
In order to benchmark our regex, we're going to measure the time a regex takes to execute. It's important to test them with different inputs, because with small inputs almost every regex is fast enough. However, with longer ones it can be a completely different beast, as we will see later in the section Backtracking.
First, we're going to create a small function to help us with this task:
So, we can test a regex using the following code:
Python comes with a built-in profiler http://docs.python.org/2/library/profile.html that we can also use to measure the time and the...
Among the different tools available for better productivity while writing regular expressions, RegexBuddy (http://www.regexbuddy.com/) by Just Great Software Co. Ltd. is outstanding.
The mastermind behind Just Great Software is Jan Goyvaerts, the same person who is behind Regular-Expressions.info (http://www.regular-expressions.info/), one of the most well-known references on the Internet for regular expressions.
With RegexBuddy, we can use a visual interface for building, testing, and debugging regular expressions. The debug feature is almost unique, and provides a great mechanism to understand how the regular expression engine is working behind the scenes. In the following screenshot, we can see RegexBuddy debugging the execution of a regular expression:
It does have other features, such as a library of commonly used regular expressions and a code generator for different programming environments.
Although it has a couple of downsides...
Understanding the Python regex engine
The re
module uses a backtracking regular expression engine; although, in the very well-known book Mastering Regular Expressions by Jeffrey E. F. Friedl, it is classified as Nondeterministic Finite Automata (NFA) type. Also, according to Tim Peters (https://mail.python.org/pipermail/tutor/2006-January/044335.html), the module is not purely NFA.
These are the most common characteristics of the algorithm:
It supports "lazy quantifiers" such as *?
, +?
, and ??
.
It matches the first coincidence, even though there are longer ones in the string.
This also means that order is important.
The algorithm tracks only one transition at one step, which means that the engine checks one character at a time.
Backreferences and capturing parentheses are supported.
Backtracking is the ability to remember the last successful position so that it can go back and retry if needed
In the worst case, complexity...
Optimization recommendations
In the following sections, we will find a number of recommendations that could be applied to improve regular expressions.
The best tool will always be common sense, and common sense will need to be used even while following these recommendations. It has to be understood when the recommendation is applicable and when it is not. For instance, the recommendation don't be greedy cannot be used in all the cases.
We have learned in Chapter 2, Regular Expressions with Python, that to use a regular expression we have to convert it from its string representation to a compiled form as RegexObject
.
This compilation takes some time. If we are using the rest of the module operations instead of using the compile function to avoid the creation of the RegexObject
, we should understand that the compilation is executed anyway and a number of compiled RegexObject
are cached automatically.
However, when we are compiling, that cache won't back us. Every single...
In this final chapter, we have started learning the relevance of optimization and why we should avoid premature optimization by measuring. Then, we jumped into the topic of measuring by learning different mechanisms to measure the time of execution for our regular expressions. Later, we found out about the RegexBuddy tool that can help us to understand how the engine is doing its work and aiding us in pinpointing the performance problems.
Later on, we understood how to see the engine working behind the scenes. We learned some theory of the engine design and how it's easy to fall in a common pitfall—the catastrophic backtracking.
Finally, we reviewed different general recommendations to improve the performance of our regular expressions.