The path to mastering performance in Python has just started. Profiling only takes us half way there. Measuring how our program is using the resources at its disposal only tells us where the problem is, not how to fix it.
In this article by Fernando Doglio, author of the book Mastering Python High Performance, we will cover the process of optimization, and to do that, we need to start with the basics. We'll keep it inside the language for now: no external tools, just Python and the right way to use it.
We will cover the following topics in this article:
(For more resources related to this topic, see here.)
This is one of the most common techniques used to improve the performance of a piece of code (namely a function). We can save the results of expensive function calls associated to a specific set of input values and return the saved result (instead of redoing the whole computation) when the function is called with the remembered input. It might be confused with caching, since it is one case of it, although this term refers also, to other types of optimization (such as HTTP caching, buffering, and so on)
This methodology is very powerful, because in practice, it'll turn what should have been a potentially very expensive call into a O(1) function call if the implementation is right. Normally, the parameters are used to create a unique key, which is then used on a dictionary to either save the result or obtain it if it's been already saved.
There is, of course, a trade-off to this technique. If we're going to be remembering the returned values of a memoized function, then we'll be exchanging memory space for speed. This is a very acceptable trade-off, unless the saved data becomes more than what the system can handle.
Classic use cases for this optimization are function calls that repeat the input parameters often. This will assure that most of the times, the memoized results are returned. If there are many function calls but with different parameters, we'll only store results and spend our memory without any real benefit, as seen in the following diagram:
You can clearly see how the blue bar (Fixed params, memoized) is clearly the fastest use case, while the others are all similar due to their nature.
Here is the code that generates values for the preceding chart. To generate some sort of time-consuming function, the code will call either the twoParams function or the twoParamsMemoized function several hundred times under different conditions, and it will log the execution time:
import math
import time
import random
class Memoized:
def __init__(self, fn):
self.fn = fn
self.results = {}
def __call__(self, *args):
key = ''.join(map(str, args[0]))
try:
return self.results[key]
except KeyError:
self.results[key] = self.fn(*args)
return self.results[key]
@Memoized
def twoParamsMemoized(values, period):
totalSum = 0
for x in range(0, 100):
for v in values:
totalSum = math.pow((math.sqrt(v) * period), 4) + totalSum
return totalSum
def twoParams(values, period):
totalSum = 0
for x in range(0, 100):
for v in values:
totalSum = math.pow((math.sqrt(v) * period), 4) + totalSum
return totalSum
def performTest():
valuesList = []
for i in range(0, 10):
valuesList.append(random.sample(xrange(1, 101), 10))
start_time = time.clock()
for x in range(0, 10):
for values in valuesList:
twoParamsMemoized(values, random.random())
end_time = time.clock() - start_time
print "Fixed params, memoized: %s" % (end_time)
start_time = time.clock()
for x in range(0, 10):
for values in valuesList:
twoParams(values, random.random())
end_time = time.clock() - start_time
print "Fixed params, without memoizing: %s" % (end_time)
start_time = time.clock()
for x in range(0, 10):
for values in valuesList:
twoParamsMemoized(random.sample(xrange(1,2000), 10), random.random())
end_time = time.clock() - start_time
print "Random params, memoized: %s" % (end_time)
start_time = time.clock()
for x in range(0, 10):
for values in valuesList:
twoParams(random.sample(xrange(1,2000), 10), random.random())
end_time = time.clock() - start_time
print "Random params, without memoizing: %s" % (end_time)
performTest()
The main insight to take from the preceding chart is that just like with every aspect of programming, there is no silver bullet algorithm that will work for all cases. Memoization is clearly a very basic way of optimizing code, but clearly, it won't optimize anything given the right circumstances.
As for the code, there is not much to it. It is a very simple, non real-world example of the point I was trying to send across. The performTest function will take care of running a series of 10 tests for every use case and measure the total time each use case takes. Notice that we're not really using profilers at this point. We're just measuring time in a very basic and ad-hoc way, which works for us.
The input for both functions is simply a set of numbers on which they will run some math functions, just for the sake of doing something.
The other interesting bit about the arguments is that, since the first argument is a list of numbers, we can't just use the args parameter as a key inside the Memoized class' methods. This is why we have the following line:
key = ''.join(map(str, args[0]))
This line will concatenate all the numbers from the first parameter into a single value, which will act as the key. The second parameter is not used here, because it's always random, which would imply that the key would never be the same.
Another variation of the preceding method is to pre-calculate all values from the function (assuming we have a limited number of inputs of course) during initialization and then refer to the lookup table during execution. This approach has several preconditions:
There are different approaches when it comes to architecting the lookup table, all offering different types of optimizations. It all depends on the type of application and solution that you're trying to optimize. Here is a set of examples.
This solution works by iterating over an unsorted list and checking the key against each element, with the associated value as the result we're looking for.
This is obviously a very slow method of implementation, with a big O notation of O(n) for both the average and worst case scenarios. Still, given the right circumstances, it could prove to be faster than calling the actual function every time.
In this case, using a linked list would improve the performance of the algorithm over using a simple list. However, it would still depend heavily on the type of linked list it is (doubly linked list, simple linked list with direct access to the first and last elements, and so on).
This method works using a one-dimensional dictionary lookup, indexed by a key consisting of the input parameters (enough of them create a unique key). In particular cases (like we covered earlier), this is probably one of the fastest lookups, even faster than binary search in some cases with a constant execution time (big O notation of O(1)).
Note that this approach is efficient as long as the key-generation algorithm is capable of generating unique keys every time. Otherwise, the performance could degrade over time due to the many collisions on the dictionaries.
This particular method is only possible if the list is sorted. This could potentially be an option depending on the values to sort. Yet, sorting them would require an extra effort that would hurt the performance of the entire effort. However, it presents very good results even in long lists (average big O notation of O(log n)). It works by determining in which half of the list the value is and repeating until either the value is found or the algorithm is able to determine that the value is not in the list.
To put all of this into perspective, looking at the Memoized class mentioned earlier, it implements a simple lookup on a dictionary. However, this would be the place to implement either of the other algorithms.
There are some classic example use cases for this type of optimization, but the most common one is probably the optimization of trigonometric functions. Based on the computing time, these functions are really slow. When used repeatedly, they can cause some serious damage to your program's performance.
This is why it is normally recommended to precalculate the values of these functions. For functions that deal with an infinite domain universe of possible input values, this task becomes impossible. So, the developer is forced to sacrifice accuracy for performance by precalculating a discrete subdomain of the possible input values (that is, going from floating points down to integer numbers).
This approach might not be ideal in some cases, since some systems require both performance and accuracy. So, the solution is to meet in the middle and use some form of interpolation to calculate the wanted value, based on the ones that have been precalculated. It will provide better accuracy. Even though it won't be as performant as using the lookup table directly, it should prove to be faster than doing the trigonometric calculation every time.
Let's look at some examples of this, for instance, for the following trigonometric function:
def complexTrigFunction(x):
return math.sin(x) * math.cos(x)**2
We'll take a look at how simple precalculation won't be accurate enough and how some form of interpolation will result in a better level of accuracy.
The following code will precalculate the values for the function on a range from -1000 to 1000 (only integer values). Then, it'll try to do the same calculation (only for a smaller range) for floating point numbers:
import math
import time
from collections import defaultdict
import itertools
trig_lookup_table = defaultdict(lambda: 0)
def drange(start, stop, step):
assert(step != 0)
sample_count = math.fabs((stop - start) / step)
return itertools.islice(itertools.count(start, step), sample_count)
def complexTrigFunction(x):
return math.sin(x) * math.cos(x)**2
def lookUpTrig(x):
return trig_lookup_table[int(x)]
for x in range(-1000, 1000):
trig_lookup_table[x] = complexTrigFunction(x)
trig_results = []
lookup_results = []
init_time = time.clock()
for x in drange(-100, 100, 0.1):
trig_results.append(complexTrigFunction(x))
print "Trig results: %s" % (time.clock() - init_time)
init_time = time.clock()
for x in drange(-100, 100, 0.1):
lookup_results.append(lookUpTrig(x))
print "Lookup results: %s" % (time.clock() - init_time)
for idx in range(0, 200):
print "%st%s" % (trig_results [idx], lookup_results[idx])
The results from the preceding code will help demonstrate how the simple lookup table approach is not accurate enough (see the following chart). However, it compensates for it on speed since the original function takes 0.001526 seconds to run while the lookup table only takes 0.000717 seconds.
The preceding chart shows how the lack of interpolation hurts the accuracy. You can see how even though both plots are quite similar, the results from the lookup table execution aren't as accurate as the trig function used directly. So, now, let's take another look at the same problem. However, this time, we'll add some basic interpolation (we'll limit the rage of values from -PI to PI):
import math
import time
from collections import defaultdict
import itertools
trig_lookup_table = defaultdict(lambda: 0)
def drange(start, stop, step):
assert(step != 0)
sample_count = math.fabs((stop - start) / step)
return itertools.islice(itertools.count(start, step), sample_count)
def complexTrigFunction(x):
return math.sin(x) * math.cos(x)**2
reverse_indexes = {}
for x in range(-1000, 1000):
trig_lookup_table[x] = complexTrigFunction(math.pi * x / 1000)
complex_results = []
lookup_results = []
init_time = time.clock()
for x in drange(-10, 10, 0.1):
complex_results .append(complexTrigFunction(x))
print "Complex trig function: %s" % (time.clock() - init_time)
init_time = time.clock()
factor = 1000 / math.pi
for x in drange(-10 * factor, 10 * factor, 0.1 * factor):
lookup_results.append(trig_lookup_table[int(x)])
print "Lookup results: %s" % (time.clock() - init_time)
for idx in range(0, len(lookup_results )):
print "%st%s" % (complex_results [idx], lookup_results [idx])
As you might've noticed in the previous chart, the resulting plot is periodic (specially because we've limited the range from -PI to PI). So, we'll focus on a particular range of values that will generate one single segment of the plot.
The output of the preceding script also shows that the interpolation solution is still faster than the original trigonometric function, although not as fast as it was earlier:
Interpolation Solution |
Original function |
0.000118 seconds |
0.000343 seconds |
The following chart is a bit different from the previous one, especially because it shows (in green) the error percentage between the interpolated value and the original one:
The biggest error we have is around 12 percent (which represents the peaks we see on the chart). However, it's for the smallest values, such as -0.000852248551417 versus -0.000798905501416. This is a case where the error percentage needs to be contextualized to see if it really matters. In our case, since the values related to that error are so small, we can ignore that error in practice.
There are other use cases for lookup tables, such as in image processing. However, for the sake of this article, the preceding example should be enough to demonstrate their benefits and trade-off implied in their usage.
Another optimization technique, one that is contrary to memoization, is not particularly generic. Instead, it is directly tied to how the Python interpreter works.
Default arguments can be used to determine values once at function creation time instead of at run time.
This can only be done for functions or objects that will not be changed during program execution.
Let's look at an example of how this optimization can be applied. The following code shows two versions of the same function, which does some random trigonometric calculation:
import math
#original function
def degree_sin(deg):
return math.sin(deg * math.pi / 180.0)
#optimized function, the factor variable is calculated during function creation time,
#and so is the lookup of the math.sin method.
def degree_sin(deg, factor=math.pi/180.0, sin=math.sin):
return sin(deg * factor)
This optimization can be problematic if not correctly documented. Since it uses attributes to precompute terms that should not change during the program's execution, it could lead to the creation of a confusing API.
With a quick and simple test, we can double-check the performance gain from this optimization:
import time
import math
def degree_sin(deg):
return math.sin(deg * math.pi / 180.0) * math.cos(deg * math.pi / 180.0)
def degree_sin_opt(deg, factor=math.pi/180.0, sin=math.sin, cos = math.cos):
return sin(deg * factor) * cos(deg * factor)
normal_times = []
optimized_times = []
for y in range(100):
init = time.clock()
for x in range(1000):
degree_sin(x)
normal_times.append(time.clock() - init)
init = time.clock()
for x in range(1000):
degree_sin_opt(x)
optimized_times.append(time.clock() - init)
print "Normal function: %s" % (reduce(lambda x, y: x + y, normal_times, 0) / 100)
print "Optimized function: %s" % (reduce(lambda x, y: x + y, optimized_times, 0 ) / 100)
The preceding code measures the time it takes for the script to finish each of the versions of the function to run its code 1000 times. It saves those measurements, and finally, it does an average for each case. The result is displayed in the following chart:
It clearly isn't an amazing optimization. However, it does shave off some microseconds from our execution time, so we'll keep it in mind. Just remember that this optimization could cause problems if you're working as part of an OS developer team.
In this article, we covered several optimization techniques. Some of them are meant to provide big boosts on speed, save memory. Some of them are just meant to provide minor speed improvements. Most of this article covered Python-specific techniques, but some of them can be translated into other languages as well.
Further resources on this subject: