Recognizing the slow parts of your program is the single most important task when it comes to speeding up your code. Luckily, in most cases, the code that causes the application to slow down is a very small fraction of the program. By locating those critical sections, you can focus on the parts that need improvement without wasting time in micro-optimization.
Profiling is the technique that allows us to pinpoint the most resource-intensive spots in an application. A profiler is a program that runs an application and monitors how long each function takes to execute, thus detecting the functions in which your application spends most of its time.
Python provides several tools to help us find these bottlenecks and measure importantÂ performance metrics. In this chapter, we will learn how to use the standard
cProfile module and theÂ
line_profilerÂ third-party package.Â We will also learn how to profile an application's memory consumption through theÂ
memory_profiler tool. Another useful tool that we will cover is KCachegrind, which can be usedÂ to graphically display the data produced by various profilers.
Benchmarks are small scripts used toÂ assess the total execution time of your application. We will learn how to write benchmarks and how to accurately time your programs.
The list of topics we will cover in this chapter is as follows:
- General principles of high performanceÂ programming
- Writing tests and benchmarks
- The Unix
- The Python
- Testing and benchmarking with
- Profiling your application
- Interpreting profiling results with KCachegrind
- Disassembling Python code through the
When designing a performance-intensive program, the very first step is to write your code without bothering with small optimizations:
"Premature optimization is the root of all evil."
-Â Donald Knuth
In the early development stages, the design of the program can change quickly and may require large rewrites and reorganizationsÂ of the code base. By testing different prototypes without the burden of optimization, you are free to devote your time and energy to ensure that the program produces correct results and that the design is flexible. After all, who needs an application that runs fast but gives the wrong answer?
The mantras that you should remember when optimizing your code are as follows:
- Make it run: We have to get the software in a working state, and ensure that it produces the correct results. This exploratory phase serves to betterÂ understand the applicationÂ and to spot major design issues in the early stages.
- Make it right: We want to ensure that the design of the program is solid. Refactoring should be done before attempting any performance optimization. This really helps separate the application into independent and cohesive units that are easier to maintain.
- Make it fast: Once our program is working and is well structured, we can focus on performance optimization. We may also want to optimize memory usage if that constitutes an issue.
In this section, we will write and profile a particle simulator test application. The simulator is a program that takes some particles and simulates their movement overÂ time according to a set of laws that we impose. These particles can be abstract entities or correspond to physical objects, for example, billiard balls moving on a table, molecules in gas, stars moving through space, smoke particles, fluids in a chamber, and so on.
Computer simulations are useful in fields such as Physics, Chemistry, Astronomy, and many other disciplines. The applications used to simulate systems are particularly performance-intensive and scientists and engineers spend an inordinate amount of time optimizing these codes. In order to study realistic systems, it is often necessary to simulate a very high number of bodies and every small increase in performance counts.
In our first example, we will simulate a system containing particles that constantly rotate around a central point at various speeds, just like the hands of a clock.
The necessary information to run our simulation will be the starting positions of the particles, the speed, and the rotation direction. From these elements, we have to calculate the position of the particle in the next instant of time. An example system is shown in the following figure. The origin of the system is the
(0, 0) point, the position is indicated by the x, y vector and the velocity is indicated by the vx, vy vector:
The basic feature of a circular motion is that the particles always move perpendicular to the direction connecting the particle and the center. To move the particle, we simply change the position by taking a series of very small steps (which correspond to advancing the system for a small interval of time) in the direction of motion, as shown in the following figure:
We will start by designing the application in an object-oriented way. According to our requirements, it is natural to have a generic
Particle class that stores the particle positions,Â
y, and theirÂ angular velocity,
Â Â Â class Particle: Â Â Â def __init__(self, x, y, ang_vel): Â Â Â self.x = x Â Â Â self.y = y Â Â Â self.ang_vel = ang_vel
Note that we accept positive and negative numbers for all the parameters (the sign of
ang_vel will simply determine the direction of rotation).
Another class, called
ParticleSimulator, will encapsulate theÂ laws of motion and will be responsible for changing the positions of the particles over time. The
__init__ method will store a list of
Particle instances and the
evolve method will change the particle positions according to our laws.
We want the particles to rotate around the position corresponding to theÂ
y=0Â coordinates, at a constant speed. The direction of the particles will always be perpendicular to the direction from the center (refer to the first figure of this chapter). To find the direction of the movement along the x and y axes (corresponding to the PythonÂ
v_yÂ variables), it is sufficient to use these formulae:
Â Â Â v_x = -y / (x**2 + y**2)**0.5 Â Â Â v_y = x / (x**2 + y**2)**0.5
If we let one of our particles move, after a certain time t, it will reach another position following a circular path. We can approximate a circular trajectory by dividing the time interval,Â t, into tiny time steps, dt, where the particle moves in a straight line tangentially to the circle. The final result is just an approximation of a circular motion. In order to avoid a strong divergence, such as the one illustrated in the following figure, it is necessary to take very small time steps:
In a more schematic way, we have to carry out the following steps to calculate the particle position at time t:
- Calculate the direction of motion (
- Calculate the displacement (
d_y), which is the product of time step, angular velocity, and direction of motion.
- Repeat steps 1 and 2 for enough times to cover the total time t.
The following code shows the full
Â Â Â class ParticleSimulator: Â Â Â def __init__(self, particles): Â Â Â self.particles = particles Â Â Â def evolve(self, dt): Â Â Â timestep = 0.00001 Â Â Â nsteps = int(dt/timestep) Â Â Â Â Â Â for i in range(nsteps): Â Â Â for p in self.particles: Â Â Â # 1. calculate the direction Â Â Â norm = (p.x**2 + p.y**2)**0.5 Â Â Â v_x = -p.y/norm Â Â Â v_y = p.x/norm Â Â Â # 2. calculate the displacement Â Â Â d_x = timestep * p.ang_vel * v_x Â Â Â d_y = timestep * p.ang_vel * v_y Â Â Â p.x += d_x Â Â Â p.y += d_y Â Â Â # 3. repeat for all the time steps
We can use the
matplotlib library to visualize our particles. This library is not included in the Python standard library, andÂ it can be easily installed using the
pip install matplotlibÂ command.
Alternatively, you can use the Anaconda Python distribution (https://store.continuum.io/cshop/anaconda/) that includes
matplotlib and most of the other third-party packages used in this book. Anaconda is free and is available for Linux, Windows, and Mac.
To make an interactive visualization, we will use the
matplotlib.pyplot.plot function toÂ display theÂ particles as pointsÂ and the
matplotlib.animation.FuncAnimation class to animate the evolution of theÂ particles over time.
visualize function takes a particle
ParticleSimulator instance as an argument and displays the trajectory in an animated plot. The steps necessary to displayÂ the particle trajectory using the
matplotlib toolsÂ are as follows:
- Set up the axes and use theÂ
plotfunction to display the particles.
plottakes a list of x and y coordinates.
- Write an initialization function,Â
init, and a function,
animate, that updates the x andÂ yÂ coordinates using the
- Create a
FuncAnimationinstance by passing the
animatefunctions plus theÂ
intervalÂ parameters, which specify the update interval, and
blit, which improves the update rate of the image.
- Run the animation with
Â Â Â from matplotlib import pyplot as plt Â Â Â from matplotlib import animation Â Â Â def visualize(simulator): Â Â Â X = [p.x for p in simulator.particles] Â Â Â Y = [p.y for p in simulator.particles] Â Â Â fig = plt.figure() Â Â Â ax = plt.subplot(111, aspect='equal') Â Â Â line, = ax.plot(X, Y, 'ro') Â Â Â Â Â Â # Axis limits Â Â Â plt.xlim(-1, 1) Â Â Â plt.ylim(-1, 1) Â Â Â # It will be run when the animation starts Â Â Â def init(): Â Â Â line.set_data(, ) Â Â Â return line, # The comma is important! Â Â Â def animate(i): Â Â Â # We let the particle evolve for 0.01 time units Â Â Â simulator.evolve(0.01) Â Â Â X = [p.x for p in simulator.particles] Â Â Â Y = [p.y for p in simulator.particles] Â Â Â line.set_data(X, Y) Â Â Â return line, Â Â Â # Call the animate function each 10 ms Â Â Â anim = animation.FuncAnimation(fig, Â Â Â animate, Â Â Â init_func=init, Â Â Â blit=True, Â Â Â interval=10) Â Â Â plt.show()
To test things out, we define a small function,Â
test_visualize, that animates a system of three particles rotating in different directions. Note that the third particle completes a round three times faster than the others:
Â Â Â def test_visualize(): Â Â Â particles = [Particle(0.3, 0.5, 1), Â Â Â Particle(0.0, -0.5, -1), Â Â Â Particle(-0.1, -0.4, 3)] Â Â Â simulator = ParticleSimulator(particles) Â Â Â visualize(simulator) Â Â Â if __name__ == '__main__': Â Â Â test_visualize()
test_visualize function is helpful to graphically understand the system time evolution. In the following section, we will write more test functions to properly verify program correctness and measure performance.
Now that we have a working simulator, we can start measuring our performance and tune-up our code so that theÂ simulator can handle as many particles as possible. As a first step, we will write a test and a benchmark.
We need a test that checks whether the results produced by the simulation are correct or not. Optimizing a program commonly requires employing multiple strategies; as we rewrite our code multiple times, bugs may easily be introduced. AÂ solid test suite ensures that the implementation is correct at every iteration so that we are free to go wild and try different things with the confidence that, if the test suite passes, the code will still work as expected.
Our test will take three particles, simulate themÂ forÂ 0.1 time units, and compare the results with those from a reference implementation. A good way to organize your tests is using a separate function for each different aspect (or unit) of your application. Since our current functionality is included in theÂ
evolve method, our function will be named
test_evolve. The following code shows the
test_evolve implementation. Note that, in this case, we compare floating point numbers up to a certainÂ precision through the
Â Â Â def test_evolve(): Â Â Â particles = [Particle( 0.3, 0.5, +1), Â Â Â Particle( 0.0, -0.5, -1), Â Â Â Particle(-0.1, -0.4, +3)] Â Â Â simulator = ParticleSimulator(particles) Â Â Â simulator.evolve(0.1) Â Â Â p0, p1, p2 = particles Â Â Â def fequal(a, b, eps=1e-5): Â Â Â return abs(a - b) < eps Â Â Â assert fequal(p0.x, 0.210269) Â Â Â assert fequal(p0.y, 0.543863) Â Â Â assert fequal(p1.x, -0.099334) Â Â Â assert fequal(p1.y, -0.490034) Â Â Â assert fequal(p2.x, 0.191358) Â Â Â assert fequal(p2.y, -0.365227) Â Â Â if __name__ == '__main__': Â Â Â test_evolve()
A testÂ ensures the correctness of our functionality but gives little information aboutÂ its running time. Â A benchmark is a simple and representative use case that can be run to assess the running time of an application. Benchmarks are very useful to keep score ofÂ how fast our programÂ is with each new version that we implement.
We can write a representative benchmark by instantiating a thousandÂ
Particle objects with random coordinates and angular velocity, and feed them to a
ParticleSimulator class. We then let the system evolve forÂ 0.1 time units:
Â Â Â from random import uniform Â Â Â def benchmark(): Â Â Â particles = [Particle(uniform(-1.0, 1.0), Â Â Â uniform(-1.0, 1.0), Â Â Â uniform(-1.0, 1.0)) Â Â Â for i in range(1000)] Â Â Â simulator = ParticleSimulator(particles) Â Â Â simulator.evolve(0.1) Â Â Â if __name__ == '__main__': Â Â Â benchmark()
A very simple way to time aÂ benchmark is through the Unix
time command. Using theÂ
timeÂ command, as follows, you can easily measure the execution time of an arbitrary process:
$ time python simul.py real 0m1.051s user 0m1.022s sys 0m0.028s
time command is not available for Windows. To install Unix tools, such as
time, on Windows you can use theÂ
cygwin shell, downloadable from the official website (http://www.cygwin.com/). Alternatively, you can use similar PowerShell commands, such as
Measure-Command (https://msdn.microsoft.com/en-us/powershell/reference/5.1/microsoft.powershell.utility/measure-command), to measure execution time.
timeÂ displays three metrics:
real: The actual time spent running the process from start to finish, as if it was measured by a human with a stopwatch
user: The cumulative time spent by all the CPUs during the computation
sys: The cumulative time spent by all the CPUs during system-related tasks, such as memory allocation
Note that sometimes
sys might be greater than
real, as multiple processors may work in parallel.
time also offers richerÂ formatting options. For an overview, you can explore its manual (using the
man time command). If you want a summary of all the metrics available, you can use the
time command is one of the simplest and more direct ways to benchmark a program. For anÂ accurate measurement, the benchmark should be designed to have a long enoughÂ execution time (in the order of seconds) so that theÂ setup and tear-down of the process is small compared to the execution time of the application. The
user metric is suitable as a monitor for the CPU performance, while theÂ
real metric also includes the time spent in other processes whileÂ waiting for I/O operations.
Another convenient way to time Python scripts is the
timeit module. This module runs a snippet of code in a loop for n times and measuresÂ the total execution times. Then, it repeats the same operation r times (by default, the value of r is
3) and records the time of the best run. Due toÂ this timing scheme,
timeitÂ is an appropriate tool to accurately time small statements in isolation.
timeit module can be used as a Python package, from the command line or from IPython.
IPython is a Python shell design that improves the interactivity of the Python interpreter. It boosts tab completion and many utilities to time, profile, and debug your code. We will use this shell to try out snippets throughout the book. The IPython shell accepts magic commands--statements that start with a
% symbol--that enhance the shell with special behaviors. Commands that start with
%% are called cell magics, whichÂ can be applied on multi-line snippets (termed asÂ cells).
IPython is available on most Linux distributions through
pipÂ and is included in Anaconda.
You can use IPython as a regular Python shell (
ipython), but it is also available in a Qt-based version (
ipython qtconsole) and as a powerful browser-based interface (
In IPython and command-line interfaces, it is possible to specify the number of loops or repetitions with theÂ
-r options. If not specified, they will be automatically inferred by
timeit. When invoking
timeit from the command line, you can also passÂ some setup code, through the
-s option, whichÂ will execute before the benchmark. In the following snippet, the IPython command line and Python module version of
timeit are demonstrated:Â
# IPython Interface $ ipython In : from simul import benchmark In : %timeit benchmark() 1 loops, best of 3: 782 ms per loop # Command Line Interface $ python -m timeit -s 'from simul import benchmark' 'benchmark()' 10 loops, best of 3: 826 msec per loop # Python Interface # put this function into the simul.py script import timeit result = timeit.timeit('benchmark()', setup='from __main__ import benchmark', number=10) # result is the time (in seconds) to run the whole loop result = timeit.repeat('benchmark()', setup='from __main__ import benchmark', number=10, repeat=3) # result is a list containing the time of each repetition (repeat=3 in this case)
Note that while the command line and IPython interfaces automatically infer a reasonable number of loops
n, the Python interface requires you to explicitly specify a value through theÂ
time command is a versatile tool that can be used to assess the running time of small programs on a variety of platforms. For larger Python applications and libraries, a more comprehensive solution that deals with both testing and benchmarking is
pytest, in combination with its
In this section, we will write a simple benchmark for our application using the
pytest testing framework. For the interested reader, the
pytest documentation, which can be found atÂ http://doc.pytest.org/en/latest/, is the best resource to learn more about the framework and its uses.
You can install
pytest from the console using the
pip install pytestÂ command. The benchmarking plugin can be installed, similarly, by issuing the
pip install pytest-benchmarkÂ command.
A testing framework is a set of tools that simplifies writing, executing, and debugging tests and provides rich reports and summaries of the test results. When using the
pytest framework, it is recommended to place tests separately from the application code. In the following example, we create the
test_simul.pyÂ file, which contains the
Â Â Â from simul import Particle, ParticleSimulator Â Â Â def test_evolve(): Â Â Â particles = [Particle( 0.3, 0.5, +1), Â Â Â Particle( 0.0, -0.5, -1), Â Â Â Particle(-0.1, -0.4, +3)] Â Â Â simulator = ParticleSimulator(particles) Â Â Â simulator.evolve(0.1) Â Â Â Â Â Â p0, p1, p2 = particles Â Â Â def fequal(a, b, eps=1e-5): Â Â Â return abs(a - b) < eps Â Â Â assert fequal(p0.x, 0.210269) Â Â Â assert fequal(p0.y, 0.543863) Â Â Â assert fequal(p1.x, -0.099334) Â Â Â assert fequal(p1.y, -0.490034) Â Â Â assert fequal(p2.x, 0.191358) Â Â Â assert fequal(p2.y, -0.365227)
pytest executable can be used from the command line to discover and run tests contained in Python modules. To execute a specific test, we can use the
pytest path/to/module.py::function_nameÂ syntax. To execute
test_evolve,Â we can type the following command in a console to obtain simple but informative output:
$ pytest test_simul.py::test_evolve platform linux -- Python 3.5.2, pytest-3.0.5, py-1.4.32, pluggy-0.4.0 rootdir: /home/gabriele/workspace/hiperf/chapter1, inifile: plugins: collected 2 items test_simul.py . =========================== 1 passed in 0.43 seconds ===========================
Once we have a test in place, it is possible for you to executeÂ your test as a benchmark using the
pytest-benchmark plugin. If we change our
test function so that it accepts an argument named
pytest framework will automatically pass the
benchmark resource as an argument (in
pytest terminology, these resources are called fixtures). The benchmark resource can be called by passing the function that we intend to benchmark as the first argument, followed by the additional arguments. In the following snippet, we illustrate the edits necessary to benchmark the
Â Â Â from simul import Particle, ParticleSimulator Â Â Â def test_evolve(benchmark): Â Â Â # ... previous code Â Â Â benchmark(simulator.evolve, 0.1)
To run the benchmark, it is sufficient to rerun the
pytest test_simul.py::test_evolveÂ command. The resulting output will contain detailed timing information regarding the
test_evolve function, as shown:
For each test collected,Â
pytest-benchmarkÂ will execute the benchmark function several times and provide a statistic summary of its running time. The output shown earlierÂ is very interesting as it shows how running times vary between runs. In this example, the benchmark in
test_evolve was run
34 times (column
Rounds), its timings ranged between
41 ms (
Max), and the
Median times were fairly similar at about
30 ms, which is actually very close to the best timing obtained. This example demonstrates how there can be substantial performance variability between runs, and that when taking timings with one-shot tools such as
time, it is a good idea to run the program multiple times and record a representative value, such as the minimum or the median.
pytest-benchmark has many more features and options that can be used to take accurate timings and analyze the results. For more information, consult the documentation at http://pytest-benchmark.readthedocs.io/en/stable/usage.html.
After assessing the correctness and timing the execution time of the program, we are ready to identify the parts of the code that need to be tuned for performance. Those parts are typically quite small compared to the size of the program.
TwoÂ profiling modules are available through the Python standard library:
profilemodule: This module is written in pure Python and adds a significant overhead to the program execution. Its presence in the standard library is because of its vast platform supportÂ and because it is easier to extend.
cProfilemodule: This is the main profiling module, with an interface equivalent to
profile. It is written in C, has a small overhead, and is suitable as a general purpose profiler.
cProfile module can be used in threeÂ different ways:
- From the command line
- As a Python module
- WithÂ IPython
cProfileÂ does not require any change in the source code and can be executed directly on an existing Python script or function.Â You can use
cProfile from the command line in this way:
$ python -m cProfile simul.py
This will print a long output containing several profiling metrics ofÂ all of the functions called in the application. You can use the
-s option to sort the output by a specific metric. In the following snippet ,the output is sorted by the
tottimeÂ metric, which will be described here:
$ python -m cProfile -s tottime simul.py
The data produced by
cProfile can be saved in anÂ output file by passing the
-o option. The format that
cProfile uses is readable by the
stats module and other tools. The usage of theÂ
-o option is as follows:
$ python -m cProfile -o prof.out simul.py
The usage of cProfile as a Python module requires invoking the
cProfile.runÂ function in the following way:
Â Â Â from simul import benchmark Â Â Â import cProfile Â Â Â cProfile.run("benchmark()")
You can also wrap a section of code between method calls of a
cProfile.Profile object, as shown:
Â Â Â from simul import benchmark Â Â Â import cProfile Â Â Â pr = cProfile.Profile() Â Â Â pr.enable() Â Â Â benchmark() Â Â Â pr.disable() Â Â Â pr.print_stats()
cProfileÂ can also be usedÂ interactively with IPython. The
%prun magic command lets you profile an individual function call, asÂ illustrated:
cProfile output is divided into five columns:
ncalls: The number of times the function was called.
tottime: The total time spent in the function without taking into account the calls to other functions.
cumtime: The time in the function including other function calls.
percall: The time spent for a single call of the function--it can be obtained by dividing the total or cumulative time by the number of calls.
filename:lineno: The filename and corresponding line numbers. This information is not availableÂ when calling C extensions modules.
The most important metric is
tottime, the actual time spent in the function body excluding subcalls, which tell us exactly where the bottleneck is.
Unsurprisingly, the largest portion of time is spent in the
evolve function. We can imagine that the loop is the section of the code that needsÂ performance tuning.
cProfile only provides information at the function level and does not tell us which specific statementsÂ are responsible for the bottleneck. Fortunately, as we will see in the next section, the Â
line_profilerÂ tool is capable of providing line-by-line information of the time spent in the function.
cProfileÂ text output can be daunting for big programs with a lot of calls and subcalls. Some visualÂ tools aid the task by improving navigation with an interactive, graphical interface.
KCachegrind is a Graphical User Interface (GUI)Â useful to analyze the profiling output emitted by
KCachegrind is available in the Ubuntu 16.04 official repositories. The Qt port, QCacheGrind, can be downloaded for Windows from http://sourceforge.net/projects/qcachegrindwin/.Â Mac users can compile QCacheGrind using Mac Ports (http://www.macports.org/) by following the instructions present in the blog post atÂ http://blogs.perl.org/users/rurban/2013/04/install-kachegrind-on-macosx-with-ports.html.
KCachegrind can't directly read the output files produced by
cProfile. Luckily, the
pyprof2calltree third-party Python module is able to convert the
cProfile output file into a format readable by KCachegrind.
You can install
pyprof2calltreeÂ from the Python Package IndexÂ using the command
pip install pyprof2calltree.
To best show the KCachegrind features, we will use another example with a more diversified structure. We define a
factorial, and two other functions that use
taylor_sin. They represent the polynomial coefficients of the Taylor approximations of
Â Â Â def factorial(n): Â Â Â if n == 0: Â Â Â return 1.0 Â Â Â else: Â Â Â return n * factorial(n-1) Â Â Â def taylor_exp(n): Â Â Â return [1.0/factorial(i) for i in range(n)] Â Â Â def taylor_sin(n): Â Â Â res =  Â Â Â for i in range(n): Â Â Â if i % 2 == 1: Â Â Â res.append((-1)**((i-1)/2)/float(factorial(i))) Â Â Â else: Â Â Â res.append(0.0) Â Â Â return res Â Â Â def benchmark(): Â Â Â taylor_exp(500) Â Â Â taylor_sin(500) Â Â Â if __name__ == '__main__': Â Â Â benchmark()
To access profile information, we first need to generate the
cProfile output file:
$ python -m cProfile -o prof.out taylor.py
Then, we can convert the output file with
pyprof2calltree and launch KCachegrind:
$ pyprof2calltree -i prof.out -o prof.calltree $ kcachegrind prof.calltree # or qcachegrind prof.calltree
The output is shown in the following screenshot:
The preceding screenshot showsÂ the KCachegrind user interface. On the left, we have an output fairly similar to
cProfile. The actual column names are slightly different:
Incl. translates to
Self translates to
tottime. The values are given in percentages by clicking on the
Relative button on the menu bar.Â By clicking on the column headers, you can sort them by the corresponding property.
On the top right, a click on the
Callee Map tab will display a diagram of the function costs. In the diagram, the time percentage spent by the function is proportional to the area of the rectangle. Rectangles can contain sub-rectangles that represent subcalls to other functions. In this case, we can easily see that there are two rectangles for the
factorial function. The one on the left corresponds to the calls made by
taylor_exp and the one on the right to the calls made by
On the bottom right, you can display another diagram, the call graph,Â by clicking on the
Call Graph tab. A call graph is a graphical representation of the calling relationship between the functions; each square represents a function and the arrows imply a calling relationship. For example,
500 times, and
taylor_sin callsÂ factorial
250 times. KCachegrind also detects recursive calls:
factorial calls itself
You can navigate to the
Call Graph or the
Caller Map tab by double-clicking on the rectangles; the interface will update accordingly, showing that the timing properties are relative to the selected function. For example, double-clicking on
taylor_exp will cause the graph to change, showing only the contribution ofÂ
taylor_expÂ to the total cost.
Gprof2Dot (https://github.com/jrfonseca/gprof2dot) is another popular tool used to produce call graphs. Starting from output files produced by one of the supported profilers, it will generate a
.dot diagram representing the call graph.
Now that we know which function we have to optimize, we can use the
line_profiler moduleÂ that provides information on how time is spent in a line-by-line fashion. This is very useful in situations where it's difficult to determine which statements are costly. The
line_profiler module is a third-party module that is available on the Python Package Index and can be installed by following the instructions atÂ https://github.com/rkern/line_profiler.
In order to use
line_profiler,Â we need to apply a
@profile decorator to the functions we intend to monitor. Note that you don't have to import the
profile function from another module as it gets injected in the global namespace when running the
kernprof.pyÂ profiling script. To produce profiling output for our program, we need to add the
@profile decorator to the
Â Â Â @profile Â Â Â def evolve(self, dt): Â Â Â # code
kernprof.py script will produce an output file and will print the result of the profiling on the standard output. We should run the script withÂ two options:
-lto use the
-vto immediately print the results on screen
The usage of
kernprof.py is illustrated in the following line of code:
$ kernprof.py -l -v simul.py
It is also possible to run the profiler in an IPython shell for interactive editing. You should first load the
line_profiler extension that will provide the
lprunÂ magic command. Using that command, you can avoid adding the
The output is quite intuitive and is divided into six columns:
Line #: The number of the line that was run
Hits: The number of times that line was run
Time: The execution time of the line in microseconds (
Per Hit: Time/hits
% Time: Fraction of the total time spent executing that line
Line Contents: The content of the line
By looking at the percentage column, we can get a pretty good idea of where the time is spent. In this case, there are a few statements in the
for loop body with a cost of around 10-20 percent each.
Now that we have identified where exactly our application is spending most of its time, we can makeÂ some changes and assess the change in performance.
There are different ways to tune up our pure Python code. The way that produces the most remarkable results is to improve the algorithms used. In this case, instead of calculating the velocity and adding small steps, it will be more efficient (and correct as it is not an approximation) to express the equations of motion in terms of radius,Â
r, and angle,Â
alpha, (instead of
y), and then calculate the points on a circle using the following equation:
Â Â Â x = r * cos(alpha) Â Â Â y = r * sin(alpha)
Another way lies in minimizing the number of instructions. For example, we can precalculate the
timestep * p.ang_vel factor that doesn't change with time. We can exchange the loop order (first we iterate on particles, then we iterate on time steps) and put the calculation of the factor outside the loop on the particles.
The line-by-line profiling also showed that even simple assignment operations can take a considerable amount of time. For example, the following statement takes more than 10Â percent of the total time:
Â Â Â v_x = (-p.y)/norm
We can improve the performance of the loop by reducingÂ the number of assignment operations performed. To do that, we can avoid intermediate variables by rewriting the expression into a single, slightly more complex statement (note that the right-hand side gets evaluated completely before being assigned to the variables):
Â Â Â p.x, p.y = p.x - t_x_ang*p.y/norm, p.y + t_x_ang * p.x/norm
This leads to the following code:
Â Â Â def evolve_fast(self, dt): Â Â Â timestep = 0.00001 Â Â Â nsteps = int(dt/timestep) Â Â Â # Loop order is changed Â Â Â for p in self.particles: Â Â Â t_x_ang = timestep * p.ang_vel Â Â Â for i in range(nsteps): Â Â Â norm = (p.x**2 + p.y**2)**0.5 Â Â Â p.x, p.y = (p.x - t_x_ang * p.y/norm, Â Â Â p.y + t_x_ang * p.x/norm)
After applying the changes, we should verify that the result is still the same by running our test. We can then compare the execution times using our benchmark:
$ time python simul.py # Performance Tuned real 0m0.756s user 0m0.714s sys 0m0.036s $ time python simul.py # Original real 0m0.863s user 0m0.831s sys 0m0.028s
As you can see, we obtained only a modest increment in speed by making a pure Python micro-optimization.
Sometimes it's not easy to estimate how many operations a Python statement will take. In this section, we will dig into theÂ Python internals to estimate the performance of individual statements. In the CPython interpreter, Python code is first converted to an intermediate representation, the bytecode, and then executed by the Python interpreter.
To inspect how the code is converted to bytecode, we can use the
dis Python module (
dis stands for disassemble). Its usage is really simple; all that is needed is to call the
dis.dis function on the
Â Â Â import dis Â Â Â from simul import ParticleSimulator Â Â Â dis.dis(ParticleSimulator.evolve)
This will print, for each line in the function, a list of bytecode instructions. For example, the
v_x = (-p.y)/norm statement is expanded in the following set of instructions:
Â Â Â 29 85 LOAD_FAST 5 (p) Â Â Â 88 LOAD_ATTR 4 (y) Â Â Â 91 UNARY_NEGATIVE Â Â Â 92 LOAD_FAST 6 (norm) Â Â Â 95 BINARY_TRUE_DIVIDE Â Â Â 96 STORE_FAST 7 (v_x)
LOAD_FAST loads a reference of the
p variable onto the stack andÂ
LOAD_ATTR loads the
y attribute of the item present on top of the stack. The other instructions,
BINARY_TRUE_DIVIDE,Â simply do arithmetic operations on top-of-stack items. Finally, the result is stored in
By analyzing the
dis output, we can see that the first version of the loop produces
51 bytecode instructions while the second gets converted intoÂ
dis module helps discover how the statements get converted and serves mainly as an exploration and learning tool of the Python bytecode representation.
To improve our performance even further, we can keep trying to figure out other approaches to reduce the amount of instructions. It's clear, however, that this approach is ultimately limited by the speed of the Python interpreter and it is probably not the right tool for the job. In the following chapters, we will see how to speed up interpreter-limitedÂ calculations by executing fast specialized versions written in a lower level language (such as C or Fortran).
In some cases, high memory usage constitutes an issue. For example, if we want to handle a huge number of particles, we will incur a memory overhead due to the creation of many
memory_profiler module summarizes, in a way similar to
line_profiler, the memory usage of the process.
memory_profiler package is also available on the Python Package Index. You should also install the
psutil module (https://github.com/giampaolo/psutil) as an optional dependency thatÂ will make
memory_profilerÂ considerably faster.
memory_profiler also requires the instrumentation of the source code by placingÂ a
@profile decorator on the function we intend to monitor. In our case, we want to analyze the
We can slightly change
benchmark to instantiate a considerable amount (
Particle instances and decrease the simulation time:
Â Â Â def benchmark_memory(): Â Â Â particles = [Particle(uniform(-1.0, 1.0), Â Â Â uniform(-1.0, 1.0), Â Â Â uniform(-1.0, 1.0)) Â Â Â for i in range(100000)] Â Â Â simulator = ParticleSimulator(particles) Â Â Â simulator.evolve(0.001)
We can use
memory_profiler from an IPython shell through the
%mprunÂ magic command as shown in the following screenshot:
It is possible to runÂ
memory_profiler from the shell using the
mprof run command after adding the
Increment column, we can see that 100,000
Particle objects take
23.7 MiB of memory.
1 MiB (mebibyte) is equivalent toÂ 1,048,576 bytes. It is different from 1 MB (megabyte), which is equivalent to 1,000,000 bytes.
We can use
__slots__ on the
Particle class to reduce its memory footprint. This feature saves some memory by avoiding storing the variables of the instance in an internal dictionary. This strategy, however,Â has a drawback--it prevents the addition of attributes other than the ones specified in
Â Â Â class Particle: Â Â Â __slots__ = ('x', 'y', 'ang_vel') Â Â Â def __init__(self, x, y, ang_vel): Â Â Â self.x = x Â Â Â self.y = y Â Â Â self.ang_vel = ang_vel
We can now rerun our benchmark to assess the change in memory consumption, the result is displayed in the following screenshot:
By rewriting the
Particle class using
__slots__, we can save about
10 MiB of memory.
In this chapter, we introduced the basic principles of optimization and applied those principles to aÂ test application. When optimizing, the first thing to do is test and identify the bottlenecks in the application. We saw how to write and time a benchmark using the
time Unix command, the Python
timeit module, and the full-fledged
pytest-benchmark package. We learned how to profile our application using
memory_profiler, and how to analyze and navigate the profiling data graphically with KCachegrind.
In the next chapter, we will explore how to improve performance using algorithms and data structures available in the Python standard library. We will cover scaling, sample usage of several data structures, and learn techniques such as caching and memoization.