Identifying memory bottlenecks
Most code is limited in speed by the rate at which data can be moved around. This can manifest in various ways, including TLB misses (recall that the TLB, or translation lookaside buffer, is a cache of recently used virtual-to-physical address translations for memory pages), misses in the various levels of cache, or poorly predicted branches and ineffective prefetch operations. perf will sometimes report these kinds of issues as being backend-bound.
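As a rough illustration of how the access pattern alone can produce these symptoms, the following minimal sketch sums the same array twice: once sequentially and once with a large stride. The helper names strided_sum and time_strided_sum_ms are hypothetical and are not part of the example developed in this section; the timing is wall-clock only and is meant as a sketch, not a rigorous benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: visits every element of `data` exactly once, in an
// order controlled by `stride`. A stride of 1 is a sequential walk; a large
// stride lands on a different cache line (and possibly a different page)
// on every access, even though the total work is identical.
inline double strided_sum(const std::vector<float>& data, std::size_t stride) {
    assert(stride > 0);
    double total = 0.0;
    for (std::size_t start = 0; start < stride; ++start) {
        for (std::size_t i = start; i < data.size(); i += stride) {
            total += data[i];
        }
    }
    return total;
}

// Hypothetical helper: runs one traversal, stores the sum in `result`,
// and returns the elapsed wall-clock time in milliseconds.
inline double time_strided_sum_ms(const std::vector<float>& data,
                                  std::size_t stride,
                                  double& result) {
    const auto begin = std::chrono::steady_clock::now();
    result = strided_sum(data, stride);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

To try it, fill a vector that is much larger than the last-level cache (for instance, 1 << 26 floats, which is 256 MiB) and compare a stride of 1 with a stride of 1024. The value 1024 assumes 4-byte floats and 4 KiB pages, so that consecutive accesses fall on different pages; other systems may need a different value. Both calls perform the same additions, so any difference in run time comes from how the data moves through the memory hierarchy.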
To demonstrate the use of these tools, we need an example problem to work on. The example we’re going to look at in this section is computing the outer product of two vectors. This operation is almost entirely memory-bound: each output element requires only a single multiplication, so very little compute is involved. The basic function is defined as follows, and we can adjust the lengths of the vectors to see different problematic behaviors:
inline void outer_product(float* z,
                          const float* x,
                          const float* y,
...