Hands-On High Performance Programming with Qt 5

Understanding Performant Programs

In this introductory chapter, we'll start with some general discussion of program performance: why it's important, which factors determine it, and how programmers generally approach performance topics. We'll begin with a broad discussion of the relevance of performance in programming, before looking at some traditional performance-related knowledge, and we'll finish this chapter with the impact that modern CPU architectures have had on this field.

This chapter will therefore cover the following topics:

Why performance is important: To motivate ourselves before diving into technicalities
Traditional wisdom and basic guidelines: Old and proven performance knowledge
Modern processor architectures: At least the performance-relevant parts of them

Why performance is important

Maybe you just started reading this book out of curiosity and you're asking yourself this question: Why is performance important? Isn't that a thing of the past, when we didn't have enough CPU power and memory, and when our networks were grinding to a halt? In today's high-tech world, we have enough resources—the computers are so insanely fast!

Well, in principle, you're right to some degree, but consider the following:

A faster program runs more quickly, consuming less power along the way. This is good for the planet (if you're running it in a big server farm) and good for your user (if you're running it on a desktop computer).
A faster program means that it can serve more requests in the same time than a slower one. This is good for business, as you'll need to buy or lease fewer machines to serve your customers, and, again, it's good for the planet!
In the cut-throat competition of today's business world, faster software means an advantage over your competitors. This is nowhere more evident than in the world of automated trading (which is, by the way, dominated by C++), but it's also true that sluggishly loading websites and programs that need an eternity to start won't be used that much!
And, lastly, especially on mobile devices, we still have to cope with constrained resources: network speed is finite (limited by the speed of light) and battery life is finite too. As faster programs use fewer resources, they're good for users!

Our quest for performance will hence pursue a three-pronged objective: to save the planet, to strengthen your business, and to make the life of users better—not a small feat I'd say!

The price of performance optimization

So, everything is rosy? Well, not quite, as there's a dark side to performance optimization too. If we try to squeeze the last drops of performance from our hardware, we can end up with unreadable, inflexible, unmaintainable, and hence outright ugly, code!

So, be aware that there are caveats and that there's a price to be paid. We must decide whether we want to pay it, and at what point we'll stop the optimization to save the clarity of our code.

Traditional wisdom and basic guidelines

When I started with programming (a long time ago), the pieces of advice about performance optimization traditionally given to a newbie were the following:

Don't do it (yet)
Premature optimization is the root of all evil
First make it run, then make it right, then make it fast

The first piece of advice contained the yet only in its variant for experts; the second was (and still is) normally misquoted, leaving out the "in say 97% of the cases" part; and the third gives you the impression that merely writing a program is already so difficult that fretting about performance is a luxury. It's no wonder then that the normal approach to performance was to fix it later!

But all of these adages nonetheless highlight an important insight: performance isn't distributed evenly through your code. The 80-20, or maybe even the 90-10 rule, applies here, because there are some hotspots where extreme care is needed, but we shouldn't try to optimize every nook and cranny in our code. So, our first guideline concerns premature optimization: we should forget about it in, say, 95% of cases.

But what exactly are the 20%, 10%, or 5% of code where we shouldn't forget about it? Another age-old piece of programming wisdom states that programmers are notoriously bad at guessing performance bottlenecks.

So, we shouldn't try to predict the tight spots, but measure the performance of a finished program instead. This does sound a lot like the fix it later approach of the cowboy coder. Well, this book takes the stance that although premature optimization should be avoided, premature pessimization should be avoided at all costs, as it's even worse! However, avoiding premature pessimizations requires much detailed knowledge about which language constructs, which framework use cases, and which architectural decisions come with what kind of performance price tags. This book will try to provide this knowledge in the context of the Qt framework.

But, first, let's talk about some quite general principles that address the question of what should be avoided, lest performance degrade. As I see it, we can distill the following basic common-sense advice from the traditional performance wisdom:

Don't do the same thing twice.
Don't do slow things often.
Don't copy data unnecessarily.

You'll agree that none of that can be good for performance, won't you? So, let's discuss these three simple but fundamental insights in some more detail.

Avoiding repeated computation

The techniques falling under the first point are concerned with unneeded repetition of work. The basic countermeasure here is caching, that is, saving the results of a computation for later use. A more extreme example of avoiding repetition of work is to precompute results even before their first usage. This is normally achieved with precomputed tables (hand-coded or generated by a script) or, if your programming language allows it, with compile-time computation. In the latter case, we sacrifice compilation time for better run-time performance. We'll have a look at C++ compile-time techniques in Chapter 3, Deep Dive into C++ and Performance.
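Both countermeasures can be sketched in a few lines of C++14. This is only an illustration: the names SquareTable, makeSquareTable, kSquares, and cachedFib are made up for this example and don't come from Qt or any other library. The first part lets the compiler fill a lookup table, the second part caches results at run time (and, as written, isn't thread-safe).

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Precomputed table: the compiler fills it, so no work is left for run time.
struct SquareTable {
    std::uint32_t values[256];
};

constexpr SquareTable makeSquareTable() {
    SquareTable t{};
    for (std::size_t i = 0; i < 256; ++i)
        t.values[i] = static_cast<std::uint32_t>(i * i);
    return t;
}

constexpr SquareTable kSquares = makeSquareTable();
static_assert(kSquares.values[12] == 144, "table is built at compile time");

// Caching (memoization): each Fibonacci number is computed only once.
std::uint64_t cachedFib(unsigned n) {
    static std::unordered_map<unsigned, std::uint64_t> cache;
    if (n < 2)
        return n;
    auto it = cache.find(n);
    if (it != cache.end())
        return it->second;              // cache hit, nothing is recomputed
    std::uint64_t result = cachedFib(n - 1) + cachedFib(n - 2);
    cache[n] = result;
    return result;
}
```

Without the cache, the naive recursion would recompute the same values exponentially often; with it, every value is computed once and looked up afterward.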

Choosing the optimal algorithm and data structure also falls into that realm, as different algorithms and data structures are optimized for different use cases, and you have to make your choice wisely. We'll have a look at some gotchas pertaining to Qt's own data structures in Chapter 4, Using Data Structures and Algorithms Efficiently.

Very basic techniques, such as pulling repeated computations or initializations of local variables out of a loop, fall into that class as well, but I'm convinced you knew about this already.

Avoiding paying the high price

The techniques falling under the second point come into play if there's something we can't avoid doing, but it has a pretty high cost tagged on to it. An example of this is interaction with the operating system or hardware, such as writing data to a file or sending a packet over the network. In this case, we resort to batching, also known in I/O context as buffering—instead of writing or sending a couple of small chunks of data right away, we first gather them and then write or send them together to avoid paying the high cost each time.

On the other hand, we can also apply speculative techniques that do work ahead of time. In an I/O or memory context, this would be the prefetching of data, also known as read-ahead. When reading data from a file, we read more than the user actually requested, hoping that the next portion of data will be needed soon. In the networking context, there are examples of speculative pre-resolving of Domain Name System (DNS) addresses when a user is hovering over a link in a browser, or even pre-connecting to such addresses. However, such measures can backfire when the prediction fails, so these techniques require very careful tuning!

Related techniques to be mentioned in this context are the avoidance of system calls and the avoidance of locking, which spare us the cost of a system call and of switching to the kernel context.

We'll see some applications of such techniques in the last chapters of the book when we discuss I/O, graphics, and networking.

Another example of when this rule can be used is memory management. General-purpose memory allocators tend to incur rather high costs for single allocations, so the remedy is to preallocate one big buffer up front and then use it for all the needs of the program, managing it ourselves with a custom allocation strategy. If we additionally know how big our objects are going to be, we can just allocate several buffer pools for different object sizes, making the custom allocation strategy rather simple. Preallocating memory at the start used to be a classic measure to improve the performance of memory-intensive programs. We'll discuss these technical C++ details in Chapter 3, Deep Dive into C++ and Performance.
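A pool for objects of one known size might look like the following minimal sketch. FixedPool is a hypothetical class invented for this illustration, not an existing allocator; it assumes that blocks are only ever released to the pool they came from and it does no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size pool: one big allocation up front, then
// constant-time allocate()/release() without the general-purpose allocator.
class FixedPool {
public:
    FixedPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(roundUp(blockSize)),
          storage_(blockSize_ * blockCount)
    {
        freeList_.reserve(blockCount);
        for (std::size_t i = 0; i < blockCount; ++i)
            freeList_.push_back(storage_.data() + i * blockSize_);
    }

    void* allocate() {
        if (freeList_.empty())
            return nullptr;             // pool exhausted, caller must cope
        void* block = freeList_.back();
        freeList_.pop_back();
        return block;
    }

    void release(void* block) {
        assert(owns(block));            // only blocks from this pool
        freeList_.push_back(static_cast<unsigned char*>(block));
    }

    std::size_t freeBlocks() const { return freeList_.size(); }

private:
    // Round the block size up so that every block stays suitably aligned.
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    bool owns(const void* p) const {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        return b >= storage_.data() && b < storage_.data() + storage_.size();
    }

    std::size_t blockSize_;
    std::vector<unsigned char> storage_;    // the preallocated buffer
    std::vector<unsigned char*> freeList_;  // blocks currently available
};
```

Objects would then be constructed in the returned blocks with placement new. Because a released block is handed out again first, recently used memory tends to be reused while it's still in the cache.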

Avoiding copying data around

The techniques falling under the third point tend to be of a somewhat lower-level nature. The first example is avoiding copying data when passing parameters to a function call. A suitable choice of data structure will avoid copying of data as well: just think about an automatically growing vector, which copies its elements whenever it has to grow. In many cases, we can use preallocation techniques to prevent this (such as the reserve() method of std::vector) or choose a different data structure that better matches the intended use case.
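Both points can be made visible with a small experiment. The function names sum and countReallocations are made up for this sketch; the second one counts how often a growing std::vector changes its capacity, which is when its elements are copied or moved to new storage.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Passing by const reference: the caller's vector isn't copied.
long long sum(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Fills a vector with n elements and counts how often its storage had to
// grow while doing so.
std::size_t countReallocations(std::size_t n, bool reserveFirst) {
    std::vector<int> v;
    if (reserveFirst)
        v.reserve(n);                   // one allocation up front
    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != lastCapacity) {
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve(), the count is zero; without it, the exact number depends on the growth strategy of the standard library in use.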

Another common case when the copying of data can be a problem is string processing. Just adding two strings together will, in the naive implementation, allocate a new one and copy the contents of the two strings to be joined. And as much of programming contains some string manipulations, this can be a big problem indeed! The remedy for that could be using static string literals or just choosing a better library implementation for strings.

We'll discuss these themes in Chapter 3, Deep Dive into C++ and Performance, and Chapter 4, Using Data Structures and Algorithms Efficiently.

Another example of this optimization rule is the holy grail of network programming: the zero-copy sending and receiving of data. The idea is that data isn't copied between user buffers and the network stack before it is sent out. Most modern network hardware supports scatter-gather (also known as vectored I/O), where the data to be sent doesn't have to be provided in a single contiguous buffer but can be made available as a series of separate buffers.

In that way, a user's data doesn't have to be consolidated before sending, sparing us the copying of data. The same principle can be applied to software APIs as well; for example, Facebook's recent TLS 1.3 implementation (codename Fizz, open sourced) supports a scatter-gather API at the library level!
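On POSIX systems, vectored I/O is available to applications through the writev() system call. The following sketch assumes a POSIX platform; sendMessage is a hypothetical helper invented for this illustration. Note that it only avoids the consolidation copy in user space; it isn't a full zero-copy path.

```cpp
#include <sys/uio.h>    // writev(), struct iovec (POSIX)
#include <unistd.h>
#include <string>

// Sends header and payload with a single system call, without first
// concatenating them into one contiguous buffer.
ssize_t sendMessage(int fd, const std::string& header,
                    const std::string& payload) {
    iovec parts[2];
    parts[0].iov_base = const_cast<char*>(header.data());
    parts[0].iov_len = header.size();
    parts[1].iov_base = const_cast<char*>(payload.data());
    parts[1].iov_len = payload.size();
    return writev(fd, parts, 2);        // bytes written, or -1 on error
}
```

The const_cast is needed only because iov_base is declared as a non-const pointer; writev() doesn't modify the buffers.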

General performance optimization approach

Up to now, we listed the following classic optimization techniques:

Optimal algorithms
Optimal data structures
Caching
Precomputed tables
Preallocation and custom allocators
Buffering and batching
Read-ahead
Copy avoidance
Finding a better library

With our current state of knowledge, we can formulate the following general performance optimization procedure:

Write your code, avoiding unnecessary pessimizations where it doesn't cost much, as in the following examples:
- Pass parameters by reference.
- Use reasonably good, widely known algorithms and data structures.
- Avoid copying data and unnecessary allocations.

This alone should give you a pretty decent baseline performance.

Measure the performance, find the tight spots, and use some of the standard techniques listed. Then, measure again and iterate. This step must be done if the performance of our program isn't satisfactory despite our sound programming practices. Unfortunately, we can't know or anticipate everything that will happen in the complex interplay of hardware and software—there can always be surprises waiting for us.
If you still can't achieve good performance, then your hardware is probably too slow. Even with performance optimization techniques, we still can't do magic, sorry!

The preceding advice looks quite reasonable, and you might ask: Are we done? That wasn't that scary! Unfortunately, it's not the whole story. Enter the leaky abstraction of modern processor architectures.

Modern processor architectures

All the classic performance advice and algorithmic know-how stems from the times of simple CPU setups, where processor and memory speeds were roughly equal. But then processor speeds exploded, increasing quite faithfully to Moore's law by 60% per year, whereas memory access times improved by only 10% per year and couldn't keep pace with them. The problem is that main memory (dynamic random-access memory, or DRAM) contains minuscule capacitors that keep an electrical charge to indicate a 1 bit and none to indicate a 0 bit. This results in inexpensive circuitry that doesn't have to be kept under voltage, but it works basically in the analog realm and can't profit that much from the advances made in digital components.

The second change that has occurred since then is the demise of Moore's law in its simple form. Up to the early 2000s, CPU manufacturers steadily increased processor frequencies, making CPUs run faster and faster. That was achieved by increasing the number of transistors packed on chips, and Moore's law predicted that the number of transistors that can be packed on a chip would double roughly every two years. In simple terms, it was popularly understood as doubling the processor speed every 18 months.

This trend continued until processor manufacturers hit a physical barrier, the so-called power wall: at some point, the densely packed transistors produced so much heat that they couldn't be effectively cooled on consumer machines (high-end water-cooling systems are too expensive for a laptop or a mobile device), so a different approach to increasing a CPU's performance had to be found.

Caches

The attempts to overcome these problems led to a slew of architectural innovations. First, the mismatch between CPU and memory speeds was fought using a classic optimization technique we already know, namely, caching, at the chip level. On-chip static RAM (SRAM) requires six transistors (forming a flip-flop) per bit, and all of them must be kept under voltage. This means it's expensive and it drives the power consumption up (take care not to hit that power wall!). In exchange, memory access is lightning fast, as all that is needed is to apply the current to the input and read the output.

So, the idea is to add a small caching stage of expensive but fast on-chip memory in front of the big, slow, but inexpensive main memory. By now, modern CPUs can command up to three levels of caches, commonly denoted by the acronyms L1, L2, and L3, decreasing in density and speed but increasing in size as the cache level goes up. The figure below shows us an overview of the memory hierarchy typically found in modern CPUs:

As of the time of this writing, access times for L1 caches are on the order of 3 cycles, for L2 about 12 cycles, for L3 about 38 cycles, and for main memory around 100-300 cycles. The main memory access time is that high because the analog nature of DRAM requires, among other things, periodic charge refreshing, pre-charging of the read line before reading, analog-digital conversion, communication through the memory controller unit (MCU), and so on.

Caches are organized in cache lines, which on the current Intel architectures are 64 bytes long. Each cache update will hence fetch the entire cache line from main memory, doing a kind of prefetching already at that level. Speaking about prefetching, Intel processors have a special prefetch instruction we can invoke in assembler code for very low-level optimizations.
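Besides assembler, GCC and Clang expose software prefetching through the __builtin_prefetch() built-in. The following sketch assumes one of these compilers (on others, the hint is simply compiled out); sumIndirect and the lookahead distance of 8 are made up for this illustration, and whether such a hint helps at all has to be measured for each concrete case.

```cpp
#include <cstddef>
#include <vector>

// Sums data[] in the order given by order[]. As the access pattern is
// unpredictable for the hardware, we ask for elements a few steps ahead.
long long sumIndirect(const std::vector<int>& data,
                      const std::vector<std::size_t>& order) {
    const std::size_t lookahead = 8;    // arbitrary value, needs tuning
    long long total = 0;
    for (std::size_t i = 0; i < order.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + lookahead < order.size())
            __builtin_prefetch(&data[order[i + lookahead]],
                               0 /* prefetch for reading */,
                               1 /* low temporal locality */);
#endif
        total += data[order[i]];
    }
    return total;
}
```

The prefetch is only a hint: it never changes the result, and the processor is free to ignore it.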

In addition to data caches, there's also an instruction cache, because in the von Neumann architecture, both instructions and data are kept in a common memory. The instruction caches were added to Intel Pentium Pro (P6) as an experiment, but they have never been removed since.

Pipelining

Another possibility to increase a processor's speed is the instruction level parallelism (ILP), also known as superscalar computation.

The processing of a CPU instruction can internally be split into several stages, such as instruction fetch, decode, execute, and write-back. Before the Intel 486 processor, each instruction had to be finished before the next could be started. With pipelining, when the first stage of an instruction is ready, that instruction can be forwarded to the next stage, and the next instruction's processing can begin with its first stage. In that manner, several instructions can be in flight in parallel, keeping a processor's resources optimally utilized. The next figure illustrates this principle graphically, using a hypothetical four-stage pipeline:

The original Intel 486 pipeline was five stages long, but on modern processors, it can be much longer. For example, the current Intel Atom processors command a pipeline of 16 stages.

That's all well and good, but, unfortunately, there are some problems lurking in the corners.

Speculative execution and branch prediction

As long as the pipeline is pumping, everything is OK. But what if we encounter a conditional branch? The pipeline has to wait till the result of the test is known, hence the next instruction can be started only when the current one has finished. Welcome to the pre-80486 world! This is called a pipeline stall and defeats the whole purpose of pipelining. And because every program is literally dotted with if-then-else clauses, something must be done here.

The solution manufacturers came up with was speculative execution: instead of idling, we just start executing one of the branches speculatively. If we are lucky, we've just done the right thing, but, if not, we discard our speculative work, and we are on even ground with the pipeline stall case. As we decide randomly, we'll be right 50% of the time, and we just seriously increased the throughput of the pipeline!

The only problem is that the branches of the if clause are not equally probable! In most cases, they're even highly unevenly distributed: one of them is the error branch; the other is the normal case branch. But the processor doesn't know the meaning of the test, so what can we do? The solution to that is branch prediction: the processor learns about the branches in your code and can predict rather well which branch will be taken on a given condition.

This got complicated quickly, didn't it? If you're thinking that, you're not alone. Not so long ago, the programming world was shaken by the disclosure of the Spectre and Meltdown vulnerabilities, which allowed an attacker to see the contents of memory regions to which they have no access rights. The first part of the exploit is to fool the branch predictor into taking the wrong branch for speculative execution. After the processor detects the disallowed access, the results of the speculatively executed instructions are discarded, but the protected data is still present in the cache, from where it can be inferred with some complicated techniques we won't discuss here. These bugs basically called into question the processor optimizations of the last decade, as fixing them would incur meaningful performance losses.

Considering that, we are all rather curious about how CPU architectures will evolve next, aren't we?

Out-of-order execution

There's another refinement to the pipeline concept allowing an even higher utilization of a CPU's resources. Namely, as processor manufacturers started to add redundant processing units (Intel P6 already had two integer and two floating-point execution units), it became possible to execute two instructions in parallel.

Up until Pentium Pro (P6), instructions were fed into the pipeline in their order of appearance. But if there's a data dependency between two consecutive instructions, then they can't be processed in parallel, leaving the additional execution unit idle:

a = b + 1;  // 1
c = a + 5;  // 2
d = e + 10; // 3
f = d + 15; // 4

The solution to this problem is to take the next independent instruction and execute it before the dependent one. See the next diagram for a visual explanation:

Here, on the left side, we see the traditional execution preserving the instruction order and, on the right, parallel execution with reordering, where instruction 3 will be executed before instruction 2.

Multicore

The problem of the power wall was in the end overcome by freezing or even decreasing CPU frequencies while introducing processor cores working in parallel and adding more general-purpose registers as well as single instruction, multiple data (SIMD) vector registers and instructions. In a word, manufacturers either duplicated active processing units or added more elements that don't always have to be under voltage. In that way, the density of the transistors didn't have to be increased.

When all of the CPU cores are placed on one chip, we have a symmetric multiprocessing (SMP) CPU, because all cores can access their respective data within the local chip. The counterpart to that is a non-uniform memory access (NUMA) system, where we have several physically separate CPUs with their own internal caches. Memory access within one CPU is then much cheaper than access to the memory of another CPU. Another problem is cache coherence between the CPUs, which requires complicated cache-coherence protocols and can degrade performance. In the context of Qt's application area, we normally encounter SMP machines, so we'll ignore NUMA in this book.

In a multicore chip, the processing resources can be classified into core and uncore resources: those that are duplicated for each core, and those that aren't and must be shared. For example, the top-level cache (L3 or L2, depending on the processor) is an uncore resource shared among processor cores.

One often-encountered notion is that of hyperthreading. This is another idea for increasing a CPU's parallelism, and hence its resource utilization. A processor with hyperthreading provides two logical processors per core, each of which keeps its own internal state. The parts of the processor that hold its architectural state (for example, running, interrupted, and halted) are duplicated for each logical core, but the computation resources are shared among the logical cores. The intent is to increase the utilization of the processor's resources and prevent pipeline stalls by borrowing resources from the stalled logical core. The operating system then treats these two logical cores as if they were physical ones and has to be hyperthreading-aware to use such a system optimally.

Strictly speaking, graphics processing units (GPUs) are not a form of multicore, but we'll mention them in this context, because processors can ship with integrated GPUs on board. A GPU comprises many very simple processing units that can run massively parallel, although simple, computations. Normally, they're used to accelerate graphics processing, but they can also be used to speed up general computations, such as the training of neural networks in deep-learning applications. In a Qt context, however, we use GPUs only for their graphical capabilities.

Additional instruction sets

As already mentioned, to increase processor performance, chip manufacturers started to add more sophisticated instructions that can either vectorize computations or execute algorithms that hitherto had to be implemented in application code.

The SIMD or vector instructions can be used to parallelize a scalar computation by executing calculations on several scalar values in parallel. For that, we have to load several float values into each of two SIMD registers and then apply an operation to all of them at once. Intel processors introduced SIMD in a series of extensions, namely the following:

Streaming SIMD Extensions (SSE): Uses 128-bit registers and is available in several versions, from SSE through SSE2 up to SSE4
Advanced Vector Extensions (AVX): Uses 256-bit registers and is available in AVX2 and AVX-512 (for 512-bit registers) versions

As for the more specialized instructions, we mention the following Intel extensions:

Advanced Encryption Standard New Instructions (AES-NI): This implements the cryptographic AES encryption standard.
32-bit cyclic redundancy check (CRC32): This implements computation of the CRC32 error-detecting code.
SSE4.2: This SSE extension also implements basic string operations using SIMD registers.

Impact on performance

As you see, a modern CPU doesn't look like your daddy's old-fashioned von Neumann CPU at all. It's a very complex computer system in its own right, crammed onto a very small chip and desperately trying out all possible tricks to squeeze a little bit more performance out of your program.

On the other hand, modern CPUs are trying to appear simple to the programmer, unfortunately leaking the implementation specifics all of the time along the way. So, what changed since the days of classic performance advice we summed up in the preceding sections?

Fortunately, not much. The basic techniques and guidelines still apply, but some additional advice must be heeded. Let's discuss these new points.

Keeping your caches hot

First of all, don't thrash your caches because main memory is very slow. This means that reading data all over the memory (also known as pointer chasing) isn't the best of ideas. Modern processors like programs that read consecutive chunks of memory in a predictable manner, as this allows them to leverage hardware-level prefetching. The key word here is data locality.

A negative example of that, alas, is our old and trusty linked list! Traversing it is a real pointer-chasing feast, because all of the nodes are allocated dynamically and can be placed anywhere in memory. However, we could remedy it by using the already-mentioned techniques of preallocating and custom memory allocators. In that case, all of the nodes of a list would be located in adjacent elements of the preallocated buffer, making it more cache friendly again. Of course, we could just use an array from the start, but this is an example of how to improve data locality in similar cases.

The second classic mistake would be placing two often used integers, x and y, well apart from each other in a data structure, so that when we want to use both, we need to needlessly load two cache lines instead of one. What's used together should stay together; don't break your data structures on a cache line boundary!
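The effect of member placement can be checked at compile time. In this sketch, the 64-byte cache line is an assumption that holds for current Intel processors, and the types Scattered and Compact as well as the helper sameCacheLine are invented for the illustration.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// x and y are used together but are separated by rarely used data,
// so they always end up in different cache lines.
struct Scattered {
    int x;
    char rarelyUsed[128];
    int y;
};

// Hot members first, cold data behind them; the alignment makes sure
// that the object starts at a cache line boundary.
struct alignas(kCacheLine) Compact {
    int x;
    int y;
    char rarelyUsed[128];
};

// True if two member offsets fall into the same cache line, given that
// the object itself starts at a cache line boundary.
constexpr bool sameCacheLine(std::size_t offsetA, std::size_t offsetB) {
    return offsetA / kCacheLine == offsetB / kCacheLine;
}

static_assert(!sameCacheLine(offsetof(Scattered, x), offsetof(Scattered, y)),
              "x and y need two cache line loads");
static_assert(sameCacheLine(offsetof(Compact, x), offsetof(Compact, y)),
              "x and y are fetched with one cache line load");
```

The price of alignas is some padding: sizeof(Compact) is rounded up to a multiple of 64 bytes.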

A further widely known optimization is the replacement of an array of structures by a structure of arrays. This will be beneficial in the case of loading data for SIMD instructions and other techniques where we read data in parallel.
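The difference between the two layouts is easiest to see side by side. The types ParticleAoS and ParticlesSoA and the helper functions are hypothetical names chosen for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: x, y, and z of one particle are adjacent in memory.
struct ParticleAoS {
    float x, y, z;
};

// Structure of arrays: all x values are adjacent, then all y values, etc.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Reads only every third float; the y and z values are loaded into the
// cache along with x, but never used.
float sumX(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles)
        sum += p.x;
    return sum;
}

// Reads one contiguous array; every loaded cache line is fully used and
// the loop is easy for the compiler to vectorize.
float sumX(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float value : particles.x)
        sum += value;
    return sum;
}

// Converts one layout into the other.
ParticlesSoA toSoA(const std::vector<ParticleAoS>& particles) {
    ParticlesSoA result;
    result.x.reserve(particles.size());
    result.y.reserve(particles.size());
    result.z.reserve(particles.size());
    for (const ParticleAoS& p : particles) {
        result.x.push_back(p.x);
        result.y.push_back(p.y);
        result.z.push_back(p.z);
    }
    return result;
}
```

Which layout wins depends on the access pattern: if all members of one element are always used together, the array of structures can be the better choice.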

Another often-cited optimization trick is improving the performance of matrix multiplication by changing the data-access pattern. Instead of a straightforward triple i, j, k loop, we change it to the i, k, j loop, so that, with the row-major storage used by C and C++ arrays, the innermost loop walks along rows:

for (size_t i = 0; i < N; ++i)
  for (size_t k = 0; k < P; ++k)
    for (size_t j = 0; j < M; ++j)
      res[i][j] += m1[i][k] * m2[k][j];

Here, just switching the order of traversal to be more cache-friendly and reading row elements of both matrices in consecutive manner will dramatically improve performance (in some measurements, up to 94%)!
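To try this out, the following self-contained sketch compares the two loop orders on a matrix stored row by row in one contiguous buffer, which is the same layout that C++ arrays use. The Matrix type and the function names are made up for this example. Under that row-major assumption, the order that keeps the innermost accesses consecutive is i, k, j; both functions compute the same product and differ only in their memory-access pattern.

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix stored in one contiguous buffer.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Textbook order: the innermost loop walks down a column of b, touching
// a different cache line on (nearly) every step for big matrices.
Matrix multiplyIJK(const Matrix& a, const Matrix& b) {
    Matrix res(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < b.cols; ++j)
            for (std::size_t k = 0; k < a.cols; ++k)
                res.at(i, j) += a.at(i, k) * b.at(k, j);
    return res;
}

// Reordered: the innermost loop walks along a row of b and a row of res,
// so both are read consecutively.
Matrix multiplyIKJ(const Matrix& a, const Matrix& b) {
    Matrix res(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                res.at(i, j) += a.at(i, k) * b.at(k, j);
    return res;
}
```

How big the difference is depends on the matrix sizes, the cache sizes, and the compiler's optimizations, so it should be measured on the target machine.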

What we learn here is that the layout and structure of our data, as well as its access patterns, can have a big impact on our program's performance on modern processors!

The second type of cache, that is, the instruction cache, needs some love as well. Jumping back and forth through code is just as bad as it is in the case of your data. Code locality is thus the next important notion. One possibility to influence it is to place your normal case code before the error handling, like this:

if (ok) {
  do_work();
} else {
  printf("ERROR!");
  return;
}

This avoids a jump in the normal case, hence improving code locality. We'll discuss that in more detail in Chapter 3, Deep Dive into C++ and Performance, when we look at the role of the compiler in program optimization.

Don't confuse your branch predictor

To avoid pipeline stalls, preferably, there shouldn't be any branches at all. Unfortunately, that can't be done in programming, but the next best thing we can do is to minimize branching. A classic example of branch avoidance is replacing simple conditional expressions with bit manipulations, like this:

const int maxValue = 16; // must be a power of two for the bit trick
++x;
if (x >= maxValue) x = 0;
// is equivalent (for 0 <= x < maxValue) to:
x = (x + 1) & (maxValue - 1);

I think we can agree, these are ugly, low-level tricks, best left to the micro-optimization stage or to the compiler. Another technique of that type is loop unrolling.
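For the record, the wrap-around increment can be written in both forms and compared. The function names are made up for this sketch; the masking variant relies on the limit being a power of two, which the static_assert checks, and on x staying in the range from 0 up to the limit.

```cpp
constexpr int kMaxValue = 16;   // must be a power of two for the masking trick
static_assert((kMaxValue & (kMaxValue - 1)) == 0, "power of two required");

// Increment with wrap-around, written with a conditional branch.
int nextWithBranch(int x) {
    ++x;
    if (x >= kMaxValue)
        x = 0;
    return x;
}

// The same operation without a branch: the mask clears the overflow bit.
int nextBranchless(int x) {
    return (x + 1) & (kMaxValue - 1);
}
```

Modern compilers often perform this kind of transformation on their own, which is one more reason to leave such tricks to the micro-optimization stage.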

A more usable technique could be helping the compiler to generate more branch predictor-friendly code, as in the Linux kernel practice of using the likely() and unlikely() macros, which internally use the __builtin_expect() built-in of the GNU Compiler Collection (GCC). This is supposed to support the branch predictor but, in reality, it allows code reordering by the compiler, which then enables different optimizations. Older architectures with static branch predictors used special instruction prefixes indicating whether the branch is likely to be taken or not. As newer architectures use dynamic predictors, these prefixes are ignored and have no effect.
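Such macros can be defined in user code as well. The sketch below assumes GCC or Clang for __builtin_expect() and falls back to the plain condition on other compilers; the upper-case macro names and the function countValid are made up for this example and aren't the kernel's own definitions.

```cpp
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)     // no hint available, just the condition
#  define UNLIKELY(x) (x)
#endif

// Counts the non-negative entries. Negative entries are treated as the
// rare error case, so the compiler can move their handling out of the
// hot path.
std::size_t countValid(const int* values, std::size_t n, std::size_t* errors) {
    std::size_t valid = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (UNLIKELY(values[i] < 0)) {
            ++*errors;
            continue;
        }
        ++valid;
    }
    return valid;
}
```

The hint never changes the result, only the layout of the generated code. C++20 adds the standard [[likely]] and [[unlikely]] attributes for the same purpose.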

So, the received wisdom is that the branch predictors have meanwhile gotten pretty good, and that we should try to mess with them in the rarest cases only. With the exception of one thing: as a branch predictor's table has a finite (small) size, you shouldn't thrash it! And here we are again: don't use too many branches; keep good code locality!

Parallelizing your application

This is the famous no free lunch conundrum: if we want our program to run faster, it's no longer enough to buy a faster processor; we have to restructure it to be parallel and to use more processor cores! We'll have a closer look at these techniques in Chapter 5, An In-Depth Guide to Concurrency and Multithreading, when we discuss multithreading.

As for vector processing, we have to note that, for SIMD instructions to be performant, the loaded data usually has to be aligned. Apart from that, as of today, any decent modern compiler will try to vectorize the code, so normally, we don't have to bother. However, the compiler will generate code for one specific architecture, but maybe you'd like to run your program on several processors and use the most advanced extended instruction sets available on each? This is possible, and techniques for doing that are known as CPU dispatching.
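A very small form of CPU dispatching can be sketched with built-ins that GCC and Clang provide on x86. Everything here is an assumption for illustration: the function names are invented, the AVX2 variant relies on the compiler's auto-vectorizer rather than on hand-written intrinsics, and on other compilers or architectures the code simply falls back to the portable version.

```cpp
#include <cstddef>

// Portable baseline version.
static void addScalar(const float* a, const float* b, float* out,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same loop, but the compiler may use AVX2 instructions for this one
// function, regardless of the flags used for the rest of the program.
static __attribute__((target("avx2")))
void addAvx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
#endif

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Asks the CPU we're actually running on which variant it can execute.
static AddFn selectAdd() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return addAvx2;
#endif
    return addScalar;
}

void add(const float* a, const float* b, float* out, std::size_t n) {
    static const AddFn impl = selectAdd();  // decided once, on first call
    impl(a, b, out, n);
}
```

The caller only ever sees add(); which implementation runs is decided once at run time, so the same binary works on processors with and without AVX2.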

The second theme related to parallelism to be mentioned here is out-of-order execution. Sometimes, we can encounter the advice to break overly long dependency chains so as to allow reordering, as was shown in a previous diagram. Arguably, this is a low-level technique, but sometimes every nanosecond may count.
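A common way of breaking a dependency chain is to use several independent accumulators. The following sketch uses made-up function names and integer data, so that both variants are guaranteed to return the same result; for integers, an optimizing compiler may well do this transformation itself, whereas for floating-point sums it usually won't by default, because reordering the additions changes the rounding.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition has to wait for the previous one.
long long sumSingleChain(const std::vector<int>& v) {
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// Four independent accumulators: the four additions of one iteration
// don't depend on each other and can be executed in parallel.
long long sumFourChains(const std::vector<int>& v) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)      // leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

As always with such low-level rewrites, only a measurement can tell whether the less readable variant is worth it.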

Another theme is that there could be times when we would like to disable instruction reordering. When could that be? Right: when synchronizing among threads, we have to know exactly in which order which variables were read or written. On the processor level, this can be forced with memory barriers, but this will prevent the possible reordering optimizations. That is the reason synchronization in a multithreaded program is already expensive on the processor level.
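In C++, such ordering constraints are expressed through std::atomic and its memory orders, from which the compiler generates the required barriers. The sketch below uses the made-up names payload, ready, publish, and tryConsume; the two functions are meant to be called from two different threads.

```cpp
#include <atomic>

int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> ready{false};     // flag used for synchronization

// Intended to run in the producing thread.
void publish(int value) {
    payload = value;
    // Release: the write to payload above must not be reordered after
    // this store.
    ready.store(true, std::memory_order_release);
}

// Intended to run in the consuming thread. Returns false if nothing has
// been published yet.
bool tryConsume(int& out) {
    // Acquire: the read of payload below must not be reordered before
    // this load.
    if (!ready.load(std::memory_order_acquire))
        return false;
    out = payload;                  // guaranteed to see the published value
    return true;
}
```

Without the release and acquire pairing, the compiler or the processor would be free to reorder the accesses, and the consumer could see the flag set while still reading the old payload.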