Understanding requirements for latency-sensitive applications

In this section, we will discuss some concepts that are required to build an understanding of what metrics matter for latency-sensitive applications. First, let’s define clearly what latency means and what latency-sensitive applications are.

Latency is defined as the time delay from when a task is started to when the task is finished. By definition, any processing or work will incur some overhead or latency – that is, no system has zero latency unless the system does absolutely no work. The important detail here is that some systems already operate at latencies that are a tiny fraction of a millisecond, and in those systems, the tolerance for even an additional microsecond can be very low.

Low latency applications are applications that execute tasks and respond or return results as quickly as possible. The point here is that reaction latency is an important criterion for such applications where higher latencies can degrade performance or even render an application completely useless. On the other hand, when such applications perform with the low latencies that are expected of them, they can beat the competition, run at maximum speed, achieve maximum throughput, or increase productivity and improve the user experience – depending on the application and business.

Low latency can be thought of as both a quantitative and a qualitative term. The quantitative aspect is pretty obvious, but the qualitative aspect might not necessarily be. Depending on the context, architects and developers might be willing to accept higher latencies in some cases but be unwilling to accept an extra microsecond in others. For instance, if a user refreshes a web page or waits for a video to load, a few seconds of latency is quite acceptable. However, once the video loads and starts playing, it can no longer incur a few seconds of latency to render or display without negatively impacting the user experience. An extreme example is high-speed financial trading systems, where a few extra microseconds can make the difference between a profitable firm and a firm that cannot compete at all.

In the following subsections, we will present some nomenclature that applies to low latency applications. It is important to understand these terms well so that we can continue our discussion on low latency applications, as we will refer to these concepts frequently. The concepts and terms we will discuss next are used to differentiate between different latency-sensitive applications, the measurement of latencies, and the requirements of these applications.

Understanding latency-sensitive versus latency-critical applications

There is a subtle but important difference between the terms latency-sensitive applications and latency-critical applications. A latency-sensitive application is one whose business impact or profitability improves as its performance latencies are reduced. So, the system might still be functional and possibly profitable at higher performance latencies but can be significantly more profitable if latencies are reduced. Examples of such applications would be operating systems (OSes), web browsers, databases, and so on.

A latency-critical application, on the other hand, is one that fails completely if performance latency is higher than a certain threshold. The point here is that while latency-sensitive applications might only lose part of their profitability at higher latencies, latency-critical applications fail entirely at high enough latencies. Examples of such applications are traffic control systems, financial trading systems, autonomous vehicles, and some medical appliances.

Measuring latency

In this section, we will discuss different methods of measuring latency. The real difference between these methods comes down to what is considered the beginning of the processing task and what is considered its end. Another difference lies in the units of measurement – time is the most common, but in some cases, CPU clock cycles can also be used when it comes down to instruction-level measurements. Let’s look at the different measurements next, but first, we present a diagram of a generic server-client system without diving into the specifics of the use case or transport protocols. This is because measuring latency is generic and applies to many different applications with this kind of server-client setup.

Figure 1.1 – A general server-client system with timestamps between different hops

We present this diagram here because, in the next few subsections, we will define and understand the latencies between the different hops on the round-trip path from the server to the client and back to the server.

Time to first byte

Time to first byte is measured as the time elapsed from when the sender sends the first byte of a request (or response) to the moment when the receiver receives the first byte. This typically (but not necessarily) applies to network links or systems where there are data transfer operations that are latency-sensitive. In Figure 1.1, time to first byte would be the difference between the timestamp at which the first byte leaves the sender and the timestamp at which it arrives at the receiver.

Round-trip time

Round-trip time (RTT) is the sum of the time it takes for a packet to travel from one process to another and the time it takes for the response packet to reach the original process. Again, this typically (but not necessarily) refers to network traffic going back and forth between server and client processes, but it applies to any two communicating processes in general.

RTT, by default, includes the time taken by the server process to read, process, and respond to the request sent by the sender – that is, RTT generally includes server processing times. In the context of electronic trading, the true RTT latency is based on three components:

  • First, the time it takes for information from the exchange to reach the participant
  • Second, the time it takes for the execution of the algorithms to analyze the information and make a decision
  • Finally, the time it takes for the decision to reach the exchange and get processed by the matching engine

We will discuss this more in the last section of this book, Analyzing and improving performance.
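As a minimal sketch of how such a round-trip measurement might be taken in C++, the snippet below timestamps a request/response exchange with std::chrono::steady_clock. The send_request() and wait_for_response() functions here are hypothetical stand-ins for whatever transport the application actually uses.

#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-ins for the application's real transport calls.
void send_request() { /* write the request to a socket in a real system */ }
void wait_for_response() {
  // Simulate network transit plus server processing time.
  std::this_thread::sleep_for(std::chrono::microseconds(250));
}

int main() {
  const auto start = std::chrono::steady_clock::now();  // request about to leave
  send_request();
  wait_for_response();
  const auto end = std::chrono::steady_clock::now();    // response fully received
  // RTT = outbound transit + server processing + return transit.
  const auto rtt_us =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::printf("RTT: %lld us\n", static_cast<long long>(rtt_us));
}

Because both timestamps are taken on the requesting side, the measured interval automatically includes network transit in both directions as well as the server’s processing time.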

Tick-to-trade

Tick-to-trade (TTT) is similar to RTT and is a term most commonly used in electronic trading systems. TTT is defined as the time from when a packet (usually a market data packet) first hits a participant’s infrastructure (trading server) to the time when the participant is done processing the packet and sends a packet out (order request) to the trading exchange. So, TTT includes the time spent by the trading infrastructure to read the packet, process it, calculate trading signals, generate an order request in reaction, and put it on the wire. Putting it on the wire typically means writing something to a network socket. We will revisit this topic and explore it in greater detail in the last section of this book, Analyzing and improving performance. In Figure 1.1, TTT would be the difference between the timestamp at which the market data packet arrives at the trading server and the timestamp at which the order request leaves it.

CPU clock cycles

A CPU clock cycle is basically the smallest increment of time at the processor level – in reality, it is the amount of time between two pulses of the oscillator that drives the CPU. Measuring CPU clock cycles is typically used to measure latency at the instruction level – that is, extremely close to the processor. C++ is both a low-level and a high-level language; it lets you get as close to the hardware as needed and also provides higher-level abstractions such as classes, templates, and so on. But generally, C++ developers do not spend a lot of time dealing with extremely low-level or assembly code. This means that the compiled machine code might not be exactly what a C++ developer expects. Additionally, depending on the compiler version, the processor architecture, and so on, there may be even more sources of differences. So, for extremely performance-sensitive low latency code, it is not uncommon for engineers to measure how many instructions are executed and how many CPU clock cycles are required to do so. This is typically the deepest level of optimization possible, alongside kernel-level optimizations.
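As a minimal sketch of instruction-level measurement, the snippet below reads the x86 time-stamp counter via the __rdtsc() intrinsic before and after a small piece of work. It assumes an x86-64 target and a compiler that provides <x86intrin.h>, and it deliberately ignores details such as serializing fences, frequency scaling, and core migration that a production-grade measurement would need to account for.

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc(); assumes an x86-64 target

// The piece of work whose cost we want to measure in clock cycles.
uint64_t work() {
  uint64_t sum = 0;
  for (int i = 0; i < 1000; ++i) sum += static_cast<uint64_t>(i) * i;
  return sum;
}

int main() {
  const uint64_t start = __rdtsc();   // read the time-stamp counter before the work
  volatile uint64_t result = work();  // volatile keeps the call from being optimized away
  const uint64_t end = __rdtsc();     // read it again after the work
  std::printf("result=%llu, elapsed cycles=%llu\n",
              static_cast<unsigned long long>(result),
              static_cast<unsigned long long>(end - start));
}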

Now that we have seen some different methods of measuring latencies in different applications, in the next section, we will look at some latency summary metrics and how each one of them can be important under different scenarios.

Differentiating between latency metrics

The relative importance of a specific latency metric over the other depends on the application and the business itself. As an example, a latency-critical application such as an autonomous vehicle software system cares about peak latency much more than the mean latency. Low latency electronic trading systems typically care more about mean latency and smaller latency variance than they do about peak latency. Video streaming and playback applications might generally prioritize high throughput over lower latency variance due to the nature of the application and the consumers.

Throughput versus latency

Before we look at the metrics themselves, we first need to clearly understand the difference between two terms – throughput and latency – which are closely related, often used interchangeably, but should not be. Throughput is defined as how much work gets done in a certain period of time, while latency is how quickly a single task is completed. To improve throughput, the usual approach is to introduce parallelism and add additional computing, memory, and networking resources. Note that with this approach, each individual task might take longer than it would in a latency-optimized setup, but the parallelism means more tasks are completed overall in the same amount of time. Latency, on the other hand, is measured for each individual task from beginning to finish, even if fewer tasks are executed overall.
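To make the distinction concrete, here is a small back-of-the-envelope sketch; the task duration and worker count are made-up numbers purely for illustration.

#include <cstdio>

int main() {
  // Illustrative numbers only: each task takes 10 ms of processing.
  const double task_latency_ms = 10.0;
  const int parallel_workers = 4;
  // Per-task latency is unchanged by parallelism, but throughput scales with the workers.
  const double throughput_per_sec = parallel_workers * (1000.0 / task_latency_ms);
  std::printf("latency per task: %.1f ms, throughput: %.0f tasks/s\n",
              task_latency_ms, throughput_per_sec);
}

Here, adding workers raises the throughput to 400 tasks per second, but every individual task still takes 10 milliseconds from beginning to finish.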

Mean latency

Mean latency is basically the expected average response time of a system. It is simply the average of all the latency measurement observations. This metric includes large outliers, so it can be a noisy metric for systems that experience a large range of performance latencies.

Median latency

Median latency is typically a better metric for the expected response time of a system. Since it is the median of the latency measurement observations, it excludes the impact of large outliers. Due to this, it is sometimes preferred over the mean latency metric.

Peak latency

Peak latency is an important metric for systems where a single large outlier in performance can have a devastating impact on the system. Large values of peak latency can also significantly influence the mean latency metric of the system.

Latency variance

For systems that require a latency profile that is as deterministic as possible, the actual variance of the performance latency is an important metric. This is typically important for systems whose expected latencies need to be highly predictable. For systems with low latency variance, the mean, median, and peak latencies are all expected to be quite close to each other.
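To make these summary metrics concrete, the following sketch computes the mean, median, peak, and variance of a set of latency observations; the sample values are arbitrary placeholders, with one deliberate outlier.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  // Arbitrary example latency observations, in microseconds.
  std::vector<double> samples = {12.0, 11.5, 13.2, 11.8, 95.0, 12.1, 11.9};

  const double mean =
      std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();

  std::sort(samples.begin(), samples.end());
  const double median = samples[samples.size() / 2];  // middle observation (odd count)
  const double peak = samples.back();                 // worst observed latency

  double variance = 0.0;
  for (const double s : samples) variance += (s - mean) * (s - mean);
  variance /= samples.size();

  std::printf("mean=%.2f median=%.2f peak=%.2f variance=%.2f (us)\n",
              mean, median, peak, variance);
}

Running this, the single 95-microsecond outlier pulls the mean well above the median and dominates both the peak and the variance, which illustrates why the median is often preferred as an estimate of typical response time.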

Requirements of latency-sensitive applications

In this section, we will formally describe the behavior of latency-sensitive applications and the performance profile that these applications are expected to adhere to. Obviously, latency-sensitive applications need low latency performance, but here we will try to explore minor subtleties in the term low latency and discuss some different ways of looking at it.

Correctness and robustness

When we think of latency-sensitive applications, it is often the case that we think low latency is the single most important aspect of such applications. But in reality, a huge requirement of such applications is correctness, along with very high levels of robustness and fault tolerance. Intuitively, this should make complete sense; these applications require very low latency to be successful, which in turn means they also have very high throughput and need to process huge amounts of input and produce a large number of outputs. Hence, the system needs to achieve very close to 100% correctness and be very robust for the application to be successful in its business area. Additionally, the correctness and robustness requirements need to be maintained as the application grows and changes during its lifetime.

Low latencies on average

This is the most obvious requirement when we think about latency-sensitive applications. The expected reaction or processing latency needs to be as low as possible for the application or business overall to succeed. Here, we care about the mean and median performance latency and need them to be as low as possible. By design, this means the system cannot have too many outliers or very high peaks in performance latency.

Capped peak latency

We use the term capped peak latency to refer to the requirement that there needs to be a well-defined upper threshold for the maximum possible latency the application can ever encounter. This behavior is important for all latency-sensitive applications, but most important for latency-critical applications. Even in the general case, an application that exhibits extremely high latency in a handful of cases will typically see its overall performance destroyed. What this really means is that the application needs to handle any input, scenario, or sequence of events and do so with acceptably low latency. Of course, the latency to handle a very rare and specific scenario can be much higher than that of the most likely case, but the point here is that it cannot be unbounded or unacceptably high.

Predictable latency – low latency variance

Some applications prefer the expected performance latency to be predictable, even if that means sacrificing a little latency – that is, accepting an average latency slightly higher than it could otherwise be. What this really means is that such applications will make sure that the expected performance latency for all kinds of different inputs or events has as little variance as possible. It is impossible to achieve zero latency variance, but choices can be made in data structures, algorithms, code implementation, and setup to minimize it as much as possible.
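As one concrete (and deliberately simple) example of such a choice, pre-allocating memory up front removes unpredictable allocation pauses from the hot path. The sketch below just reserves a vector’s capacity at startup; the Order type and the capacity figure are illustrative assumptions, not details of any particular trading system.

#include <vector>

struct Order { long id; double price; int quantity; };

int main() {
  std::vector<Order> pending_orders;
  // Reserve worst-case capacity once, at startup, so that appending orders on the
  // hot path never triggers a reallocation – a common source of latency spikes.
  pending_orders.reserve(100000);

  // Hot path: push_back now has a predictable, allocation-free cost.
  pending_orders.push_back({1, 101.25, 50});
  return 0;
}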

High throughput

As mentioned before, low latency and high throughput are related but not identical goals. For that reason, applications that need the highest possible throughput might make different design and implementation choices to maximize it. The point is that maximizing throughput might come at the cost of sacrificing average performance latencies or increasing peak latencies.

In this section, we introduced the concepts that apply to low latency application performance and the business impact of those metrics. We will need these concepts in the rest of the book when we refer to the performance of the applications we build. Next, we will move the conversation forward and explore the programming languages available for low latency application development. We will discuss the characteristics of the languages that support low latency applications and understand why C++ has risen to the top of the list when it comes to developing and improving latency-sensitive applications.
