Clojure is a functional programming language that brings great power and simplicity to the user. Clojure is also dynamically typed and has very good performance characteristics. Naturally, every activity performed in a computer has an associated cost. What constitutes acceptable performance varies from one use case and workload to another. In today's world, performance is the determining factor for several kinds of applications. We will discuss Clojure (which runs on the Java Virtual Machine) and its runtime environment in the context of performance, which is the goal of this book.
The performance of Clojure applications depends on various factors. For a given application, understanding its use cases, design and implementation, algorithms, resource requirements and alignment with the hardware, and underlying software capabilities are essential. In this chapter, we will study the basics of performance analysis which includes the following:
A whirlwind tour of how the application stack impacts performance
Classifying performance anticipations by use cases types
Outlining the structured approach to analyze performance
A glossary of terms commonly used to discuss performance aspects
Performance numbers every programmer should know
Performance requirements and priority vary across different kinds of use cases. We need to determine what constitutes acceptable performance for various kinds of use cases. Hence, we classify them to identify their performance model. When it comes to details, there is no sure fire performance recipe for any kind of use case, but it certainly helps to study their general nature. Note that in real life, the use cases listed in this section may overlap each other.
The performance of user-facing applications is strongly linked to the user's anticipation. The difference of a good number of milliseconds may not be perceptible by the user, but at the same time, a wait of more than a few seconds may not be taken kindly. One important element to normalize the anticipation is to engage the user by providing duration-based feedback. A good idea to deal with such a scenario would be to start the task asynchronously in the background and poll it from the UI layer to generate duration-based feedback for the user. Another way could be to incrementally render the results to the user to even out the anticipation.
Anticipation is not the only factor in user-facing performance. Common techniques such as staging or pre-computation of data and other general optimization techniques can go a long way to improve the user experience with respect to performance. Bear in mind that all kinds of user-facing interfaces fall into this use case category: web, mobile web, GUI, command-line, touch, voice-operated, and gestures.
Non-trivial compute-intensive tasks demand a proportional amount of computational resources. All of the CPU, cache, memory, efficiency, and parallelizability of the computation algorithms would be involved in determining the performance. When the computation is combined with distribution over a network, or when reading from / staging to disk, I/O bound factors come into play. This class of workloads can be further subclassified into more specific use cases.
A CPU bound computation is limited by the CPU cycles spent on executing it. Processing arithmetic in a loop, small matrix multiplication, determining whether a number is Mersenne Prime, and so on would be considered CPU bound jobs. If the algorithm complexity is linked to N, such as O(N) and O(N2), then performance depends on how big N is and how many CPU cycles each step takes. For parallelizable algorithms, performance of such tasks may be enhanced by assigning multiple CPU cores to the task. On virtual hardware, performance may be impacted if CPU cycles are available in bursts.
A memory bound task is limited by the availability and bandwidth of a computer memory; examples include large text processing, list processing, and so on. Note that higher CPU resources cannot help when memory is in the bottleneck and vice versa. Lack of availability of memory may force you to process smaller chunks of data at a time, even if you have enough CPU resources at your disposal. If the maximum speed of your memory is X and your algorithm on single CPU-core accesses memory at a speed of X/3, the multicore performance of your algorithm cannot exceed 3 times the current performance, no matter how many CPU cores you assign to it. Memory architecture, for example SMP and NUMA, contributes to the memory bandwidth in multicore computers. Performance with respect to memory is also subject to page faults.
A task is cache bound when its speed is constrained by the amount of cache available. When a task retrieves values from a small number of repeated memory locations, for example small matrix multiplication, the values may be cached and fetched from there.
Note
Typically, CPUs have multiple layers of cache, and the performance will be at its best when the processed data fits in the cache. Processing will still happen, albeit slower, when the data does not fit into the cache . These will be covered in greater details in Chapter 4, Host Performance.
It is possible to make the most of the cache using cache-oblivious algorithms. A higher number of concurrent cache / memory bound threads than CPU cores is likely to flush the instruction pipeline, as well as the cache, at the time of a context switch.
An I/O bound task would go faster if the I/O subsystem it depends on goes faster. Disk or storage as well as network are the most commonly used I/O subsystems in data processing. Other I/O devices are serial ports, a USB-connected card readers, and so on. An I/O bound task may consume very few CPU cycles. Depending on the speed of the device, connection pooling, data compression, asynchronous handling, caching, and so on may help in performance. One notable aspect of I/O bound tasks is that the performance is usually dependent on the time spent waiting for connection (or disk seek) and the amount of serialization we do, but hardly on the other resources.
In practice, many data processing workloads are usually a combination of CPU bound, memory bound, cache bound, and I/O bound tasks. The performance of such mixed workloads effectively depends on the even distribution of CPU, cache, memory, and I/O resources over the duration of the operation. While all system resources are finite, some I/O resources may be particularly limited in bandwidth and latency. A bottleneck situation arises only when one resource gets too busy to make way for another.
OLTP systems process business transactions on demand. It could work as a backend system for a user-facing ATM machine, a point-of-sale terminal, a network-connected ticket counter, an ERP system, and so on. OLTP systems are characterized by low latency, availability, and data integrity. OLTP systems run day-to-day business transactions. Any interruption or outage is likely to have a direct and immediate impact on the sales or service. Such systems are expected to be designed for resiliency rather than delayed recovery from failures. When the performance objective is unspecified, you may want to consider graceful degradation as a strategy.
It is a common mistake to ask OLTP systems to answer analytical queries, something that they are not optimized for. It is desirable of an informed programmer to know the capability of the system and suggest design changes as per the requirements.
OLAP systems are designed to answer analytical queries in a short time. They typically get data from OLTP operations and their data model is optimized for querying. OLAP systems basically provide for consolidation (roll-up), drill-down, and slicing and dicing of data for analytical purposes. They often use specialized data stores that can optimize ad-hoc analytical queries on the fly. It is important for such databases to provide pivot-table-like capability. Often, an OLAP cube is used to get faster access to analytical data.
Feeding OLTP data into OLAP systems may entail workflows and multistage batch processing. The performance concern of such systems is to efficiently deal with large quantities of data while also dealing with inevitable failures and recovery.
Batch processing is the automated execution of predefined jobs. These are typically bulk jobs and are executed during off-peak hours. Batch processing may involve one or more stages of job processing. Often, batch processing is clubbed with workflow automation, where some workflow steps are executed offline. Many of the batch processing tasks work on staging and preparing data for the next stage of processing to pick up.
Batch jobs are generally optimized for the utmost utilization of computing resources. Since there is little to moderate demand to lower latencies of particular subtasks, these systems tend to optimize for throughput. A lot of batch jobs involve large I/O processing, and they are often distributed over a cluster. Due to distribution, data locality is preferred when processing the jobs; that is, data and processing should be local in order to avoid network latency in reading/writing data.
In practice, the performance of non-trivial applications is rarely a function of coincidence or prediction. For many projects, performance is not an option but rather compulsory, which is why this is even more important today. Capacity planning, determining performance objectives, performance modeling, measurement, and monitoring are crucial to achieving performance..
Tuning a poorly-designed system to perform as well as a system that is a well-designed system from the ground up is significantly hard, if not practically impossible. In order to meet a performance goal, performance objectives should be known before the application is designed. Performance objectives are stated in terms of latency, throughput, resource utilization, and workload. These terms are discussed in the Performance vocabulary section in this chapter.
The resource cost can be identified in terms of application scenarios, such as browsing of products, adding products to the shopping cart, and checkout. Creating workload profiles that represent users performing various operations is usually helpful.
Performance modeling is a reality check of whether the application design would support the performance objectives. It includes performance objectives, application scenarios, constraints, measurements (benchmark result), workload objectives, and, if available, the performance baseline. It is not a replacement of measurement and load testing, rather, the model is validated using these. The performance model may include performance test cases to assert the performance characteristics of the application scenarios.
Deploying an application to production almost always needs some form of capacity planning. It has to take into account the performance objectives for today and the foreseeable future. It requires an idea of application architecture and an understanding of how the external factors translate into internal workload. It also requires informed expectations about the responsiveness and the level of service to be provided by the system. Often, capacity planning is done early in a project to mitigate the risk of provisioning delays.
There are several technical terms that are heavily used in performance engineering. It is important to understand them as they form the cornerstone of performance related discussions. Collectively, these terms form a performance vocabulary. Performance is usually measured in terms of several parameters where every parameter has roles to play; such parameters are part of the vocabulary.
Latency is the time taken by an individual unit of work to complete a task. It does not imply successful completion of a task. Latency is not collective; it is linked to a particular task. If two similar jobs, j1 and j2, took 3ms and 5ms respectively, their latencies would be treated as such. If j1 and j2 were dissimilar tasks, it would have made no difference. In many cases, average latency of similar jobs is used in performance objectives, measuring, and monitoring results.
Latency is an important indicator of the health of a system. A high performance system often thrives on low latency. Higher than normal latency can be caused due to load or a bottleneck. It helps to measure the latency distribution during a load test. For example, if more than 25 percent of similar jobs under a similar load have significantly higher latency than others, it may be an indicator of a bottleneck scenario worth investigating.
When a task, j1, consists of smaller tasks, say j2, j3, and j4, the latency of j1 is not necessarily the sum of latencies of each of the j2, j3, and j4 tasks. If any of the subtasks of j1 are concurrent with another, the latency of j1 will turn out to be less than the sum of the latencies of j2, j3, and j4. I/O bound tasks are generally more prone to higher latency. In network systems, latency is commonly based on the roundtrip to another host, including latency from source to destination and then back to source.
Throughput is the number of successful tasks or operations performed in a unit of time. The top-level operations performed in a unit of time are usually of a similar kind but with potentially different latencies. So, what does throughput tell us about the system? It is the rate at which the system is performing. When you perform load testing, you can determine the maximum rate at which a particular system can perform. However, this is not a guarantee of conclusive overall maximum rate of performance.
Throughput is one of the factors that determine the scalability of a system. Throughput of a higher level task depends on the capacity to spawn off multiple such tasks in parallel and also depends on average latency of the tasks. Throughput should be measured during load testing and performance monitoring to determine peak measured throughput and maximum sustained throughput. These factors contribute to the scale and performance of a system.
Bandwidth is the raw data rate over a communication channel measured in a certain number of bits per second. This includes not only the payload but all the overhead necessary to carry out the communication. A few examples are Kbits/sec, Mbits/sec, and so on. An uppercase B in KB/sec denotes 'Bytes', as in Kilo Bytes per second. Bandwidth is often compared to throughput. While bandwidth is the raw capacity, throughput for the same system is the successful task completion rate that usually involves a roundtrip. Note that throughput is for an operation which involves latency. To achieve maximum throughput for a given bandwidth, the communication/protocol overhead and operational latency should be minimal.
For storage systems (such as hard disks and solid-state drives), the predominant way to measure performance is IOPS (Input-output per second), which is multiplied by the transfer-size and represented as Bytes-per-second, or further into MB/sec, GB/sec, and so on. IOPS is usually derived for sequential and random workloads for read/write operations.
Mapping the throughput of a system to the bandwidth of another may lead to dealing with the impedance mismatch between the two. For example, an order processing system may transact to the database on disk and post results over the network to an external system.
Depending on the bandwidth of the disk subsystem, the bandwidth of the network, and the execution model of the order, processing the throughput may depend not only on the bandwidth of the disk subsystem and network, but also on how loaded they currently are. Parallelism and pipelining are common ways to increase throughput over a given bandwidth.
Performance baseline, or simply baseline, is the reference point including measurements of well characterized and understood performance parameters for a known configuration. Baseline is used to collect performance measurements for the same parameters which we may benchmark later for another configuration. For example, collecting "throughput distribution over 10 minutes at a load of 50 concurrent threads" is one such performance parameter we can use for baseline and benchmarking. A baseline is recorded together with the hardware, network, OS, and system configuration.
Performance benchmark, or simply benchmark, is the recording of performance parameter measurements under various test conditions. A benchmark can be composed as a performance test suite. A benchmark may collect a small to large amount of data, and may take a varying duration depending on use cases, scenarios, and environment characteristics.
Baseline is a result of a benchmark that was conducted at one point of time; however, benchmark is independent of baseline.
Performance profiling, or simply profiling, is the analysis of the execution of a program at its runtime. A program can perform poorly for a variety of reasons. A profiler can analyze and find out the execution time of various parts of the program. It is possible to interleave statements in a program manually to print execution time of blocks of code, but this gets very cumbersome as you try to refine the code iteratively. A profiler is of great assistance to the developer.
Going by how profilers work, they are of three major kinds: instrumenting, sampling, and event-based. The event-based profilers work only for selected language platforms, and they provide a good balance between overhead and results; for example, Java supports event-based profiling via the JVMTI interface. Instrumenting profilers modify code at either compile time or runtime to inject performance counters. They are intrusive by nature and add significant performance overhead. However, you can profile regions of code very selectively using instrumenting profilers. Sampling profilers pause the runtime and collect its state at 'sampling intervals'. By collecting enough samples, it gets to know where the program spends most of its time. For example, at a sampling interval of 1ms, the profiler would have collected 1000 samples in a second. A sampling profiler also works for code that executes faster than the sampling interval, as the frequency of pausing and sampling is proportional to the overall execution time of any code.
Profiling is not meant only for measuring execution time. Capable profilers can provide a view of memory analysis, garbage collection, threads, and so on. A combination of such tools is helpful to find memory leaks, garbage collection issues, and so on.
Simply put, optimization is minimizing a program's resource consumption after performance analysis. The symptoms of a poorly performing program are observed in terms of high latency, low throughput, unresponsiveness, instability, high memory consumption, and high CPU consumption. During performance analysis, you may profile the program in order to identify bottlenecks and tune the performance incrementally by observing performance parameters.
Better and suitable algorithms are an all-round good way to optimize code. CPU bound code can be optimized with computationally cheaper operations. Cache bound code can try using less memory lookups to keep a good hit ratio. Memory bound code can use adaptive memory usage and conservative data representation to store in memory for optimization. I/O bound code can attempt to serialize as little data as possible, and can batch operations to make the operation less chatty for better performance. Parallelism and distribution are other overall good ways to increase performance.
Most of the computer hardware and operating systems we use today provide concurrency. On the x86 architecture, hardware support for concurrency can be traced as far back as the 80286 chip. Concurrency is the simultaneous execution of more than one process on the same computer. In older processors, concurrency was implemented using a context switch by the operating system kernel. When concurrent parts are executed in parallel by the hardware instead of merely switching context, it is called parallelism. Parallelism is the property of the hardware, though the software stack must support it in order for you to leverage it in your programs. You must write your program in a concurrent way to exploit the parallelism features of the hardware.
While concurrency is a natural way to exploit hardware parallelism and speed up operations, it is worth bearing in mind that having significantly higher concurrency than the parallelism your hardware can support is likely to schedule tasks to varying processor cores, thereby lowering branch prediction and increasing cache misses.
Low level processes/threads, mutexes, semaphores, locking, shared memory, and inter-process/thread communication are used for concurrency. The JVM has excellent support for these concurrency primitives and inter-thread communication. Clojure builds upon the JVM features to provide both low and higher level concurrency primitives that we will discuss in the concurrency chapter.
Resource utilization is the measure of the server, network, and storage resources consumed by an application. Resources include CPU, memory, disk I/O, network I/O, and so on. The application can be analyzed in terms of CPU bound, memory bound, cache bound, and I/O bound tasks. Resource utilization can be derived by means of benchmarking by measuring the utilization at a given throughput.
Workload is the quantification of how much work there is in hand to be carried out by the application. It is measured in total numbers of users, concurrent active users, transaction volume, and data volume. Processing a workload should take into account the load conditions, such as how much data the database currently holds, how filled up are the message queues, and the backlog of I/O tasks after which the new load will be processed.
Hardware and software have progressed over the years. Latencies for various operations put things into perspective. The latency numbers for 2013 are as shown in the following table. (Reproduced with the permission of Aurojit Panda and Colin Scott of Berkeley University: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
Operation |
Time taken as of 2013 |
---|---|
L1 cache reference |
1 ns (nano second) |
Branch mis-predict |
3 ns |
L2 cache reference |
4 ns |
Mutex lock/unlock |
17 ns |
Compress 1KB with Zippy (http://code.google.com/p/snappy/) |
2 μs (1000 ns = 1 μs : micro second) |
Send 2000 bytes over commodity network |
500 ns (that is, 0.5 μs) |
SSD random read |
16 μs |
Roundtrip in same datacenter |
500 μs |
Read 1,000,000 bytes sequentially from SSD |
200 μs |
Disk seek |
4 ms (1000 μs = 1 ms) |
Read 1,000,000 bytes sequentially from disk |
2 ms |
Packet roundtrip CA to Netherlands |
150 ms |
We learned about the basics of what it is like to think deeper about performance. We saw the common performance vocabulary and also saw the use cases by which performance aspects might vary. We concluded by looking at the performance numbers of different hardware components, which is how the performance benefits reach our applications. In the next chapter, we will dive into performance aspects of various Clojure abstractions.