Scala High Performance Programming

By Vincent Theron, Michael Diamant

About this book

Scala is a statically and strongly typed language that blends functional and object-oriented paradigms. It has experienced growing popularity as an appealing and pragmatic choice to write production-ready software in the functional paradigm. Scala and the functional programming paradigm enable you to solve problems with less code and lower maintenance costs than the alternatives. However, these gains can come at the cost of performance if you are not careful.

Scala High Performance Programming arms you with the knowledge you need to create performant Scala applications. Starting with the basics of understanding how to define performance, we explore Scala's language features and functional programming techniques while keeping a close eye on performance throughout all the topics.

We introduce you as the newest software engineer at a fictitious financial trading company, named MV Trading. As you learn new techniques and approaches to reduce latency and improve throughput, you'll apply them to MV Trading’s business problems. By the end of the book, you will be well prepared to write production-ready, performant Scala software using the functional paradigm to solve real-world problems.

Publication date: May 2016
Publisher: Packt
Pages: 274
ISBN: 9781786466044

 

Chapter 1.  The Road to Performance

We welcome you on a journey to learning pragmatic ways to use the Scala programming language and the functional programming paradigm to write performant and efficient software. Functional programming concepts, such as pure and higher-order functions, referential transparency, and immutability, are desirable engineering qualities. They allow us to write composable elements, maintainable software, and expressive and easy-to-reason-about code. However, in spite of all its benefits, functional programming is too often wrongly associated with degraded performance and inefficient code. It is our goal to convince you otherwise! This book explores how to take advantage of functional programming, the features of the Scala language, the Scala standard library, and the Scala ecosystem to write performant software.

Scala is a statically and strongly typed language that tries to elegantly blend both functional and object-oriented paradigms. It has experienced growing popularity in the past few years as both an appealing and pragmatic choice to write production-ready software in the functional paradigm. Scala code compiles to bytecode and runs on the Java Virtual Machine (JVM), which has a widely-understood runtime, is configurable, and provides excellent tooling to introspect and debug correctness and performance issues. An added bonus is Scala's great interoperability with Java, which allows you to use all the existing Java libraries. While the Scala compiler and the JVM receive constant improvements and already generate well-optimized bytecode, the onus remains on you, the developer, to achieve your performance goals.

Before diving into the Scala and JVM specifics, let's first develop an intuition for the holy grail that we seek: performance. In this first chapter, we will cover performance basics that are agnostic to the programming language. We will present and explain the terms and concepts that are used throughout this book.

In particular, we will look at the following topics:

  • Defining performance

  • Summarizing performance

  • Collecting measurements

We will also introduce our case study, a fictitious application based on real-world problems that will help us illustrate techniques and patterns that are presented later.

 

Defining performance


A performance vocabulary arms you with a way to qualify the type of issues at hand and often helps guide you towards a resolution. Particularly when time is of the essence, a strong intuition and a disciplined strategy are assets for resolving performance problems.

Let's begin by forming a common understanding of the term performance. This term is used to qualitatively or quantitatively evaluate the ability to accomplish a goal. The goal at hand can vary significantly. However, for a professional software developer, the goal ultimately links to a business goal. It is paramount to work with your business team to characterize business domain performance sensitivities. For a consumer-facing shopping website, agreeing upon the number of concurrent users and acceptable request response times is relevant. In a financial trading company, trade latency might be the most important metric because speed is a competitive advantage. It is also relevant to keep in mind nonfunctional requirements, such as "trade executions can never be lost," arising from industry regulations and external audits. These domain constraints will also impact your software's performance characteristics. Building a clear and agreed upon picture of the domain that you operate in is a crucial first step. If you cannot define these constraints, an acceptable solution cannot be delivered.

Note

Gathering requirements is an involved topic outside the scope of this book. If you are interested in delving deeper into this topic, we recommend two books by Gojko Adzic: Impact Mapping: Making a big impact with software products and projects (http://www.amazon.com/Impact-Mapping-software-products-projects-ebook/dp/B009KWDKVA) and Fifty Quick Ideas to Improve Your User Stories (http://www.amazon.com/Fifty-Quick-Ideas-Improve-Stories-ebook/dp/B00OGT2U7M).

Performant software

Designing performant software is one of our goals as software engineers. Thinking about this goal leads to a commonly asked question, "What performance is good enough?" We use the term performant to characterize performance that satisfies the minimally-accepted threshold for "good enough." We aim to meet and, if possible, exceed the minimum thresholds for acceptable performance. Consider this: without an agreed upon set of criteria for acceptable performance, it is by definition impossible to write performant software! This statement illustrates the overwhelming importance of defining the desired outcome as a prerequisite to writing performant software.

Note

Take a moment to reflect on the meaning of performant for your domain. Have you had struggles maintaining software that meets your definition of performant? Consider the strategies that you applied to solve performance dilemmas. Which ones were effective and which ones were ineffective? As you progress through the book, keep this in mind so that you can check which techniques can help you meet your definition of performant more effectively.

Hardware resources

In order to define criteria for performant software, we must expand the performance vocabulary. First, become aware of your environment's resources. We use the term resource to cover all the infrastructure that your software uses to run. Refer to the following resource checklist, which lists the resources that you should collect prior to engaging in any performance tuning exercise:

  • Hardware type: physical or virtualized

  • CPUs:

    • Number of cores

    • L1, L2, and L3 cache sizes

    • NUMA zones

  • RAM (for example, 16 GB)

  • Network connectivity rating (for example, 1GbE or 10GbE)

  • OS and kernel versions

  • Kernel settings (for example, TCP socket receive buffer size)

  • JVM version

Itemizing the resource checklist forces you to consider the capabilities and limitations of your operating environment.
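
A few of these items can be read directly from the running JVM. The following minimal sketch (not part of the book's code bundle) prints the values visible through the standard Java API; hardware details such as cache sizes and NUMA zones still have to be gathered from the operating system:

```scala
object ResourceChecklist {
  def main(args: Array[String]): Unit = {
    val runtime = Runtime.getRuntime
    // Number of logical cores visible to the JVM
    println(s"Available processors: ${runtime.availableProcessors}")
    // Maximum heap the JVM will attempt to use, in megabytes
    println(s"Max heap: ${runtime.maxMemory / (1024 * 1024)} MB")
    // OS and JVM versions as reported by system properties
    println(s"OS: ${sys.props("os.name")} ${sys.props("os.version")}")
    println(s"JVM: ${sys.props("java.vm.name")} ${sys.props("java.version")}")
  }
}
```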

Note

Excellent resources for kernel optimization include Red Hat Performance Tuning Guide (https://goo.gl/gDS5mY) and presentations and tutorials by Brendan Gregg (http://www.brendangregg.com/linuxperf.html).

Latency and throughput

Latency and throughput define two types of performance, which are often used to establish the criteria for performant software. The illustration of a highway, like the following photo of the German Autobahn, is a great way to develop an intuition of these types of performance:

The Autobahn helps us think about latency and throughput (image: Wikimedia Commons, https://en.wikipedia.org/wiki/Highway#/media/File:Blick_auf_A_2_bei_Rastst%C3%A4tte_Lehrter_See_(2009).jpg, licensed under Creative Commons CC BY-SA 3.0)

Latency describes the amount of time that it takes for an observed process to be completed. Here, the process is a single car driving down one lane of the highway. If the highway is free of congestion, then the car is able to drive down the highway quickly. This is described as a low-latency process. If the highway is congested, the trip time increases, which is characterized as a high-latency or latent process. Performance optimizations that are within your control are also captured by this analogy. You can imagine that reworking an expensive algorithm from polynomial to linear execution time is similar to either improving the quality of the highway or the car's tires to reduce road friction. The reduction in friction allows the car to cross the highway with lower latency. In practice, latency performance objectives are often defined in terms of a maximum tolerable latency for your business domain.

Throughput defines the observed rate at which a process is completed. Using the highway analogy, the number of cars traveling from point A to point B per unit of time is the highway's throughput. For example, if there are three traffic lanes and cars travel in each lane at a uniform rate, then the throughput is: (the number of cars per lane that traveled from point A to point B during the observation period) * 3. Inductive reasoning may suggest that there is a strong negative correlation between throughput and latency. That is, as latency increases, throughput decreases. As it turns out, there are a number of cases where this type of reasoning does not hold true. Keep this in mind as we continue expanding our performance vocabulary to better understand why this happens. In practice, throughput is often defined by the maximum number of transactions per second your software can support. Here, a transaction means a unit of work in your domain (for example, orders processed or trades executed).

Note

Thinking back to the recent performance issues that you faced, how would you characterize them? Did you have a latency or a throughput problem? Did your solution increase throughput while lowering latency?

Bottlenecks

A bottleneck refers to the slowest part of the system. By definition, all systems, including well-tuned ones, have a bottleneck because there is always one processing step that is measured to be the slowest. Note that the latency bottleneck may not be the throughput bottleneck. That is, multiple types of bottleneck can exist at the same time. This is another illustration of why it is important to understand whether you are combating a throughput or a latency performance issue. Use the process of identifying your system's bottlenecks to provide you with a directed focus to attack your performance dilemmas.

From personal experience, we have seen how time is wasted when the operating environment checklist is ignored. Once, while working in the advertising domain on a high-throughput real-time bidding (RTB) platform, we chased a throughput issue for several days without success. After bootstrapping the platform, we began optimizing for a higher request throughput goal because request throughput is a competitive advantage in our industry. Our business team identified an increase from 40,000 requests per second (RPS) to 75,000 RPS as a major milestone. Our tuning efforts consistently yielded about 60,000 RPS. This was a real head-scratcher because the system did not appear to exhaust its resources: CPU utilization was well under 100%, and previous experiments to increase heap space did not yield improvements.

The "aha!" moment came when we realized that the system was deployed within AWS with the default network connectivity configured to 1 Gigabit Ethernet. Each request processed by the system is about 2 KB. We performed some basic arithmetic to identify the theoretical maximum throughput rate: 1 Gigabit per second is equivalent to 125,000 kilobytes per second, and 125,000 kilobytes per second divided by 2 kilobytes per request translates to a theoretical maximum of 62,500 RPS. This arithmetic was confirmed by running a test of our network throughput with a tool named iPerf. Sure enough, we had maxed out our network connectivity!
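
The same back-of-the-envelope calculation can be written out explicitly. The following sketch uses the figures quoted above to compute the request ceiling imposed by a 1GbE link:

```scala
object NetworkCeiling {
  def main(args: Array[String]): Unit = {
    val linkBitsPerSecond = 1000000000L        // 1 Gigabit Ethernet
    val bytesPerSecond = linkBitsPerSecond / 8 // 125,000,000 bytes per second
    val requestSizeBytes = 2 * 1000L           // roughly 2 KB per request
    val maxRps = bytesPerSecond / requestSizeBytes
    println(s"Theoretical maximum: $maxRps requests per second") // prints 62500
  }
}
```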

 

Summarizing performance


We properly defined some of the main concepts around performance, namely latency and throughput, but we still lack a concrete way to quantify our measurements. To continue with our example of cars driving down a highway, we want to find a way to answer the question, "How long should I expect the drive from point A to point B to take?" The first step is to measure our trip time on multiple occasions to collect empirical information.

The following table catalogs our observations. We still need a way to interpret these data points and summarize our measurements to give an answer:

Observed trip    Travel time in minutes
Trip 1           28
Trip 2           37
Trip 3           17
Trip 4           38
Trip 5           18

The problem with averages

A common mistake is to rely on averages to measure the performance of a system. An arithmetic average is fairly easy to calculate: it is the sum of all collected values divided by the number of values. Using the previous sample of data points, we can infer that, on average, we should expect a drive of approximately 28 minutes (the exact average is 27.6 minutes). With this simple example, it is easy to see what makes the average such a poor choice: out of our five observations, only Trip 1 is close to the average, while all the other trips are quite different. The fundamental problem with the average is that it is a lossy summary statistic. Information is lost when moving from a series of observations to the average because it is impossible to retain all the characteristics of the original observations in a single data point.

To illustrate how an average loses information, consider three datasets, each representing the measured latency required to process requests to a web service.

In the first dataset, there are four requests that take between 280 ms and 305 ms to complete. Compare these latencies with those in the second dataset.

The second dataset shows a more volatile mixture of latencies. Would you prefer to deploy the first or the second service into your production environment? To add even more variety into the mix, consider a third dataset in which half of the requests complete quickly while the other half are very slow.

Although each of these datasets has a vastly different distribution, the averages are all the same and equal 292 ms! Imagine having to maintain the web service represented by the third dataset with the goal of ensuring that 75% of clients receive a response in less than 300 ms. Calculating the average will give you the impression that you are meeting your objective, while in reality only half of your clients (the first two requests) actually experience a fast enough response.
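
The datasets themselves are not reproduced here, so the following sketch uses hypothetical latency values chosen only to exhibit the property being described: three very different distributions that share an identical 292 ms average.

```scala
object AveragesHideDistributions {
  // Hypothetical latencies in milliseconds, not the book's original datasets;
  // all three sets average exactly 292 ms
  val dataset1 = Vector(280.0, 290.0, 295.0, 303.0) // tight cluster
  val dataset2 = Vector(75.0, 255.0, 404.0, 434.0)  // volatile mixture
  val dataset3 = Vector(20.0, 30.0, 560.0, 558.0)   // two fast, two very slow

  def average(xs: Vector[Double]): Double = xs.sum / xs.size

  def main(args: Array[String]): Unit = {
    // Prints 292.0 three times, even though only the first service looks healthy
    Seq(dataset1, dataset2, dataset3).foreach(ds => println(average(ds)))
  }
}
```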

Percentiles to the rescue

The key term in the previous discussion is "distribution." Measuring the distribution of a system's performance is the most robust way to ensure that you understand the behavior of the system. If an average is an ineffective way to take the distribution of our measurements into account, then we need a different tool. In the field of statistics, a percentile meets our criteria for interpreting the distribution of observations. A percentile is a measurement indicating the value below which a given percentage of the observations in a group falls. Let's make this definition more concrete with an example. Going back to our web service example, imagine that we observe the following latencies:

Request       Latency in milliseconds
Request 1     10
Request 2     12
Request 3     13
Request 4     13
Request 5     9
Request 6     27
Request 7     12
Request 8     7
Request 9     75
Request 10    80

The 20th percentile is defined as the observed value that represents 20% of all the observations. As there are ten observed values, we want to find the value that represents two observations. In this example, the 20th percentile latency is 9 ms because two values, that is, 20% of the total observations, are less than or equal to 9 ms (9 ms and 7 ms). Contrast this latency with the 90th percentile, the value representing 90% of the observations: 75 ms (nine observations out of ten are less than or equal to 75 ms).

Where the average hides the distribution of our measurements, the percentile provides us with deeper insight and highlights that tail-end observations (the observations near the 100th percentile) experience extreme latencies.
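
For a small sample, a percentile can be computed by sorting the observations and indexing into the sorted sequence. The following sketch uses a simple nearest-rank definition (production systems typically rely on a library such as HdrHistogram) and reproduces the 20th and 90th percentile values from the table above:

```scala
object Percentiles {
  // Latencies in milliseconds from the table above, in request order
  val latencies = Vector(10, 12, 13, 13, 9, 27, 12, 7, 75, 80)

  // Nearest-rank percentile: the smallest observed value such that at least
  // p percent of the observations are less than or equal to it
  def percentile(xs: Vector[Int], p: Double): Int = {
    val sorted = xs.sorted
    val rank = math.ceil((p / 100.0) * sorted.size).toInt
    sorted(math.max(rank - 1, 0))
  }

  def main(args: Array[String]): Unit = {
    println(percentile(latencies, 20)) // prints 9
    println(percentile(latencies, 90)) // prints 75
  }
}
```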

If you remember the beginning of this section, we were trying to answer the question, "How long should I expect the drive from point A to point B to take?" After spending some time exploring the available tools, we realized that the original question is not the one we are actually interested in. A more pragmatic question is, "How long do 90% of the cars take to go from point A to point B?"

 

Collecting measurements


Our performance measurement toolkit is already filled with useful information. We defined a common vocabulary to talk about and explore performance. We also agreed on a pragmatic way to summarize performance. The next step in our journey is to answer the question, "How do I collect the performance measurements that I want to summarize?" This section introduces you to techniques to collect measurements. In the next chapter, we dive deeper and focus on collecting data from Scala code. We will show you how to use various tools and libraries designed to work with the JVM and understand your programs better.

Using benchmarks to measure performance

Benchmarks are a black-box kind of measurement. Benchmarks assess a whole system's performance by submitting various kinds of load as input and measuring latency and throughput as system outputs. As an example, imagine that we are working on a typical shopping cart web application. To benchmark this application, we can write a simple HTTP client to query our service and record the time taken to complete a request. This client can be used to send an increasing number of requests per second and output a summary of the recorded response times.
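
As a rough sketch of such a client, the following snippet issues a fixed number of GET requests against a placeholder URL using the standard java.net API and records each response time; a real benchmark would drive the service with a dedicated load-generation tool and a realistic request mix.

```scala
import java.net.{HttpURLConnection, URL}

object NaiveBenchmark {
  // Hypothetical endpoint and request count; adjust for your own service
  val endpoint = new URL("http://localhost:8080/cart")
  val requestCount = 1000

  // Issues one GET request and returns the observed latency in milliseconds
  def timeOneRequest(): Long = {
    val start = System.nanoTime()
    val connection = endpoint.openConnection().asInstanceOf[HttpURLConnection]
    connection.setRequestMethod("GET")
    connection.getResponseCode // forces the request to complete
    connection.disconnect()
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    val latencies = Vector.fill(requestCount)(timeOneRequest()).sorted
    println(s"median = ${latencies(requestCount / 2)} ms, " +
      s"99th percentile = ${latencies(requestCount * 99 / 100)} ms")
  }
}
```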

Multiple kinds of benchmark exist to answer different questions about your system. You can replay historical production data to make sure that your application is meeting the expected performance goals when handling realistic load. Load and stress test benchmarks identify the breaking points of your application, and they exercise its robustness when receiving exceptionally high load for an extended period of time.

Benchmarks are also a great tool to compare different iterations of the same application and either detect performance regression or confirm improvements. By executing the same benchmark against two versions of your code, you can actually prove that your recent changes yielded better performance.

For all their usefulness, benchmarks do not provide any insight into how each part of the software performs; hence, they are black-box tests. Benchmarks do not help us identify bottlenecks or determine which part of the system should be improved to yield better overall performance. To look into the black box, we turn to profiling.

Profiling to locate bottlenecks

As opposed to benchmarking, profiling is intended to be used to analyze the internal characteristics of your application. A profiler enables white-box testing to help you identify bottlenecks by capturing the execution time and resource consumption of each part of your program. By examining your application at runtime, a profiler provides you with great details about the behavior of your code, including the following:

  • Where CPU cycles are spent

  • How memory is used, and where objects are instantiated and released (or not, if you have a memory leak!)

  • Where IO operations are performed

  • Which threads are running, blocked, or idle

Most profilers instrument the code under observation, either at compile time or runtime, to inject counters and profiling components. This instrumentation imposes a runtime cost that degrades system throughput and latency. For this reason, profilers should not be used to evaluate the expected throughput and latency of a system in production (as a reminder, this is a use case for a benchmark).
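
To get a feel for why this overhead exists, consider the crudest form of instrumentation: wrapping a block of code with a timer and a counter. Conceptually, a profiler injects bookkeeping similar to the following hand-rolled sketch (an illustration of the idea, not how any particular profiler is implemented):

```scala
import java.util.concurrent.atomic.AtomicLong

object NaiveInstrumentation {
  // Accumulated time and invocation count for one instrumented call site
  private val totalNanos = new AtomicLong(0)
  private val invocations = new AtomicLong(0)

  // Runs a block of code while recording its execution time; this extra work
  // is exactly the kind of overhead a profiler introduces at runtime
  def timed[A](block: => A): A = {
    val start = System.nanoTime()
    try block
    finally {
      totalNanos.addAndGet(System.nanoTime() - start)
      invocations.incrementAndGet()
    }
  }

  def main(args: Array[String]): Unit = {
    val result = timed((1 to 1000000).map(_.toLong).sum)
    println(s"result = $result, calls = ${invocations.get}, " +
      s"total = ${totalNanos.get / 1000000} ms")
  }
}
```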

In general, you should always profile your application before deciding to do any performance-driven improvement. You should make sure that the part of the code you are planning to improve actually is a bottleneck.

Pairing benchmarks and profiling

Profilers and benchmarks have different purposes, and they help us answer different questions. A typical workflow to improve performance should take advantage of both these techniques and leverage their strengths to optimize the code improvement process. In practice, this workflow looks like the following:

  1. Run a benchmark against the current version of the code to establish a performance baseline.

  2. Use a profiler to analyze the internal behavior and locate a bottleneck.

  3. Improve the section causing a bottleneck.

  4. Run the same benchmark from step 1 against the new code.

  5. Compare the results from the new benchmark against the baseline benchmark to determine the effectiveness of your changes.

Keep in mind, it is important to run all benchmarking and profiling sessions in the same environment. Consult your resource checklist to ensure that your environment remains constant across tests. Any change in your resources invalidates your test results. Just like a science experiment, you must be careful to change only one part of the experiment at a time.

Note

What roles do benchmarking and profiling play in your development process? Do you always profile your application before deciding on the next part of the code to improve? Does your definition of "done" include benchmarking? Are you able to benchmark and profile your application in an environment as close to production as possible?

 

A case study


Throughout this book, we will provide code examples to illustrate the topics that are covered. To make the techniques described previously as useful as possible in your professional life, we relate our examples to a fictitious financial trading company, named MV Trading. The company name originates from the combination of the first-name initials of your dear authors. Coincidentally, the initials also form the Unix file move command (mv), symbolizing that the company is on the move! Since its inception one year ago, MV Trading has operated successful stock trading strategies for a small pool of clients. Software infrastructure has been rapidly built in the last twelve months to support various arms of the business. MV Trading built software to support real-time trading (that is, buying and selling) on various stock market exchanges, and it also built historical trade execution analysis tools to create better-performing trading algorithms. If you do not have financial domain knowledge, do not worry. With each example, we also define the key parts of the domain.

 

Tooling


We recommend that you install all the necessary tooling up-front so that you can work through these examples without setup time. The installation instructions are brief because detailed installation guides are available on the websites that accompany each required tool. The following software is needed for all upcoming chapters:

  • Oracle JDK 8+, using v1.8 u66 at the time of writing

  • sbt v0.13+, using v0.13.11 at the time of writing, which is available at http://www.scala-sbt.org/
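
Once both tools are installed, a minimal sbt project is enough to compile and run the short snippets in this chapter. The following build.sbt is a sketch; the project name and Scala version are placeholders rather than the book's actual build definition:

```scala
// build.sbt -- placeholder project definition, not the book's code bundle
name := "performance-sandbox"

version := "0.1.0"

scalaVersion := "2.11.8"
```

With this file in place, a snippet saved under src/main/scala can be compiled and executed with sbt run.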

Tip

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look at it. The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-High-Performance-Programming. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

 

Summary


In this chapter, we focused on understanding how to talk about performance. We built a vocabulary to discuss performance, determined the best way to summarize performance with percentiles, and developed an intuition to measure performance. We introduced our case study, and then we installed the required tools to run the code samples and the source code provided with this book. In the next chapter, we will look at available tools to measure JVM performance and analyze the performance of our Scala applications.

About the Authors

  • Vincent Theron

    Vincent Theron is a professional software engineer with 9 years of experience. He discovered Scala 6 years ago and uses it to build highly scalable and reliable applications. He designs software to solve business problems in various industries, including online gambling, financial trading, and, most recently, advertising. He earned a master's degree in computer science and engineering from Université Paris-Est Marne-la-Vallée. Vincent lives in the Boston area with his wife, his son, and two furry cats.

  • Michael Diamant

    Michael Diamant is a professional software engineer and functional programming enthusiast. He began his career in 2009 focused on Java and the object-oriented programming paradigm. After learning about Scala in 2011, he has focused on using Scala and the functional programming paradigm to build software systems in the financial trading and advertising domains. Michael is a graduate of Worcester Polytechnic Institute and lives in the Boston area.
