This book is written with intermediate to advanced Go developers in mind. These developers will be looking to squeeze more performance out of their Go application. To do this, this book will help to drive the four golden signals as defined in the Site Reliability Engineering Workbook (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/). If we can reduce latency and errors, as well as increase traffic whilst reducing saturation, our programs will continue to be more performant. Following the ideology of the four golden signals is beneficial for anyone developing a Go application with performance in mind.
In this chapter, you'll be introduced to some of the core concepts of performance in computer science. You'll learn some of the history of the Go computer programming language, how its creators decided that it was important to put performance at the forefront of the language, and why writing performant Go is important. Go is a programming language designed with performance in mind, and this book will take you through some of the highlights on how to use some of Go's design and tooling to your advantage. This will help you to write more efficient code.
In this chapter, we will cover the following topics:
- Understanding performance in computer science
- A brief history of Go
- The ideology behind Go performance
These topics are provided to guide you in beginning to understand the direction you need to take to write highly performant code in the Go language.
For this book, you should have a moderate understanding of the Go language. Some key concepts to understand before exploring these topics include the following:
- The Go reference specification: https://golang.org/ref/spec
- How to write Go code: https://golang.org/doc/code.html
- Effective Go: https://golang.org/doc/effective_go.html
Throughout this book, there will be many code samples and benchmark results. These are all accessible via the GitHub repository at https://github.com/bobstrecansky/HighPerformanceWithGo/.
If you have a question or would like to request a change to the repository, feel free to create an issue within the repository at https://github.com/bobstrecansky/HighPerformanceWithGo/issues/new.
Performance in computer science is a measure of work that can be accomplished by a computer system. Performant code is vital to many different groups of developers. Whether you're part of a large-scale software company that needs to quickly deliver masses of data to customers, an embedded computing device programmer who has limited computing resources available, or a hobbyist looking to squeeze more requests out of the Raspberry Pi that you are using for your pet project, performance should be at the forefront of your development mindset. Performance matters, especially when your scale continues to grow.
It is important to remember that we are sometimes limited by physical bounds. CPU, memory, disk I/O, and network connectivity all have performance ceilings based on the hardware that you either purchase or rent from a cloud provider. There are other systems that may run concurrently alongside our Go programs that can also consume resources, such as OS packages, logging utilities, monitoring tools, and other binaries—it is prudent to remember that our programs are very frequently not the only tenants on the physical machines they run on.
Optimized code generally helps in many ways, including the following:
- Decreased response time: The total amount of time it takes to respond to a request.
- Decreased latency: The time delay between a cause and effect within a system.
- Increased throughput: The rate at which data can be processed.
- Higher scalability: More work can be processed within a contained system.
There are many ways to service more requests within a computer system. Adding more individual computers (often referred to as horizontal scaling) or upgrading to more powerful computers (often referred to as vertical scaling) are common practices used to handle demand within a computer system. One of the fastest ways to service more requests without needing additional hardware is to increase code performance. Performance engineering acts as a way to help with both horizontal and vertical scaling. The more performant your code is, the more requests you can handle on a single machine. This pattern can potentially result in fewer or less expensive physical hosts to run your workload. This is a large value proposition for many businesses and hobbyists alike, as it helps to drive down the cost of operation and improves the end user experience.
Big O notation (https://en.wikipedia.org/wiki/Big_O_notation) is commonly used to describe the limiting behavior of a function based on the size of the inputs. In computer science, Big O notation is used to explain how efficient algorithms are in comparison to one another—we'll discuss this more in detail in Chapter 2, Data Structures and Algorithms. Big O notation is important in optimizing performance because it is used as a comparison operator in explaining how well algorithms will scale. Understanding Big O notation will help you to write more performant code, as it will help to drive performance decisions in your code as the code is being composed. Knowing at what point different algorithms have relative strengths and weaknesses helps you to determine the correct choice for the implementation at hand. We can't improve what we can't measure—Big O notation helps us to give a concrete measurement to the problem statement at hand.
As we make our performance improvements, we will need to continually monitor our changes to view impact. Many methods can be used to monitor the long-term performance of computer systems. A couple of examples of these methods would be the following:
- Brendan Gregg's USE Method: Utilization, saturation, and errors (www.brendangregg.com/usemethod.html)
- Tom Wilkie's RED Metrics: Requests, errors, and duration (https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/)
- Google SRE's four Golden Signals: Latency, traffic, errors, and saturation (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/)
We will discuss these concepts further in Chapter 15, Comparing Code Quality Across Versions. These paradigms help us to make smart decisions about the performance optimizations in our code as well as avoid premature optimization. Premature optimization plays as a very crucial aspect for many a computer programmers. Very frequently, we have to determine what fast enough is. We can waste our time trying to optimize a small segment of code when many other code paths have an opportunity to improve from a performance perspective. Go's simplicity allows for additional optimization without cognitive load overhead or an increase in code complexity. The algorithms that we will discuss in Chapter 2, Data Structures and Algorithms, will help us to avoid premature optimization.
In this book, we will also attempt to understand what exactly we are optimizing for. The techniques for optimizing for CPU or memory utilization may look very different than optimizing for I/O or network latency. Being cognizant of your problem space as well as your limitations within your hardware and upstream APIs will help you to determine how to optimize for the problem statement at hand. Optimization also often shows diminishing returns. Frequently the return on development investment for a particular code hotspot isn't worthwhile based on extraneous factors, or adding optimizations will decrease readability and increase risk for the whole system. If you can determine whether an optimization is worth doing early on, you'll be able to have a more narrowly scoped focus and will likely continue to develop a more performant system.
It can be helpful to understand baseline operations within a computer system. Peter Norvig, the Director of Research at Google, designed a table (the image that follows) to help developers understand the various common timing operations on a typical computer (https://norvig.com/21-days.html#answers):
Having a clear understanding of how different parts of a computer can interoperate with one another helps us to deduce where our performance optimizations should lie. As derived from the table, it takes quite a bit longer to read 1 MB of data sequentially from disk versus sending 2 KBs over a 1 Gbps network link. Being able to have back-of-the-napkin math comparison operators for common computer interactions can very much help to deduce which piece of your code you should optimize next. Determining bottlenecks within your program becomes easier when you take a step back and look at a snapshot of the system as a whole.
Breaking down performance problems into small, manageable sub problems that can be improved upon concurrently is a helpful shift into optimization. Trying to tackle all performance problems at once can often leave the developer stymied and frustrated, and often lead to many performance efforts failing. Focusing on bottlenecks in the current system can often yield results. Fixing one bottleneck will often quickly identify another. For example, after you fix a CPU utilization problem, you may find that your system's disk can't write the values that are computed fast enough. Working through bottlenecks in a structured fashion is one of the best ways to create a piece of performant and reliable software.
Starting at the bottom of the pyramid in the following image, we can work our way up to the top. This diagram shows a suggested priority for making performance optimizations. The first two levels of this pyramid—the design level and algorithm and data structures level—will often provide more than ample real-world performance optimization targets. The following diagram shows an optimization strategy that is often efficient. Changing the design of a program alongside the algorithms and data structures are often the most efficient places to improve the speed and quality of code bases:
Design-level decisions often have the most measurable impact on performance. Determining goals during the design level can help to determine the best methodology for optimization. For example, if we are optimizing for a system that has slow disk I/O, we should prioritize lowering the number of calls to our disk. Inversely, if we are optimizing for a system that has limited compute resources, we need to calculate only the most essential values needed for our program's response. Creating a detailed design document at the inception of a new project will help with understanding where performance gains are important and how to prioritize time within the project. Thinking from a perspective of transferring payloads within a compute system can often lead to noticing places where optimization can occur. We will talk more about design patterns in Chapter 3, Understanding Concurrency.
Algorithm and data structure decisions often have a measurable performance impact on a computer program. We should focus on trying to utilize constant O(1), logarithmic O(log n), linear O(n), and log-linear O(n log n) functions while writing performant code. Avoiding quadratic complexity O(n2) at scale is also important for writing scalable programs. We will talk more about O notation and its relation to Go in Chapter 2, Data Structures and Algorithms.
Robert Griesemer, Rob Pike, and Ken Thompson created the Go programming language in 2007. It was originally designed as a general-purpose language with a keen focus on systems programming. The creators designed the Go language with a couple of core tenets in mind:
- Static typing
- Runtime efficiency
- Easy to learn
- High-performance networking and multiprocessing
Go was publicly announced in 2009 and v1.0.3 was released on March 3, 2012. At the time of the writing of this book, Go version 1.14 has been released, and Go version 2 is on the horizon. As mentioned, one of Go's initial core architecture considerations was to have high-performance networking and multiprocessing. This book will cover a lot of the design considerations that Griesemer, Pike, and Thompson have implemented and evangelized on behalf of their language. The designers created Go because they were unhappy with some of the choices and directions that were made in the C++ language. Long-running complications on large distributed compile clusters were a main source of pain for the creators. During this time, the authors started learning about the next C++ programming language release, dubbed C++x11. This C++ release had very many new features being planned, and the Go team decided they wanted to adopt an idiom of less is more in the computing language that they were using to do their work.
The authors of the language had their first meeting where they discussed starting with the C programming language, building features and removing extraneous functionality they didn't feel was important to the language. The team ended up starting from scratch, only borrowing some of the most atomic pieces of C and other languages they were comfortable with writing. After their work started to take form, they realized that they were taking away some of the core traits of other languages, notably the absence of headers, circular dependencies, and classes. The authors believe that even with the removal of many of these fragments, Go still can be more expressive than its predecessors.
The standard library in Go follows this same pattern. It has been designed with both simplicity and functionality in mind. Adding slices, maps, and composite literals to the standard library helped the language to become opinionated early. Go's standard library lives within $GOROOT and is directly importable. Having these default data structures built into the language enables developers to use these data structures effectively. The standard library packages are bundled in with the language distribution and are available immediately after you install Go. It is often mentioned that the standard library is a solid reference on how to write idiomatic Go. The reasoning on standard library idiomatic Go is these core library pieces are written clearly, concisely, and with quite a bit of context. They also add small but important implementation details well, such as being able to set timeouts for connections and being explicitly able to gather data from underlying functions. These language details have helped the language to flourish.
Some of the notable Go runtime features include the following:
- Garbage collection for safe memory management (a concurrent, tri-color, mark-sweep collector)
- Concurrency to support more than one task simultaneously (more about this in Chapter 3, Understanding Concurrency)
- Stack management for memory optimization (segmented stacks were used in the original implementation; stack copying is the current incantation of Go stack management)
Go's binary release also includes a vast toolset for creating optimized code. Within the Go binary, the go command has a lot of functions that help to build, deploy, and validate code. Let's discuss a couple of the core pieces of functionality as they relate to performance.
Godoc is Go's documentation tool that keeps the cruxes of documentation at the forefront of program development. A clean implementation, in-depth documentation, and modularity are all core pieces of building a scalable, performant system. Godoc helps with accomplishing these goals by auto-generating documentation. Godoc extracts and generates documentation from packages it finds within $GOROOT and $GOPATH. After generating this documentation, Godoc runs a web server and displays the generated documentation as a web page. Documentation for the standard library can be seen on the Go website. As an example, the documentation for the standard library pprof package can be found at https://golang.org/pkg/net/http/pprof/.
The addition of gofmt (Go's code formatting tool) to the language brought a different kind of performance to Go. The inception of gofmt allowed Go to be very opinionated when it comes to code formatting. Having precise enforced formatting rules makes it possible to write Go in a way that is sensible for the developer whilst letting the tool format the code to follow a consistent pattern across Go projects. Many developers have their IDE or text editor perform a gofmt command when they save the file that they are composing. Consistent code formatting reduces the cognitive load and allows the developer to focus on other aspects of their code, rather than determining whether to use tabs or spaces to indent their code. Reducing the cognitive load helps with developer momentum and project velocity.
Go's build system also helps with performance. The go build command is a powerful tool that compiles packages and their dependencies. Go's build system is also helpful in dependency management. The resulting output from the build system is a compiled, statically linked binary that contains all of the necessary elements to run on the platform that you've compiled for. go module (a new feature with preliminary support introduced in Go 1.11 and finalized in Go 1.13) is a dependency management system for Go. Having explicit dependency management for a language helps to deliver a consistent experience with groupings of versioned packages as a cohesive unit, allowing for more reproducible builds. Having reproducible builds helps developers to create binaries via a verifiable path from the source code. The optional step to create a vendored directory within your project also helps with locally storing and satisfying dependencies for your project.
Compiled binaries are also an important piece of the Go ecosystem. Go also lets you build your binaries for other target environments, which can be useful if you need to cross-compile a binary for another computer architecture. Having the ability to build a binary that can run on any platform helps you to rapidly iterate and test your code to find bottlenecks on alternate architectures before they become more difficult to fix. Another key feature of the language is that you can compile a binary on one machine with the OS and architecture flags, and that binary is executable on another system. This is crucial when the build system has high amounts of system resources and the build target has limited computing resources. Building a binary for two architectures is as simple as setting build flags:
To build a binary for macOS X on an x86_64 architecture, the following execution pattern is used:
GOOS=darwin GOARCH=amd64 go build -o myapp.osx
To build a binary for Linux on an ARM architecture, the following execution pattern is used:
GOOS=linux GOARCH=arm go build -o myapp.linuxarm
You can find a list of all the valid combinations of GOOS and GOARCH using the following command:
go tool dist list -json
This can be helpful in allowing you to see all of the CPU architectures and OSes that the Go language can compile binaries for.
The concept of benchmarking will also be a core tenant in this book. Go's testing functionality has performance built in as a first-class citizen. Being able to trigger a test benchmark during your development and release processes makes it possible to continue to deliver performant code. As new side effects are introduced, features are added, and code complexity increases, it's important to have a method for validating performance regression across a code base. Many developers add benchmarking results to their continuous integration practices to ensure that their code continues to be performant with all of the new pull requests added to a repository. You can also use the benchstat utility provided in the golang.org/x/perf/cmd/benchstat package to compare statistics about benchmarks. The following sample repository has an example of benchmarking the standard library's sort functions, at https://github.com/bobstrecansky/HighPerformanceWithGo/tree/master/1-introduction.
Having testing and benchmarking married closely in the standard library encourages performance testing as part of your code release process. It's always important to remember that benchmarks are not always indicative of real-world performance scenarios, so take the results you receive from them with a grain of salt. Logging, monitoring, profiling, and tracing a running system (as will be discussed in Chapter 12, Profiling Go Code; Chapter 13, Tracing Go Code; and Chapter 15, Comparing Code Quality Across Versions) can help to validate the assumptions that you have made with your benchmarking after you've committed the code you are working on.
Much of Go's performance stance is gained from concurrency and parallelism. Goroutines and channels are often used to perform many requests in parallel. The tools available for Go help to achieve near C-like performance, with very readable semantics. This is one of the many reasons that Go is commonly used by developers in large-scale solutions.
When Go was conceived, multi-core processors were beginning to become more and more commonplace in commercially available commodity hardware. The authors of the Go language recognized a need for concurrency within their new language. Go makes concurrent programming easy with goroutines and channels (which we will discuss in Chapter 3, Understanding Concurrency). Goroutines, lightweight computation threads that are distinct from OS threads, are often described as one of the best features of the language. Goroutines execute their code in parallel and complete when their work is done. The startup time for a goroutine is faster than the startup time for a thread, which allows a lot more concurrent work to occur within your program. Compared to a language such as Java that relies on OS threads, Go can be much more efficient with its multiprocessing model. Go is also intelligent about blocking operations with respect to goroutines. This helps Go to be more performant in memory utilization, garbage collection, and latency. Go's runtime uses the GOMAXPROCS variable to multiplex goroutines onto real OS threads. We will learn more about goroutines in Chapter 2, Data Structures and Algorithms.
Channels provide a model to send and receive data between goroutines, whilst skipping past synchronization primitives provided by the underlying platform. With properly thought-out goroutines and channels, we can achieve high performance. Channels can be both buffered and unbuffered, so the developer can pass a dynamic amount of data through an open channel until the value has been received by the receiver, at which time the channel is unblocked by the sender. If the channel is buffered, the sender blocks for the given size of the buffer. Once the buffer has been filled, the sender will unblock the channel. Lastly, the close() function can be invoked to indicate that the channel will not receive any more values. We will learn more about channels in Chapter 3, Understanding Concurrency.
Another initial goal was to approach the performance of C for comparable programs. Go also has extensive profiling and tracing tools baked into the language that we'll learn about in Chapter 12, Profiling Go Code, and Chapter 13, Tracing Go Code. Go gives developers the ability to see a breakdown of goroutine usage, channels, memory and CPU utilization, and function calls as they pertain to individual calls. This is valuable because Go makes it easy to troubleshoot performance problems with data and visualizations.
Go is often used in large-scale distributed systems due to its operational simplicity and its built-in network primitives in the standard library. Being able to rapidly iterate whilst developing is an essential part of building a robust, scalable system. High network latency is often an issue in distributed systems, and the Go team has worked to try and alleviate this concern on their platform. From standard library network implementations to making gRPC a first-class citizen for passing buffered messaging between clients and servers on a distributed platform, the Go language developers have put distributed systems problems at the forefront of the problem space for their language and have come up with some elegant solutions for these complex problems.
In this chapter, we learned the core concepts of performance in computer science. We also learned some of the history of the Go computer programming language and how its inception ties in directly with performance work. Lastly, we learned that Go is used in a myriad of different cases because of the utility, flexibility, and extensibility of the language. This chapter has introduced concepts that will continually be built upon in this book, allowing you to rethink the way you are writing your Go code.
In Chapter 2, Data Structures and Algorithms, we'll dive into data structures and algorithms. We'll learn about different algorithms, their Big O notation, and how these algorithms are constructed in Go. We'll also learn about how these theoretical algorithms relate to real-world problems and write performant Go to serve large amounts of requests quickly and efficiently. Learning more about these algorithms will help you to become more efficient in the second layer of the optimizations triangle that was laid out earlier in this chapter.