Home Cloud & Networking Mastering Distributed Tracing

Mastering Distributed Tracing

By Yuri Shkuro
books-svg-icon Book
eBook $38.99 $26.99
Print $54.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $38.99 $26.99
Print $54.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Why Distributed Tracing?
About this book
Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.
Publication date:
February 2019
Publisher
Packt
Pages
444
ISBN
9781788628464

 

Chapter 1. Why Distributed Tracing?

Modern, internet-scale, cloud-native applications are very complex distributed systems. Building them is hard and debugging them is even harder. The growing popularity of microservices and functions-as-a-service (also known as FaaS or serverless) only exacerbates the problem; these architectural styles bring many benefits to the organizations adopting them, while complicating some of the aspects of operating the systems even further.

In this chapter, I will talk about the challenges of monitoring and troubleshooting distributed systems, including those built with microservices, and discuss how and why distributed tracing is in a unique position among the observability tools to address this problem. I will also describe my personal history with distributed tracing and why I decided to write this book.

 

Microservices and cloud-native applications


In the last decade, we saw a significant shift in how modern, internet-scale applications are being built. Cloud computing (infrastructure as a service) and containerization technologies (popularized by Docker) enabled a new breed of distributed system designs commonly referred to as microservices (and their next incarnation, FaaS). Successful companies like Twitter and Netflix have been able to leverage them to build highly scalable, efficient, and reliable systems, and to deliver more features faster to their customers.

While there is no official definition of microservices, a certain consensus has evolved over time in the industry. Martin Fowler, the author of many books on software design, argues that microservices architectures exhibit the following common characteristics [1]:

  • Componentization via (micro)services: The componentization of functionality in a complex application is achieved via services, or microservices, that are independent processes communicating over a network. The microservices are designed to provide fine-grained interfaces and to be small in size, autonomously developed, and independently deployable.

  • Smart endpoints and dumb pipes: The communications between services utilize technology-agnostic protocols such as HTTP and REST, as opposed to smart mechanisms like the Enterprise Service Bus (ESB).

  • Organized around business capabilities: Products not projects: the services are organized around business functions ("user profile service" or "fulfillment service"), rather than technologies. The development process treats the services as continuously evolving products rather than projects that are considered to be completed once delivered.

  • Decentralized governance: Allows different microservices to be implemented using different technology stacks.

  • Decentralized data management: Manifests in the decisions for both the conceptual data models and the data storage technologies being made independently between services.

  • Infrastructure automation: The services are built, released, and deployed with automated processes, utilizing automated testing, continuous integration, and continuous deployment.

  • Design for failure: The services are always expected to tolerate failures of their dependencies and either retry the requests or gracefully degrade their own functionality.

  • Evolutionary design: Individual components of a microservices architecture are expected to evolve independently, without forcing upgrades on the components that depend on them.

Because of the large number of microservices involved in building modern applications, rapid provisioning, rapid deployment via decentralized continuous delivery, strict DevOps practices, and holistic service monitoring are necessary to effectively develop, maintain, and operate such applications. The infrastructure requirements imposed by the microservices architectures spawned a whole new area of development of infrastructure platforms and tools for managing these complex cloud-native applications. In 2015, the Cloud Native Computing Foundation (CNCF) was created as a vendor-neutral home for many emerging open source projects in this area, such as Kubernetes, Prometheus, Linkerd, and so on, with a mission to "make cloud-native computing ubiquitous."

"Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil."

-- Cloud Native Computing Foundation Charter [2]

At the time of writing, the list of graduated and incubating projects at CNCF [3] contained 20 projects (Figure 1.1). They all have a single common theme: providing a platform for efficient deployment and operation of cloud-native applications. The observability tools occupy an arguably disproportionate (20 percent) number of slots:

  • Prometheus: A monitoring and alerting platform

  • Fluentd: A logging data collection layer

  • OpenTracing: A vendor-neutral APIs and instrumentation for distributed tracing

  • Jaeger: A distributed tracing platform

CNCF sandbox projects, the third category not shown in Figure 1.1, include two more monitoring-related projects: OpenMetrics and Cortex. Why is observability in such high demand for cloud-native applications?

Figure 1.1: Graduated and incubating projects at CNCF as of January 2019. Project names and logos are registered trademarks of the Linux Foundation.

 

What is observability?


The term "observability" in control theory states that the system is observable if the internal states of the system and, accordingly, its behavior, can be determined by only looking at its inputs and outputs. At the 2018 Observability Practitioners Summit [4], Bryan Cantrill, the CTO of Joyent and one of the creators of the tool dtrace, argued that this definition is not practical to apply to software systems because they are so complex that we can never know their complete internal state, and therefore the control theory's binary measure of observability is always zero (I highly recommend watching his talk on YouTube: https://youtu.be/U4E0QxzswQc). Instead, a more useful definition of observability for a software system is its "capability to allow a human to ask and answer questions". The more questions we can ask and answer about the system, the more observable it is.

Figure 1.2: The Twitter debate

There are also many debates and Twitter zingers about the difference between monitoring and observability. Traditionally, the term monitoring was used to describe metrics collection and alerting. Sometimes it is used more generally to include other tools, such as "using distributed tracing to monitor distributed transactions." The definition by Oxford dictionaries of the verb "monitor" is "to observe and check the progress or quality of (something) over a period of time; keep under systematic review." However, it is better thought of as the process of observing certain a priori defined performance indicators of our software system, such as those measuring an impact on the end user experience, like latency or error counts, and using their values to alert us when these signals indicate an abnormal behavior of the system. Metrics, logs, and traces can all be used as a means to extract those signals from the application. We can then reserve the term "observability" for situations when we have a human operator proactively asking questions that were not predefined. As Brian Cantrill put it in his talk, this process is debugging, and we need to "use our brains when debugging." Monitoring does not require a human operator; it can and should be fully automated.

"If you want to talk about (metrics, logs, and traces) as pillars of observability–great.

The human is the foundation of observability!"

-- Brian Cantrill

In the end, the so-called "three pillars of observability" (metrics, logs, and traces) are just tools, or more precisely, different ways of extracting sensor data from the applications. Even with metrics, the modern time series solutions like Prometheus, InfluxDB, or Uber's M3 are capable of capturing the time series with many labels, such as which host emitted a particular value of a counter. Not all labels may be useful for monitoring, since a single misbehaving service instance in a cluster of thousands does not warrant an alert that wakes up an engineer. But when we are investigating an outage and trying to narrow down the scope of the problem, the labels can be very useful as observability signals.

 

The observability challenge of microservices


By adopting microservices architectures, organizations are expecting to reap many benefits, from better scalability of components to higher developer productivity. There are many books, articles, and blog posts written on this topic, so I will not go into that. Despite the benefits and eager adoption by companies large and small, microservices come with their own challenges and complexity. Companies like Twitter and Netflix were successful in adopting microservices because they found efficient ways of managing that complexity. Vijay Gill, Senior VP of Engineering at Databricks, goes as far as saying that the only good reason to adopt microservices is to be able to scale your engineering organization and to "ship the org chart" [2].

Vijay Gill's opinion may not be a popular one yet. A 2018 "Global Microservices Trends" study [6] by Dimensional Research® found that over 91% of interviewed professionals are using or have plans to use microservices in their systems. At the same time, 56% say each additional microservice "increases operational challenges," and 73% find "troubleshooting is harder" in a microservices environment. There is even a famous tweet about adopting microservices:

Figure 1.3: The tweet in question

Consider Figure 1.4, which gives a visual representation of a subset of microservices in Uber's microservices architecture, rendered by Uber's distributed tracing platform Jaeger. It is often called a service dependencies graph or a topology map. The circles (nodes in the graph) represent different microservices. The edges are drawn between nodes that communicate with each other. The diameter of the nodes is proportional to the number of other microservices connecting to them, and the width of an edge is proportional to the volume of traffic going through that edge.

The picture is already so complex that we don't even have space to include the names of the services (in the real Jaeger UI you can see them by moving the mouse over nodes). Every time a user takes an action on the mobile app, a request is executed by the architecture that may require dozens of different services to participate in order to produce a response. Let's call the path of this request a distributed transaction.

Figure 1.4: A visual representation of a subset of Uber's microservices architecture and a hypothetical transaction

So, what are the challenges of this design? There are quite a few:

  • In order to run these microservices in production, we need an advanced orchestration platform that can schedule resources, deploy containers, auto-scale, and so on. Operating an architecture of this scale manually is simply not feasible, which is why projects like Kubernetes became so popular.

  • In order to communicate, microservices need to know how to find each other on the network, how to route around problematic areas, how to perform load balancing, how to apply rate limiting, and so on. These functions are delegated to advanced RPC frameworks or external components like network proxies and service meshes.

  • Splitting a monolith into many microservices may actually decrease reliability. Suppose we have 20 components in the application and all of them are required to produce a response to a single request. When we run them in a monolith, our failure modes are restricted to bugs and potentially a crush of the whole server running the monolith. But if we run the same components as microservices, on different hosts and separated by a network, we introduce many more potential failure points, from network hiccups, to resource constraints due to noisy neighbors. Even if each microservice succeeds in 99.9% of cases, the whole application that requires all of them to work for a given request can only succeed 0.99920 = 98.0% of the time. Distributed, microservices-based applications must become more complicated, for example, implementing retries or opportunistic parallel reads, in order to maintain the same level of availability.

  • The latency may also increase. Assume each microservice has 1 ms average latency, but the 99th percentile is 1s. A transaction touching just one of these services has a 1% chance to take ≥ 1s. A transaction touching 100 of these services has 1 - (1 - 0.01)100 = 63% chance to take ≥ 1s.

  • Finally, the observability of the system is dramatically reduced if we try to use traditional monitoring tools.

When we see that some requests to our system are failing or slow, we want our observability tools to tell us the story about what happens to that request. We want to be able to ask questions like these:

  • Which services did a request go through?

  • What did every microservice do when processing the request?

  • If the request was slow, where were the bottlenecks?

  • If the request failed, where did the error happen?

  • How different was the execution of the request from the normal behavior of the system?

    • Were the differences structural, that is, some new services were called, or vice versa, some usual services were not called?

    • Were the differences related to performance, that is, some service calls took a longer or shorter time than usual?

  • What was the critical path of the request?

  • And perhaps most importantly, if selfishly, who should be paged?

Unfortunately, traditional monitoring tools are ill-equipped to answer these questions for microservices architectures.

 

Traditional monitoring tools


Traditional monitoring tools were designed for monolith systems, observing the health and behavior of a single application instance. They may be able to tell us a story about that single instance, but they know almost nothing about the distributed transaction that passed through it. These tools lack the context of the request.

Metrics

It goes like this: "Once upon a time…something bad happened. The end." How do you like this story? This is what the chart in Figure 1.5 tells us. It's not completely useless; we do see a spike and we could define an alert to fire when this happens. But can we explain or troubleshoot the problem?

Figure 1.5: A graph of two time series representing (hypothetically) the volume of traffic to a service

Metrics, or stats, are numerical measures recorded by the application, such as counters, gauges, or timers. Metrics are very cheap to collect, since numeric values can be easily aggregated to reduce the overhead of transmitting that data to the monitoring system. They are also fairly accurate, which is why they are very useful for the actual monitoring (as the dictionary defines it) and alerting.

Yet the same capacity for aggregation is what makes metrics ill-suited for explaining the pathological behavior of the application. By aggregating data, we are throwing away all the context we had about the individual transactions.

In Chapter 11, Integration with Metrics and Logs, we will talk about how integration with tracing and context propagation can make metrics more useful by providing them with the lost context. Out of the box, however, metrics are a poor tool to troubleshoot problems within microservices-based applications.

Logs

Logging is an even more basic observability tool than metrics. Every programmer learns their first programming language by writing a program that prints (that is, logs) "Hello, World!" Similar to metrics, logs struggle with microservices because each log stream only tells us about a single instance of a service. However, the evolving programming paradigm creates other problems for logs as a debugging tool. Ben Sigelman, who built Google's distributed tracing system Dapper [7], explained it in his KubeCon 2016 keynote talk [8] as four types of concurrency (Figure 1.6):

Figure 1.6: Evolution of concurrency

Years ago, applications like early versions of Apache HTTP Server handled concurrency by forking child processes and having each process handle a single request at a time. Logs collected from that single process could do a good job of describing what happened inside the application.

Then came multi-threaded applications and basic concurrency. A single request would typically be executed by a single thread sequentially, so as long as we included the thread name in the logs and filtered by that name, we could still get a reasonably accurate picture of the request execution.

Then came asynchronous concurrency, with asynchronous and actor-based programming, executor pools, futures, promises, and event-loop-based frameworks. The execution of a single request may start on one thread, then continue on another, then finish on the third. In the case of event loop systems like Node.js, all requests are processed on a single thread but when the execution tries to make an I/O, it is put in a wait state and when the I/O is done, the execution resumes after waiting its turn in the queue.

Both of these asynchronous concurrency models result in each thread switching between multiple different requests that are all in flight. Observing the behavior of such a system from the logs is very difficult, unless we annotate all logs with some kind of unique id representing the request rather than the thread, a technique that actually gets us close to how distributed tracing works.

Finally, microservices introduced what we can call "distributed concurrency." Not only can the execution of a single request jump between threads, but it can also jump between processes, when one microservice makes a network call to another. Trying to troubleshoot request execution from such logs is like debugging without a stack trace: we get small pieces, but no big picture.

In order to reconstruct the flight of the request from the many log streams, we need powerful logs aggregation technology and a distributed context propagation capability to tag all those logs in different processes with a unique request id that we can use to stitch those requests together. We might as well be using the real distributed tracing infrastructure at this point! Yet even after tagging the logs with a unique request id, we still cannot assemble them into an accurate sequence, because the timestamps from different servers are generally not comparable due to clock skews. In Chapter 11, Integration with Metrics and Logs, we will see how tracing infrastructure can be used to provide the missing context to the logs.

 

Distributed tracing


As soon as we start building a distributed system, traditional monitoring tools begin struggling with providing observability for the whole system, because they were designed to observe a single component, such as a program, a server, or a network switch. The story of a single component may no doubt be very interesting, but it tells us very little about the story of a request that touches many of those components. We need to know what happens to that request in all of them, end-to-end, if we want to understand why a system is behaving pathologically. In other words, we first want a macro view.

At the same time, once we get that macro view and zoom in to a particular component that seems to be at fault for the failure or performance problems with our request, we want a micro view of what exactly happened to that request in that component. Most other tools cannot tell that to us either because they only observe what "generally" happens in the component as a whole, for example, how many requests per second it handles (metrics), what events occurred on a given thread (logs), or which threads are on and off CPU at a given point in time (profilers). They don't have the granularity or context to observe a specific request.

Distributed tracing takes a request-centric view. It captures the detailed execution of causally-related activities performed by the components of a distributed system as it processes a given request. In Chapter 3, Distributed Tracing Fundamentals, I will go into more detail on how exactly it works, but in a nutshell:

  • Tracing infrastructure attaches contextual metadata to each request and ensures that metadata is passed around during the request execution, even when one component communicates with another over a network.

  • At various trace points in the code, the instrumentation records events annotated with relevant information, such as the URL of an HTTP request or an SQL statement of a database query.

  • Recorded events are tagged with the contextual metadata and explicit causality references to prior events.

That deceptively simple technique allows the tracing infrastructure to reconstruct the whole path of the request, through the components of a distributed system, as a graph of events and causal edges between them, which we call a "trace." A trace allows us to reason about how the system was processing the request. Individual graphs can be aggregated and clustered to infer patterns of behaviors in the system. Traces can be displayed using various forms of visualizations, including Gantt charts (Figure 1.7) and graph representations (Figure 1.8), to give our visual cortex cues to finding the root cause of performance problems:

Figure 1.7: Jaeger UI view of a single request to the HotROD application, further discussed in chapter 2. In the bottom half, one of the spans (named GetDriver from service redis, with a warning icon) is expanded to show additional information, such as tags and span logs.

Figure 1.8: Jaeger UI view of two traces A and B being compared structurally in the graph form (best viewed in color). Light/dark green colors indicate services that were encountered more/only in trace B, and light/dark red colors indicate services encountered more/only in trace A.

By taking a request-centric view, tracing helps to illuminate different behaviors of the system. Of course, as Bryan Cantrill said in his KubeCon talk, just because we have tracing, it doesn't mean that we eliminated performance pathologies in our applications. We actually need to know how to use it to ask sophisticated questions that we now can ask with this powerful tool. Fortunately, distributed tracing is able to answer all the questions we posed in The observability challenge of microservices section.

 

My experience with tracing


My first experience with distributed tracing was somewhere around 2010, even though we did not use that term at the time. I was working on a trade capture and trade processing system at Morgan Stanley. It was built as a service-oriented architecture (SOA), and the whole system contained more than a dozen different components deployed as independent Java applications. The system was used for over-the-counter interest rate derivatives products (like swaps and options), which had high complexity but not a huge trading volume, so most of the system components were deployed as a single instance, with the exception of the stateless pricers that were deployed as a cluster.

One of the observability challenges with the system was that each trade had to go through a complicated sequence of additional changes, matching, and confirmation flows, implemented by the different components of the system.

To give us visibility into the various state transitions of the individual trades, we used an APM vendor (now defunct) that was essentially implementing a distributed tracing platform. Unfortunately, our experience with that technology was not particularly stellar, with the main challenge being the difficulty of instrumenting our applications for tracing, which involved creating aspect-oriented programming (AOP) - style instructions in the XML files and trying to match on the signature of the internal APIs. The approach was very fragile, as changes to the internal APIs would cause the instrumentation to become ineffective, without good facilities to enforce it via unit testing. Getting instrumentation into existing applications is one of the main difficulties in adopting distributing tracing, as we will discuss in this book.

When I joined Uber in mid-2015, the engineering team in New York had only a handful of engineers, and many of them were working in the metrics system, which later became known as M3. At the time, Uber was just starting its journey towards breaking the existing monolith and replacing it with microservices. The Python monolith, appropriately called "API", was already instrumented with another home-grown tracing-like system called Merckx.

The major shortcoming with Merckx was its design for the days of a monolithic application. It lacked any concept of distributed context propagation. It recorded SQL queries, Redis calls, and even calls to other services, but there was no way to go more than one level deep. It also stored the existing in-process context in a global, thread-local storage, and when many new Python microservices at Uber began adopting an event-loop-based framework Tornado, the propagation mechanism in Merckx was unable to represent the state of many concurrent requests running on the same thread. By the time I joined Uber, Merckx was in maintenance mode, with hardly anyone working on it, even though it had active users. Given the new observability theme of the New York engineering team, I, along with another engineer, Onwukike Ibe, took the mantle of building a fully-fledged distributed tracing platform.

I had no experience with building such systems in the past, but after reading the Dapper paper from Google, it seemed straightforward enough. Plus, there was already an open source clone of Dapper, the Zipkin project, originally built by Twitter. Unfortunately, Zipkin did not work for us out of the box.

In 2014, Uber started building its own RPC framework called TChannel. It did not really become popular in the open source world, but when I was just getting started with tracing, many services at Uber were already using that framework for inter-process communications. The framework came with tracing instrumentation built-in, even natively supported in the binary protocol format. So, we already had traces being generated in production, only nothing was gathering and storing them.

I wrote a simple collector in Go that was receiving traces generated by TChannel in a custom Thrift format and storing them in the Cassandra database in the same format that the Zipkin project used. This allowed us to deploy the collectors alongside the Zipkin UI, and that's how Jaeger was born. You can read more about this in a post on the Uber Engineering blog [9].

Having a working tracing backend, however, was only half of the battle. Although TChannel was actively used by some of the newer services, many more existing services were using plain JSON over HTTP, utilizing many different HTTP frameworks in different programming languages. In some of the languages, for example, Java, TChannel wasn't even available or mature enough. So, we needed to solve the same problem that made our tracing experiment at Morgan Stanley fizzle out: how to get tracing instrumentation into hundreds of existing services, implemented with different technology stacks.

As luck would have it, I was attending one of the Zipkin Practitioners workshops organized by Adrian Cole from Pivotal, the lead maintainer of the Zipkin project, and that same exact problem was on everyone's mind. Ben Sigelman, who founded his own observability company Lightstep earlier that year, was at the workshop too, and he proposed to create a project for a standardized tracing API that could be implemented by different tracing vendors independently, and could be used to create completely vendor-neutral, open source, reusable tracing instrumentation for many existing frameworks and drivers. We brainstormed the initial design of the API, which later became the OpenTracing project [10] (more on that in Chapter 6, Tracing Standards and Ecosystem). All examples in this book use the OpenTracing APIs for instrumentation.

The evolution of the OpenTracing APIs, which is still ongoing, is a topic for another story. Yet even the initial versions of OpenTracing gave us the peace of mind that if we started adopting it on a large scale at Uber, we were not going to lock ourselves into a single implementation. Having different vendors and open source projects participating in the development of OpenTracing was very encouraging. We implemented Jaeger-specific, fully OpenTracing-compatible tracing libraries in several languages (Java, Go, Python, and Node.js), and started rolling them out to Uber microservices. Last time I checked, we had close to 2,400 microservices instrumented with Jaeger.

I have been working in the area of distributed tracing even since. The Jaeger project has grown and matured. Eventually, we replaced the Zipkin UI with Jaeger's own, more modern UI built with React, and in April 2017, we open sourced all of Jaeger, from client libraries to the backend components.

By supporting OpenTracing, we were able to rely on the ever-growing ecosystem of open source instrumentation hosted at the opentracing-contrib organization on GitHub [11], instead of writing our own the way some other projects have done. This freed the Jaeger developers to focus on building a best-of-class tracing backend with data analysis and visualization features. Many other tracing solutions have borrowed features first introduced in Jaeger, just like Jaeger borrowed its initial feature set from Zipkin.

In the fall of 2017, Jaeger was accepted as an incubating project to CNCF, following in the footsteps of the OpenTracing project. Both projects are very active, with hundreds of contributors, and are used by many organizations around the world. The Chinese giant Alibaba even offers hosted Jaeger as part of its Alibaba Cloud services [12]. I probably spend 30-50% of my time at work collaborating with contributors to both projects, including code reviews for pull requests and new feature designs.

 

Why this book?


When I began studying distributed tracing after joining Uber, there was not a lot of information out there. The Dapper paper gave the foundational overview and the technical report by Raja Sambasivan and others [13] provided a very useful historical background. But there was little in the way of a recipe book that would answer more practical questions, such as:

  • Where do I start with tracing in a large organization?

  • How do I drive adoption of tracing instrumentation across existing systems?

  • How does the instrumentation even work? What are the basics? What are the recommended patterns?

  • How do I get the most benefit and return on investment from tracing?

  • What do I do with all that tracing data?

  • How do I operate a tracing backend in real production and not in a toy application?

In the early 2018, I realized that I had pretty good answers to these questions, while most people who were just starting to look into tracing still didn't, and no comprehensive guide has been published anywhere. Even the basic instrumentation steps are often confusing to people if they do not understand the underlying concepts, as evidenced by the many questions posted in the Jaeger and OpenTracing chat rooms.

When I gave the OpenTracing tutorial at the Velocity NYC conference in 2017, I created a GitHub repository that contained step-by-step walkthroughs for instrumentation, from a basic "Hello, World!" program to a small microservices-based application. The tutorials were repeated in several programming languages (I originally created ones for Java, Go, and Python, and later other people created more, for Node.js and C#). I have seen time and again how these most simple tutorials help people to learn the ropes:

Figure 1.9: Feedback about a tutorial

So, I was thinking, maybe I should write a book that would cover not just the instrumentation tutorials, but give a comprehensive overview of the field, from its history and fundamentals to practical advice about where to start and how to get the most benefits from tracing. To my surprise, Andrew Waldron from Packt Publishing reached out to me offering to do exactly that. The rest is history, or rather, this book.

One aspect that made me reluctant to start writing was the fact that the boom of microservices and serverless created a big gap in the observability solutions that can address the challenges posed by these architectural styles, and tracing is receiving a lot of renewed interest, even though the basic idea of distributed tracing systems is not new. Accordingly, there are a lot of changes happening in this area, and there was a risk that anything I wrote would quickly become obsolete. It is possible that in the future, OpenTracing might be replaced by some more advanced API. However, the thought that made me push through was that this book is not about OpenTracing or Jaeger. I use them as examples because they are the projects that are most familiar to me. The ideas and concepts introduced throughout the book are not tied to these projects. If you decide to instrument your applications with Zipkin's Brave library, or with OpenCensus, or even with some vendor's proprietary API, the fundamentals of instrumentation and distributed tracing mechanics are going to be the same, and the advice I give in the later chapters about practical applications and the adoption of tracing will still apply equally.

 

Summary


In this chapter, we took a high-level look at observability problems created by the new popular architectural styles, microservices and FaaS, and discussed why traditional monitoring tools are failing to fill this gap, whereas distributed tracing provides a unique way of getting both a macro and micro view of the system behavior when it executes individual requests.

I have also talked about my own experience and history with tracing, and why I wrote this book as a comprehensive guide to many engineers coming to the field of tracing.

In the next chapter, we are going to take a hands-on deep dive into tracing, by running a tracing backend and a microservices-based demo application. It will complement the claims made in this introduction with concrete examples of the capabilities of end-to-end tracing.

 

References


  1. Martin Fowler, James Lewis. Microservices: a definition of this new architectural term: https://www.martinfowler.com/articles/microservices.html.

  2. Cloud Native Computing Foundation (CNCF) Charter: https://github.com/cncf/foundation/blob/master/charter.md.

  3. CNCF projects: https://www.cncf.io/projects/.

  4. Bryan Cantrill. Visualizing Distributed Systems with Statemaps. Observability Practitioners Summit at KubeCon/CloudNativeCon NA 2018, December 10: https://sched.co/HfG2.

  5. Vijay Gill. The Only Good Reason to Adopt Microservices: https://lightstep.com/blog/the-only-good-reason-to-adopt-microservices/.

  6. Global Microservices Trends Report: https://go.lightstep.com/global-microservices-trends-report-2018.

  7. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed system tracing infrastructure. Technical Report dapper-2010-1, Google, April 2010.

  8. Ben Sigelman. Keynote: OpenTracing and Containers: Depth, Breadth, and the Future of Tracing. KubeCon/CloudNativeCon North America, 2016, Seattle: https://sched.co/8fRU.

  9. Yuri Shkuro. Evolving Distributed Tracing at Uber Engineering. Uber Eng Blog, February 2, 2017: https://eng.uber.com/distributed-tracing/.

  10. The OpenTracing Project: http://opentracing.io/.

  11. The OpenTracing Contributions: https://github.com/opentracing-contrib/.

  12. Alibaba Cloud documentation. OpenTracing implementation of Jaeger: https://www.alibabacloud.com/help/doc-detail/68035.htm.

  13. Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, Gregory R. Ganger. So, You Want to Trace Your Distributed System? Key Design Insights from Years of Practical Experience. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102. April 2014.

About the Author
  • Yuri Shkuro

    Dr. Yuri Shkuro holds a Ph.D. in Computer Science from University of Maryland, College Park, and a Masters degree in Computer Engineering from MEPhI (Moscow Engineering & Physics Institute), one of Russia's top three universities. Yuri is the author of many academic papers in the area of machine learning and neural networks, which have been cited in over 130 other publications. Yuris impressive career includes 15 years on Wall Street, building trading and risk management systems for derivatives at top investment banks, Goldman Sacks, JPMorgan Chase, and Morgan Stanley, and over three years on Ubers Infrastructure & Observability team. His open-source credentials include being a co-founder and a member of the Specification Council of the OpenTracing project, and the creator and current tech lead of Jaeger, a distributed tracing system developed at Uber. Both projects are incubating at the Cloud Native Computing Foundation. Outside of his academic and professional career, Yuri was an editor and co-producer of several animated shorts directed by Lev Polyakov, including:Only Love (2008), which screened at over 30 film festivals and won several awardsPiper the Goat and the Peace Pipe (2005), a winner at the Ottawa International Animation Festival

    Browse publications by this author
Latest Reviews (5 reviews total)
Very interesting and well written
It's a good book, with in-depth information about the design of distributed tracing systems. There are also good explanations of the OpenTracing concepts and code examples which are clear and easy to follow. I would have hoped to find more in-depth advice also about operating tracing systems in production.
Very easy to find and purchase a book.
Mastering Distributed Tracing
Unlock this book and the full library FREE for 7 days
Start now