The term observability has only been around in the software industry for a short time, but the concepts and goals it represents have been around for much longer. Indeed, ever since the earliest days of computing, programmers have been trying to answer the question: is the system doing what I think it should be?
For some, observability consists of buying a one-size-fits-all solution that includes logs, metrics, and traces, then configuring some off-the-shelf integrations and calling it a day. These tools can be used to increase visibility into a piece of software's behavior by providing mechanisms to produce and collect telemetry. The following are some examples of telemetry that can be added to a system:
- Keeping a count of the number of requests received
- Adding a log entry when an event occurs
- Recording a value for current memory consumption on a machine
- Tracing a request from a client all the way to a backend service
However, producing high-quality telemetry is only one part of the observability challenge. The other part is ensuring that events occurring across the different types of telemetry can be correlated in meaningful ways during analysis. The goal of observability is to answer questions that you may have about the system:
- If a problem occurred in production, what evidence would you have to be able to identify it?
- Why is this service suddenly overwhelmed when it was fine just a minute ago?
- If a specific condition from a client triggers an anomaly in some underlying service, would you know it without customers or support calling you?
These are some of the questions that the domain of observability can help answer. Observability is about empowering the people who build and operate distributed applications to understand their code's behavior while running in production. In this chapter, we will explore the following:
- Understanding cloud-native applications
- Looking at the shift to DevOps
- Reviewing the history of observability
- Understanding the history of OpenTelemetry
- Understanding the concepts of OpenTelemetry
Before we begin looking at the history of observability, it's important to understand the changes in the software industry that have led to the need for observability in the first place. Let's start with the shift to the cloud.
Understanding cloud-native applications
The way applications are built and deployed has drastically changed in the past few years with the increased adoption of the internet. An unprecedented increase in demand for services (for example, streaming media, social networks, and online shopping) powered by software has raised expectations for those services to be readily available. In addition, this increase in demand has fueled the need for developers to be able to scale their applications quickly. Cloud providers, such as Microsoft, Google, and Amazon, offer infrastructure to run applications at the click of a button, at a fraction of the cost and risk of deploying servers in traditional data centers. This enables developers to experiment more freely and reach a wider audience. Alongside this infrastructure, these cloud providers also offer managed services for databases, networking infrastructure, message queues, and many other services that, in the past, organizations would control internally.
One of the advantages these cloud-based providers offer is freeing up organizations to focus on the code that matters to their businesses. This replaces costly and time-consuming hardware deployments, as well as the need to operate services they lack expertise in. To take full advantage of cloud platforms, developers started looking at how applications that were originally developed as monoliths could be re-architected. The following are challenges that could be encountered when deploying monoliths to a cloud provider:
- Scaling a monolith is traditionally done by increasing the number of resources available to the monolith, also known as vertical scaling. Vertically scaling applications can only go as far as the largest available resource offered by a cloud provider.
- Improving the reliability of a monolith means deploying multiple instances to handle multiple failures, thus avoiding downtime. This is also known as horizontal scaling. Depending on the size of the monolith, this could quickly ramp up costs. This can also be wasteful if not all components of the monolith need to be replicated.
The specific challenges of building applications on cloud platforms have led developers to increasingly adopt a service-oriented architecture, or microservice architecture, that organizes applications as loosely coupled services, each with limited scope. The following figure shows a monolith architecture on the left, where all the services in the application are tightly coupled and operate within the same boundary. In contrast, the microservices architecture on the right shows us that the services are loosely coupled, and each service operates independently:
Applications built using microservices architecture provide developers with the ability to scale only the components needed to handle the additional load, meaning horizontal scaling becomes a much more attractive option. As it often does, a new architecture comes with its own set of trade-offs and challenges. The following are some of the new challenges cloud-native architecture presents that did not exist in traditional monolithic systems:
- Latency is introduced where none existed before, which can cause applications to fail in unexpected ways.
- Dependencies can and will fail, so applications must be built defensively to minimize cascading failures.
- Managing configuration and secrets across services is difficult.
- Service orchestration becomes complex.
With this change in architecture, the scope of each application is reduced significantly, making it easier to understand the needs of scaling each component. However, the increased number of independent services and added complexity also creates challenges for traditional operations (ops) teams, meaning organizations would also need to adapt.
Looking at the shift to DevOps
The shift to microservices has, in turn, led to a shift in how development teams are organized. Instead of a single large team managing a monolithic application, many teams each manage their own microservices. In traditional software development, a software development team would normally hand off the software once it was deemed complete. The handoff would be to an operations team, who would deploy the software and operate it in a production environment. As the number of services and teams grew, organizations found themselves growing their operations teams to unmanageable sizes, and quite often, those teams were still unable to keep up with the demands of the changing software.
This, in turn, led to an explosion of development teams that began the transition from the traditional development and operations organization toward the use of new hybrid DevOps teams. Using the DevOps approach, development teams write, test, build, package, deploy, and operate the code they develop. This ownership of the code through all stages of its life cycle empowers many developers and organizations to accelerate their feature development. This approach, of course, comes with different challenges:
- Increased dependencies across development teams mean it's possible that no one has a full picture of the entire application.
- Keeping track of changes across an organization can be difficult. This makes the answer to the "what caused this outage?" question more challenging to find.
- Individual teams must become familiar with many more tools. This can lead to too much focus on the tools themselves, rather than on their purpose.
The quick adoption of DevOps creates a new problem. Without the right amount of visibility across the systems managed by an organization, teams struggle to identify the root causes of the issues they encounter. This can lead to longer and more frequent outages, severely impacting the health and happiness of people across organizations. Let's look at how the methods of observing systems have evolved to adapt to this changing landscape.
Reviewing the history of observability
In many ways, being able to understand what a computer is doing is both fun and challenging when working with software. The ability to understand how systems are behaving has gone through quite a few iterations since the early 2000s. Many different markets have been created to solve this need, such as systems monitoring, log management, and application performance monitoring. As is often the case, when new challenges come knocking, the doors of opportunity open to those willing to tackle those challenges. Over the same period, countless vendors and open source projects have sprung up to help people who are building and operating services in managing their systems. The term observability, however, is a recent addition to the software industry and comes from control theory.
Wikipedia (https://en.wikipedia.org/wiki/Observability) defines observability as:
Observability is an evolution of its predecessors, built on lessons learned through years of experience and trial and error. To better understand where observability is today, it's important to understand where some of the methods used today by cloud-native application developers come from, and how they have changed over time. We'll start by looking at the following:
- Centralized logging
- Metrics and dashboards
- Tracing and analysis
One of the first pieces of software a programmer writes when learning a new language is a form of observability: "Hello, World!". Printing some text to the terminal is usually one of the quickest ways to provide users with feedback that things are working, and that's why "Hello, World" has been a tradition in computing since the late 1960s.
One of my favorite methods for debugging is still to add print statements across the code when things aren't working. I've even used this method to troubleshoot an application distributed across multiple servers before, although I can't say it was my proudest moment, as it caused one of our services to go down temporarily because of a typo in an unfamiliar editor. Print statements are great for simple debugging, but unfortunately, this only scales so far.
Once an application is large enough or distributed across enough systems, searching through the logs on individual machines is not practical. Applications can also run on ephemeral machines that may no longer be present when we need those logs. Combined, all of this created a need to make the logs available in a central location for persistent storage and searchability, and thus centralized logging was born.
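To make this more concrete, here is a minimal sketch, using only the Python standard library, of an application emitting each log entry as a single JSON line that a central log shipper could parse and forward. The field names (service, host, and so on) and the checkout service name are assumptions made for illustration; they are not a format required by any of the tools mentioned in this chapter.

```python
import io
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (field names are illustrative)."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "host": socket.gethostname(),
            "message": record.getMessage(),
        })


def build_logger(service_name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    return logger


# In production the stream would typically be stdout or a socket read by a
# log shipper; an in-memory buffer keeps this sketch self-contained.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order %s accepted", 42)
entry = json.loads(buffer.getvalue())
```

Including the host and service name in every entry is what keeps the logs searchable once they leave the machine that produced them.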
There are many available vendors that provide a destination for logs, as well as features around searching, and alerting based on those logs. There are also many open source projects that have tried to tackle the challenges of standardizing log formats, providing mechanisms for transport, and storing the logs. The following are some of these projects:
- Fluentd – https://www.fluentd.org
- Logstash – https://github.com/elastic/logstash
- Apache Flume – https://flume.apache.org
Centralized logging additionally provides the opportunity to produce metrics about the data across the entire system.
Using metrics and dashboards
Metrics are possibly the most well-known of the tools available in the observability space. Think of the temperature on a thermometer, the speed on the speedometer of a car, or the time on a watch. We humans love measuring and quantifying things. From the early days of computing, being able to keep track of how resources were utilized was critical in ensuring that multi-user environments provided a good user experience for all users of the system.
Nowadays, measuring application and system performance via the collection of metrics is common practice in software development. This data is converted into graphs to generate meaningful visualizations for those in charge of monitoring the health of a system.
These metrics can also be used to configure alerting when certain thresholds have been reached, such as when an error rate becomes greater than an acceptable percentage. In certain environments, metrics are used to automate workflows as a reaction to changes in the system, such as increasing the number of application instances or rolling back a bad deployment. As with logging, over time, many vendors and projects provided their own solutions to metrics, dashboards, monitoring, and alerting. Some of the open source projects that focus on metrics are as follows:
- Prometheus – https://prometheus.io
- StatsD – https://github.com/statsd/statsd
- Graphite – https://graphiteapp.org
- Grafana – https://github.com/grafana/grafana
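As a rough illustration of threshold-based alerting, the following sketch keeps two counters and checks whether the error rate exceeds an acceptable percentage. The counter names and the 5% threshold are hypothetical, and the code does not use the API of any of the projects listed above.

```python
from collections import Counter

ERROR_RATE_THRESHOLD = 0.05  # hypothetical: alert when more than 5% of requests fail

counters = Counter()


def record_request(status_code):
    """Count every request, and separately count server errors (5xx)."""
    counters["requests_total"] += 1
    if status_code >= 500:
        counters["errors_total"] += 1


def error_rate():
    """Return errors / requests, or 0.0 when nothing has been recorded."""
    total = counters["requests_total"]
    return counters["errors_total"] / total if total else 0.0


def should_alert():
    """True when the observed error rate crosses the configured threshold."""
    return error_rate() > ERROR_RATE_THRESHOLD


# Simulate ten requests, two of which fail with a server error.
for status in [200, 200, 500, 200, 503, 200, 200, 200, 200, 200]:
    record_request(status)
```

The same check could just as easily drive an automated workflow, such as adding application instances, instead of paging a person.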
Let's now look at tracing and analysis.
Applying tracing and analysis
Tracing applications means having the ability to run through the application code and ensure it's doing what is expected. This can often, but not always, be achieved in development using a debugger such as GDB (https://www.gnu.org/software/gdb/) or PDB (https://docs.python.org/3/library/pdb.html) in Python. This becomes impossible when debugging an application that is spread across multiple services on different hosts across a network. Researchers at Google published a white paper on a large-scale distributed tracing system built internally: Dapper (https://research.google/pubs/pub36356/). In this paper, they describe the challenges of distributed systems, as well as the approach that was taken to address the problem. This research is the basis of distributed tracing as it exists today. After the paper was published, several open source projects sprang up to provide users with the tools to trace and visualize applications using distributed tracing:
- OpenTracing – https://opentracing.io
- OpenCensus – https://opencensus.io
- Zipkin – https://zipkin.io
- Jaeger – https://www.jaegertracing.io
As you can imagine, with so many tools, it can be daunting to even know where to begin on the journey to making a system observable. Users and organizations must spend time and effort upfront to even get started. This can be challenging when other deadlines are looming. Not only that, but the time investment needed to instrument an application can be significant depending on the complexity of the application, and the return on that investment sometimes isn't made clear until much later. The time and money invested, as well as the expertise required, can make it difficult to change from one tool to another if the initial implementation no longer fits your needs as the system evolves.
Such a wide array of methods, tools, libraries, and standards has also caused fragmentation in the industry and the open source community. This has led to libraries supporting one format or another. This leaves it up to the user to fix any gaps within the environments themselves. This also means there is effort required to maintain feature parity across different projects. All of this could be addressed by bringing the people working in these communities together.
With a better understanding of different tools at the disposal of application developers, their evolution, and their role, we can start to better appreciate the scope of what OpenTelemetry is trying to solve.
Understanding the history of OpenTelemetry
In early 2019, the OpenTelemetry project was announced as a merger of two existing open source projects: OpenTracing and OpenCensus. Although initially, the goal of this endeavor was to bring these two projects together, its ambition to provide an observability framework for cloud-native software goes much further than that. Since OpenTelemetry combines concepts of both OpenTracing and OpenCensus, let's first look at each of these projects individually. Please refer to the following Twitter link, which announced OpenTelemetry by combining both concepts:
The OpenTracing (https://opentracing.io) project, started in 2016, focused on increasing the adoption of distributed tracing as a means for users to better understand their systems. One of the challenges identified by the project was that adoption was difficult because of the cost of instrumentation and the lack of consistent, high-quality instrumentation in third-party libraries. OpenTracing provided a specification for Application Programming Interfaces (APIs) to address this problem. This API could be leveraged independently of the implementation that generated distributed traces, therefore allowing application developers and library authors to embed calls to this API in their code. By default, the API would act as a no-op, meaning those calls wouldn't do anything unless an implementation was configured.
Let's see what this looks like in code. The call to an API to trace a specific piece of code resembles the following example. You'll notice the code is accessing a global variable to obtain a Tracer via the global_tracer method. A Tracer in OpenTracing, and in OpenTelemetry (as we'll discuss later in Chapter 2, OpenTelemetry Signals – Tracing, Metrics, and Logging, and Chapter 4, Distributed Tracing – Tracing Code Execution), is a mechanism used to generate trace data. Using a globally configured tracer means that there's no configuration required in this instrumentation code – it can be done completely separately. The next line starts a primary building block, a span. We'll discuss this further in Chapter 2, OpenTelemetry Signals – Tracing, Metrics, and Logging, but it is shown here to give you an idea of how a Tracer is used in practice:
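```python
import opentracing

# Obtain the globally configured tracer (a no-op unless an implementation is set).
tracer = opentracing.global_tracer()

# Start a span named 'doWork' that stays active for the duration of the block.
with tracer.start_active_span('doWork'):
    pass  # do work
```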
The default no-op implementation meant that code could be instrumented without the authors having to make decisions about how the data would be generated or collected at instrumentation time. It also meant that users of instrumented libraries who didn't want to use distributed tracing in their applications could still use the library without incurring a performance penalty, simply by not configuring a tracer. On the other hand, users who wanted to configure distributed tracing could choose how this information would be generated. The users of these libraries and applications would choose a Tracer implementation and configure it. To comply with the specification, a Tracer implementation only needed to adhere to the API defined (https://github.com/opentracing/opentracing-python/blob/master/opentracing/tracer.py), which includes the following methods:
- Start a new span.
- Inject an existing span's context into a carrier.
- Extract an existing span's context from a carrier.
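The three operations listed above can be sketched with a toy tracer. This is not the OpenTracing API: ToyTracer, SpanContext, and the x-toy-* header names are all hypothetical, and a span is modeled here only by its context to keep the example short. The carrier is a plain dictionary standing in for HTTP headers.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    """The identifiers that must travel between services (simplified)."""
    trace_id: str
    span_id: str


class ToyTracer:
    """Hypothetical tracer exposing start, inject, and extract operations."""

    def start_span(self, name, parent=None):
        # A child span shares its parent's trace ID; a root span gets a new one.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        return SpanContext(trace_id=trace_id, span_id=uuid.uuid4().hex[:16])

    def inject(self, context, carrier):
        # Copy the context into the carrier; header names are made up for this sketch.
        carrier["x-toy-trace-id"] = context.trace_id
        carrier["x-toy-span-id"] = context.span_id

    def extract(self, carrier):
        # Rebuild the context from whatever the remote side injected.
        return SpanContext(carrier["x-toy-trace-id"], carrier["x-toy-span-id"])


tracer = ToyTracer()

# Client side: start a span and copy its context into outgoing headers.
client_span = tracer.start_span("client-request")
headers = {}
tracer.inject(client_span, headers)

# Server side: rebuild the context from the headers and continue the same trace.
remote_context = tracer.extract(headers)
server_span = tracer.start_span("handle-request", parent=remote_context)
```

Because both spans end up with the same trace ID, a backend can later stitch the client and server sides of the request into one trace.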
Along with the specification for this API, OpenTracing also provided semantic conventions. These conventions describe guidelines to improve the quality of the telemetry emitted by instrumented code. We'll discuss semantic conventions further when exploring the concepts of OpenTelemetry.
OpenCensus (https://opencensus.io) started as an internal project at Google, called Census, but was open sourced and gained popularity with a wider community in 2017. The project provided libraries to make the generation and collection of both traces and metrics simpler for application developers. It also provided the OpenCensus Collector, an agent run independently that acted as a destination for telemetry from applications and could be configured to process the data before sending it along to backends for storage and analysis. Telemetry being sent to the collector was transmitted using a wire format specified by OpenCensus. The collector was an especially powerful component of OpenCensus. As shown in Figure 1.3, many applications could be configured to send data to a single destination. That destination could then control the flow of the data without having to modify the application code any further:
The concepts of the API to support distributed tracing in OpenCensus were like those of OpenTracing's API. In contrast to OpenTracing, however, the project provided a tightly coupled API and Software Development Kit (SDK), meaning users could use OpenCensus without having to install and configure a separate implementation. Although this simplified the user experience for application developers, it also meant that in certain languages, the authors of third-party libraries wanting to instrument their code would depend on the SDK and all its dependencies. As mentioned before, OpenCensus also provided an API to generate application metrics. It introduced several concepts that would become influential in OpenTelemetry:
- Measurement: This is the recorded output of a measure, or a generated metric point.
- Measure: This is a defined metric to be recorded.
- Aggregation: This describes how the measurements are aggregated.
- Views: These combine measures and aggregations to determine how the data should be exported.
To collect metrics from their applications, developers defined a measure instrument to record measurements, and then configured a view with an aggregation to emit the data to a backend. The supported aggregations were count, distribution, sum, and last value.
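The relationship between these concepts can be sketched in a few lines. The classes below are hypothetical stand-ins, not the OpenCensus API, and the distribution aggregation is omitted for brevity; the request_latency measure and its values are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measure:
    """A defined metric to be recorded."""
    name: str
    unit: str


# Aggregations: each one reduces a list of measurements to a single value.
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "last_value": lambda values: values[-1],
}


@dataclass
class View:
    """Combines a measure with an aggregation to decide what gets exported."""
    measure: Measure
    aggregation: str
    measurements: list = field(default_factory=list)

    def record(self, value):
        # Each recorded value is one measurement of the measure.
        self.measurements.append(value)

    def export(self):
        return AGGREGATIONS[self.aggregation](self.measurements)


latency = Measure("request_latency", "ms")
views = [View(latency, "count"), View(latency, "sum"), View(latency, "last_value")]

# Record three measurements against every view of the same measure.
for value in [12.0, 30.0, 18.0]:
    for view in views:
        view.record(value)

results = {view.aggregation: view.export() for view in views}
```

The same three measurements produce different exported data depending on the view, which is why the choice of aggregation was left to the person configuring the application rather than the code recording the values.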
As the two projects gained popularity, the pain for users only grew. The existence of both projects meant that it was unclear for users what project they should rely on. Using both together was not easy. One of the core components of distributed tracing is the ability to propagate context between the different applications in a distributed system, and this didn't work out of the box between the two projects. If a user wanted to collect traces and metrics, they would have to use OpenCensus, but if they wanted to use libraries that only supported OpenTracing, then they would have to use both – OpenTracing for distributed traces, and OpenCensus for metrics. It was a mess, and when there are too many standards, the way to solve all the problems is to invent a new standard!
The following XKCD comic captures the sentiment very aptly:
Sometimes a new standard is a correct solution, especially when that solution:
- Is built using the lessons learned from its predecessors
- Brings together the communities behind other standards
- Supersedes two existing competing standards
The OpenCensus and OpenTracing organizers worked together to ensure the new standard would support a migration path for existing users of both communities, allowing the projects to eventually become deprecated. This would also make the lives of users easier by offering a single standard to use when instrumenting applications. There was no longer any need to guess what project to use!
Observability for cloud-native software
OpenTelemetry aims to standardize how applications are instrumented and how telemetry data is generated, collected, and transmitted. It also aims to give users the tools necessary to correlate that telemetry across systems, languages, and applications, to allow them to better understand their software. One of the initial goals of the project involved ensuring all the functionality that was key to both OpenCensus and OpenTracing users would become part of the new project. The focus on pre-existing users also led the project organizers to establish a migration path to ease the transition from OpenTracing and OpenCensus to OpenTelemetry. To accomplish its lofty goals, OpenTelemetry provides the following:
- An open specification
- Language-specific APIs and SDKs
- Instrumentation libraries
- Semantic conventions
- An agent to collect telemetry
- A protocol to organize, transmit, and receive the data
The project kicked off with the initial commit on May 1, 2019, and brought together the leaders from OpenCensus and OpenTracing. The project is governed by a governance committee that holds elections annually, with elected representatives serving on the committee for two-year terms. The project also has a technical committee that oversees the specification, drives project-wide discussion, and reviews language-specific implementations. In addition, there are various special interest groups (SIGs) in the project, focused on features or technologies supported by the project. Each language implementation has its own SIG with independent maintainers and approvers managing separate repositories with tools and processes tailored to the language. The initial work for the project was heavily focused on the open specification. This provides guidance for the language-specific implementations. Since its first commit, the project has received contributions from over 200 organizations, including observability leaders and cloud providers, as well as end users of OpenTelemetry. At the time of writing, OpenTelemetry has implementations in 11 languages and 18 special interest or working groups.
Since the initial merger of OpenCensus and OpenTracing, communities from additional open source projects have participated in OpenTelemetry efforts, including members of the Prometheus and OpenMetrics projects. Now that we have a better understanding of how OpenTelemetry was brought to life, let's take a deeper look at the concepts of the project.
Understanding the concepts of OpenTelemetry
OpenTelemetry is a large ecosystem. Before diving into the code, having a general understanding of the concepts and terminology used in the project will help us. The project is composed of the following:
- Context propagation
Let's look at each of these aspects.
With its goal of providing an open specification for encompassing such a wide variety of telemetry data, the OpenTelemetry project needed to agree on a term to organize the categories of concern. Eventually, it was decided to call these signals. A signal can be thought of as a standalone component that can be configured, providing value on its own. The community decided to align its work into deliverables around these signals to deliver value to its users as soon as possible. The alignment of the work and separation of concerns in terms of signals has allowed the community to focus its efforts. The tracing and baggage signals were released in early 2021, soon followed by the metrics signal. Each signal in OpenTelemetry comes with the following:
- A set of specification documents providing guidance to implementors of the signal
- A data model expressing how the signal is to be represented in implementations
- An API that can be used by application and library developers to instrument their code
- The SDK needed to allow users to produce telemetry using the APIs
- Semantic conventions that can be used to get consistent, high-quality data
- Instrumentation libraries to simplify usage and adoption
The initial signals defined by OpenTelemetry were tracing, metrics, logging, and baggage. Signals are a core concept of OpenTelemetry and, as such, we will become quite familiar with them.
One of the most important aspects of OpenTelemetry is ensuring that users can expect a similar experience regardless of the language they're using. This is accomplished by defining the standards for what is expected of OpenTelemetry-compliant implementations in an open specification. The process used for writing the specification is flexible, but large new features or sections of functionality are often proposed by writing an OpenTelemetry Enhancement Proposal (OTEP). The OTEP is submitted for review and is usually provided along with prototype code in multiple languages, to ensure the proposal isn't too language-specific. Once an OTEP is approved and merged, the writing of the specification begins. The entire specification lives in a repository on GitHub (https://github.com/open-telemetry/opentelemetry-specification) and is open for anyone to contribute or review.
The data model defines the representation of the components that form a specific signal. It provides the specifics of what fields each component must have and describes how all the components interact with one another. This piece of the signal definition is particularly important to give clarity as to what use cases the APIs and SDKs will support. The data model also explains to developers implementing the standard how the data should behave.
Instrumenting applications can be quite expensive, depending on the size of your code base. Providing users with an API allows them to go through the process of instrumenting their code in a way that is vendor-agnostic. The API is decoupled from the code that generates the telemetry, allowing users the flexibility to swap out the underlying implementations as they see fit. This interface can also be relied upon by library and framework authors, and only configured to emit telemetry data by end users who wish to do so. By design, a user who instruments their code by using the API and does not configure the SDK will not see any telemetry produced.
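The decoupling described above can be illustrated with a small sketch. None of these names come from OpenTelemetry: NoOpTracer, RecordingTracer, set_tracer, and get_tracer are hypothetical, and the recording class merely stands in for an SDK.

```python
class NoOpTracer:
    """Default API-only behavior: calls succeed but record nothing."""

    def record(self, name):
        pass


class RecordingTracer:
    """Stand-in for an SDK: the same interface, but data is actually kept."""

    def __init__(self):
        self.finished = []

    def record(self, name):
        self.finished.append(name)


_tracer = NoOpTracer()


def set_tracer(tracer):
    """Called once by the application owner, never by library code."""
    global _tracer
    _tracer = tracer


def get_tracer():
    """Return whichever implementation is currently configured."""
    return _tracer


def library_function():
    # Library authors instrument against the interface only.
    get_tracer().record("library_function")
    return "result"


library_function()   # no implementation configured: nothing is recorded
sdk = RecordingTracer()
set_tracer(sdk)
library_function()   # implementation configured: one operation is recorded
```

The library code is identical in both calls; only the application's configuration decides whether telemetry is produced.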
The SDK does the bulk of the heavy lifting in OpenTelemetry. It implements the underlying system that generates, aggregates, and transmits telemetry data. The SDK provides the controls to configure how telemetry should be collected, where it should be transmitted, and how. Configuration of the SDK is supported via in-code configuration, as well as via environment variables defined in the specification. As it is decoupled from the API, using the SDK provided by OpenTelemetry is an option for users, but it is not required. Users and vendors are free to implement their own SDKs if doing so will better fit their needs.
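As a sketch of what environment-based configuration might look like, the following function reads a few settings from a mapping that defaults to the process environment. The variable names OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, and OTEL_TRACES_EXPORTER are defined in the OpenTelemetry specification; the load_config function, the dictionary keys, and the fallback values are assumptions made for this example and may not match what a given SDK does.

```python
import os


def load_config(environ=os.environ):
    """Read a few settings from the environment, with illustrative fallbacks."""
    return {
        # Logical name of the service producing the telemetry.
        "service_name": environ.get("OTEL_SERVICE_NAME", "unknown_service"),
        # Where telemetry should be transmitted.
        "endpoint": environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        # Which exporter to use for traces.
        "traces_exporter": environ.get("OTEL_TRACES_EXPORTER", "otlp"),
    }


# Passing a dictionary instead of os.environ keeps the example deterministic.
config = load_config({"OTEL_SERVICE_NAME": "checkout"})
```

Configuring through the environment means the same build of an application can send its telemetry to different destinations in different deployments without code changes.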
Producing telemetry can be a daunting task, since you can call anything whatever you wish, but doing so would make analyzing this data difficult. For example, if server A labels the duration of a request http.server.duration and server B labels it http.server.request_length, calculating the total duration of a request across both servers requires additional knowledge of this difference, and likely additional operations. One way in which OpenTelemetry tries to make this a bit easier is by offering semantic conventions, or definitions for different types of applications and workloads to improve the consistency of telemetry. Some of the types of applications or protocols that are covered by semantic conventions include the following:
- Message queues
- Function-as-a-Service (FaaS)
- Remote procedure calls (RPC)
- Process metrics
The full list of semantic conventions is quite extensive and can be found in the specification repository. The following figure shows a sample of the semantic convention for tracing database queries:
The consistency of the telemetry data reported ultimately determines how useful that data is to the people who rely on it. Semantic conventions provide guidelines for both what telemetry should be reported and how to identify this data. They provide a powerful tool for developers to learn their way around observability.
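To see the cost of inconsistent naming, consider the two-server example from earlier in this section. The sketch below shows the extra mapping step an analyst would need before the two durations can be added together; the values and the assumption that both are in milliseconds are hypothetical.

```python
# Durations reported by two servers under inconsistent names (values assumed to be ms).
server_a = {"http.server.duration": 120}
server_b = {"http.server.request_length": 80}

# Without a shared convention, someone has to maintain a mapping like this one.
ALIASES = {"http.server.request_length": "http.server.duration"}


def normalize(record):
    """Rename any known alias to the agreed-upon attribute name."""
    return {ALIASES.get(key, key): value for key, value in record.items()}


# Only after normalizing can the durations be combined under a single name.
total_ms = sum(normalize(r)["http.server.duration"] for r in (server_a, server_b))
```

When every producer follows the same convention, the alias table and the normalization step disappear entirely.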
To ensure users can get up and running quickly, instrumentation libraries are made available by OpenTelemetry SIGs in various languages. These libraries provide instrumentation for popular open source projects and frameworks. For example, in Python, the instrumentation libraries include Flask, Requests, Django, and others. The mechanisms used to implement these libraries are language-specific and may be used in combination with auto-instrumentation to provide users with telemetry with close to zero code changes required. The instrumentation libraries are supported by the OpenTelemetry organization and adhere to semantic conventions.
Signals represent the core of the telemetry data that is generated by instrumenting cloud-native applications. They can be used independently, but the real power of OpenTelemetry is to allow its users to correlate data across signals to get a better understanding of their systems. Now that we have a general understanding of what they are, let's look at the other concepts of OpenTelemetry.
To be useful, the telemetry data captured by each signal must eventually be exported to a data store, where storage and analysis can occur. To accomplish this, each signal implementation offers a series of mechanisms to generate, process, and transmit telemetry. We can think of this as a pipeline, as represented in the following figure:
The components in the telemetry pipeline are typically initialized early in the application code to ensure no meaningful telemetry is missed.
In many languages, the pipeline is configurable via environment variables. This will be explored further in Chapter 7, Instrumentation Libraries.
Once configured, the application generally only needs to interact with the generator to record telemetry, and the pipeline will take care of collecting and sending the data. Let's look at each component of the pipeline now.
The starting point of the telemetry pipeline is the provider. A provider is a configurable factory that is used to give application code access to an entity used to generate telemetry data. Although multiple providers may be configured within an application, a default global provider may also be made available via the SDK. Providers should be configured early in the application code, prior to any telemetry data being generated.
To generate telemetry at different points in the code, the telemetry generator instantiated by a provider is made available in the SDK. This generator is what most users will interact with through the instrumentation of their application and the use of the API. Generators are named differently depending on the signal: the tracing signal calls this a tracer, the metrics signal a meter. Their purpose is generally the same – to generate telemetry data. When instantiating a generator, applications and instrumenting libraries must pass a name to the provider. Optionally, users can specify a version identifier to the provider as well. This information will be used to provide additional information in the telemetry data generated.
Once the telemetry data has been generated, processors provide the ability to further modify the contents of the data. Processors may determine the frequency at which data should be processed or how the data should be exported.
The last step before telemetry leaves the context of an application is to go through the exporter. The job of the exporter is to translate the internal data model of OpenTelemetry into the format expected by the configured destination. Multiple export formats and protocols are supported by the OpenTelemetry project:
- OpenTelemetry protocol
The pipeline allows telemetry data to be produced and emitted. We'll configure pipelines many times over the following chapters, and we'll see how the flexibility provided by the pipeline accommodates many use cases.
At their most basic, resources can be thought of as a set of attributes that are applied to different signals. Conceptually, a resource is used to identify the source of the telemetry data, whether a machine, container, or function. This information can be used at the time of analysis to correlate different events occurring in the same resource. Resource attributes are added to the telemetry data from signals at export time, before the data is emitted to a backend. Resources are typically configured at the start of an application and are associated with the providers. They tend not to change throughout the lifetime of the application. Some typical resource attributes would include the following:
- A unique name for the service:
- The version identifier for a service:
- The name of the host where the service is running:
Additionally, the specification defines resource detectors to further enrich the data. Although resources can be set manually, resource detectors provide convenient mechanisms to automatically populate environment-specific data. For example, the Google Cloud Platform (GCP) resource detector (https://www.npmjs.com/package/@opentelemetry/resource-detector-gcp) interacts with the Google API to fill in the following data:
Resources and resource detectors adhere to semantic conventions. Resources are a key component in making telemetry data-rich, meaningful, and consistent across an application. Another important aspect of ensuring the data is meaningful is context propagation.
One area of observability that is particularly powerful and challenging is context propagation. A core concept of distributed tracing, context propagation provides the ability to pass valuable contextual information between services that are separated by a logical boundary. Context propagation is what allows distributed tracing to tie requests together across multiple systems. OpenTelemetry, as OpenTracing did before it, has made this a core component of the project. In addition to tracing, context propagation allows for user-defined values (known as baggage) to be propagated. Baggage can be used to annotate telemetry across signals.
The OpenTelemetry specification defines a context API to support context propagation. This API is independent of the signals that may use it. Some languages already have built-in context mechanisms, such as the contextvars module in Python 3.7+ and the context package in Go. The specification recommends that context API implementations leverage these existing mechanisms. OpenTelemetry also provides the interfaces and implementations required to propagate context across boundaries. The following abbreviated code shows how two services, A and B, would use the context API to share context:
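```python
from opentelemetry import context
from opentelemetry.propagate import extract, inject


class ServiceA:
    def client_request(self):
        headers = {}
        # write the current context into the outgoing request headers
        inject(headers, context=context.get_current())
        # make a request to ServiceB and pass in headers
        return headers


class ServiceB:
    def handle_request(self, headers):
        # receive a request from ServiceA and read the propagated context
        return extract(headers)
```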
In Figure 1.6, we can see a comparison between two requests from service A to service B. The top request is made without propagating the context, with the result that service B has neither the trace information nor the baggage that service A does. In the bottom request, this contextual data is injected when service A makes a request to service B, and extracted by service B from the incoming request, ensuring service B now has access to the propagated data:
The propagation of context we have demonstrated allows backends to tie the two sides of the request together, but it also allows service B to make use of the data set in service A. The challenge with context propagation is that when it isn't working, it's hard to know why. The context may not be propagated correctly because of a configuration issue, or possibly a networking problem. This is a concept we'll revisit many times throughout the book.
In this chapter, we've looked at what observability is and the challenges it can help solve in cloud-native applications. By exploring the different mechanisms available to generate telemetry and improve the observability of applications, we were also able to gain an understanding of how the observability landscape has evolved, as well as where some challenges remain.
Exploring the history behind the OpenTelemetry project gave us an understanding of the origin of the project and its goals. We then familiarized ourselves with the components forming tracing, metrics, logging signals, and pipelines to give us the terminology and building blocks needed to start producing telemetry using OpenTelemetry. This learning will allow us to tackle the first challenge of observability – producing high-quality telemetry. Understanding resources and context propagation will help us correlate events across services and signals to allow us to tackle the second challenge – connecting the data to better understand systems.
Let's now take a closer look at how this all works together in practice. In the next chapter, we will dive deeper into the concepts of distributed tracing, metrics, logs, and semantic conventions by launching a grocery store application instrumented with OpenTelemetry. We will then explore the telemetry generated by this distributed system.