Home

Observability with Grafana

By Rob Chapman , Peter Holmes

Book

eBook $39.99 $27.98

Print $49.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $39.99 $27.98

Print $49.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

To overcome application monitoring and observability challenges, Grafana Labs offers a modern, highly scalable, cost-effective Loki, Grafana, Tempo, and Mimir (LGTM) stack along with Prometheus for the collection, visualization, and storage of telemetry data. Beginning with an overview of observability concepts, this book teaches you how to instrument code and monitor systems in practice using standard protocols and Grafana libraries. As you progress, you’ll create a free Grafana cloud instance and deploy a demo application to a Kubernetes cluster to delve into the implementation of the LGTM stack. You’ll learn how to connect Grafana Cloud to AWS, GCP, and Azure to collect infrastructure data, build interactive dashboards, make use of service level indicators and objectives to produce great alerts, and leverage the AI & ML capabilities to keep your systems healthy. You’ll also explore real user monitoring with Faro and performance monitoring with Pyroscope and k6. Advanced concepts like architecting a Grafana installation, using automation and infrastructure as code tools for DevOps processes, troubleshooting strategies, and best practices to avoid common pitfalls will also be covered. After reading this book, you’ll be able to use the Grafana stack to deliver amazing operational results for the systems your organization uses.

Publication date:: January 2024
Publisher: Packt
Pages: 356
ISBN: 9781803248004
Download code from GitHub

Introducing Observability and the Grafana Stack

The modern computer systems we work with have moved from the realm of complicated into the realm of complex, where the number of interacting variables make them ultimately unknowable and uncontrollable. We are using the terms complicated and complex as per system theory. A complicated system, like an engine, has clear causal relationships between components. A complex system, such as the flowing of traffic in a city, shows emergent behavior from the interactions of its components.

With the average cost of downtime estimated to be $9,000 per minute by Ponemon Institute in 2016, this complexity can cause significant financial loss if organizations do not take steps to manage this risk. Observability offers a way to mitigate these risks, but making systems observable comes with its own financial risks if implemented poorly or without a clear business goal.

In this book, we will give you a good understanding of what observability is and who the customers who might use it are. We will explore how to use the tools available from Grafana Labs to gain visibility of your organization. These tools include Loki, Prometheus, Mimir, Tempo, Frontend Observability, Pyroscope, and k6. You will learn how to use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to obtain clear transparent signals of when a service is operating correctly, and how to use the Grafana incident response tools to handle incidents. Finally, you will learn about managing their observability platform using automation tools such as Ansible, Terraform, and Helm.

This chapter aims to introduce observability to all audiences, using examples outside of the computing world. We’ll introduce the types of telemetry used by observability tools, which will give you an overview of how to use them to quickly understand the state of your services. The various personas who might use observability systems will be outlined so that you can explore complex ideas later with a clear grounding on who will benefit from their correct implementation. Finally, we’ll investigate Grafana’s Loki, Grafana, Tempo, Mimir (LGTM) stack, how to deploy it, and what alternatives exist.

In this chapter, we’re going to cover the following main topics:

Observability in a nutshell
Telemetry types and technologies
Understanding the customers of observability
Introducing the Grafana stack
Alternatives to the Grafana stack
Deploying the Grafana stack

Observability in a nutshell

The term observability is borrowed from control theory. It’s common to use the term interchangeably with the term monitoring in IT systems, as the concepts are closely related. Monitoring is the ability to raise an alarm when something is wrong, while observability is the ability to understand a system and determine whether something is wrong, and why.

Control theory was formalized in the late 1800s on the topic of centrifugal governors in steam engines. This diagram shows a simplified view of such a system:

Figure 1.1 – James Watt’s steam engine flyweight governor (source: https://www.mpoweruk.com)

Steam engines use a boiler to boil water in a pressure vessel. The steam pushes a piston backward and forward, which converts heat energy to reciprocal energy. In steam engines that use centrifugal governors, this reciprocal energy is converted to rotational energy via a wheel connected to a piston. The centrifugal governor provides a physical link backward through the system to the throttle. This means that the speed of rotation controls the throttle, which, in turn, controls the speed of rotation. Physically, this is observed by the balls on the governor flying outward and dropping inward until the system reaches equilibrium.

Monitoring defines the metrics or events that are of interest in advance. For instance, the governor measures the pre-defined metric of drive shaft revolutions. The controllability of the throttle is then provided by the pivot and actuator rod assembly. Assuming the actuator rod is adjusted correctly, the governor should control the throttle from fully open to fully closed.

In contrast, observability is achieved by allowing the internal state of the system to be inferred from its external outputs. If the operating point adjustment is incorrectly set, the governor may spin too fast or too slowly, rendering the throttle control ineffective. A governor spinning too fast or too slowly could also indicate that the sliding ring is stuck in place and needs oiling. Importantly, this insight can be gained without defining in advance what too fast or too slow means. The insight that the governor is spinning too fast or too slowly also needs very little knowledge of the full steam engine.

Fundamentally, both monitoring and observability are used to improve the reliability and performance of the system in question.

Now that we have introduced the high-level concepts, let’s explore a practical example outside of the world of software services.

Case study – A ship passing through the Panama Canal

Let’s imagine a ship traversing the Agua Clara locks on the Panama Canal. This can be illustrated using the following figure:

Figure 1.2 – The Agua Clara locks on the Panama Canal

There are a few aspects of these locks that we might want to monitor:

The successful opening and closing of each gate
The water level inside each lock
How long it takes for a ship to traverse the locks

Monitoring these aspects may highlight situations that we need to be alerted about:

A gate is stuck open because of a mechanical failure
The water level is rapidly descending because of a leak
A ship is taking too long to exit the locks because it is stuck

There may be situations where the data we are monitoring are within acceptable limits, but we can still observe a deviation from what is considered normal, which should prompt further action:

A small leak has formed near the top of the lock wall:
- We would see the water level drop but only when it is above the leak
- This could prompt maintenance work on the lock wall
A gate in one lock is opening more slowly because it needs maintenance:
- We would see the time between opening and closing the gate increase
- This could prompt maintenance on the lock gate
Ships take longer to traverse the locks when the wind is coming from a particular direction:
- We could compare hourly average traversal rates
- This could prompt work to reduce the impact of wind from one direction

Now that we’ve seen an example of measuring a real-world system, we can group these types of measurements into different data types to best suit the application. Let’s introduce those now.

Telemetry types and technologies

The boring but important part of observability tools is telemetry – capturing data that is useful, shipping it from place to place, and producing visualizations, alerts, and reports that offer value to the organization.

Three main types of telemetry are used to build monitoring and observability systems – metrics, logs, and distributed traces. Other telemetry types may be used by some vendors and in particular circumstances. We will touch on these here, but they will be explored in more detail in Chapters 12 and 13 of this book.

Metrics

Metrics can be thought of as numeric data that is recorded at a point in time and enriched with labels or dimensions to enable analysis. Metrics are frequently generated and are easy to search, making them ideal for determining whether something is wrong or unusual. Let’s look at an example of metrics showing temporal changes:

Figure 1.3 – Metrics showing changes over time

Taking our example of the Panama Canal, we could represent the water level in each lock as a metric, to be measured at regular intervals. To be able to use the data effectively, we might add some of these labels:

The lock name: Agua Clara
The lock chamber: Lower lock
The canal: Panama Canal

Logs

Logs are considered to be unstructured string data types. They are recorded at a point in time and usually contain a huge amount of information about what is happening. While logs can be structured, there is no guarantee of that structure persisting, because the log producer has control over the structure of the log. Let’s look at an example:

Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen
Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors
Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully
Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors
Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete

In our example, the various operations involved in opening or closing a lock gate could be represented as logs.

Almost every system produces logs, and they often give very detailed information. This is great for understanding what happened. However, the volume of data presents two problems:

Searching can be inefficient and slow.
As the data is in text format, knowing what to search for can be difficult. For example, error occurred, process failed, and action did not complete successfully could all be used to describe a failure, but there are no shared strings to search for.

Let’s consider a real log entry from a computer system to see how log data is usually represented:

Figure 1.4 – Logs showing discrete events in time

We can clearly see that we have a number of fields that have been extracted from the log entry by the system. These fields detail where the log entry originated from, what time it occurred, and various other items.

Distributed traces

Distributed traces show the end-to-end journey of an action. They are captured from every step that is taken to complete the action. Let’s imagine a trace that covers the passage of a ship through the lock system. We will be interested in the time a ship enters and leaves each lock, and we will want to be able to compare different ships using the system. A full passage can be given an identifier, usually called a trace ID. Traces are made up of spans. In our example, a span would cover the entry and exit for each individual lock. These spans are given a second identifier, called a span ID. To tie these two together, each span in a trace references the trace ID for the whole trace. The following screenshot shows an example of how a distributed trace is represented for a computer application:

Figure 1.5 – Traces showing the relationship of actions over time

Now that we have introduced metrics, logs, and traces, let’s consider a more detailed example of a ship passing through the locks, and how each telemetry type would be produced in this process:

Ship enters the first lock:
- Span ID created
- Trace ID created
- Contextual information is added to the span, for example, a ship identification
- Key events are recorded in the span with time stamps, for example, gates are opened and closed
Ship exits the first lock:
- Span closed and submitted to the recording system
- Second lock notified of trace ID and span ID
Ship enters the second lock:
- Span ID created
- Trace ID added to span
- Contextual information is added to the span
- Key events recorded in the span with time stamps
Ship exits the second lock:
- Span closed and submitted to the recording system
- Third lock notified of trace ID and span ID
Ship enters the third lock:
- Repeat step 3
Ship exits the third lock:
- Span closed and submitted to the recording system

Now let’s look at some other telemetry types.

Other telemetry types

Metrics, logs, and traces are often called the three pillars or the golden triangle of observability. As we outlined earlier, observability is the ability to understand a system. While metrics, logs, and traces give us a very good ability to understand a system, they are not the only signals we might need, as this depends at what abstraction layer we need to observe the system. For instance, when looking at a very detailed level, we may be very interested in the stack trace of an application’s activity at the CPU and RAM level. Conversely, if we are interested in the execution of a CI/CD pipeline, we may just be interested in whether a deployment occurred and nothing more.

Profiling data (stack traces) can give us a very detailed technical view of the system’s use of resources such as CPU cycles or memory. With cloud services often charged per hour for these resources, this kind of detailed analysis can easily create cost savings.

Similarly, events can be consumed from a platform, such as CI/CD. These can offer a huge amount of insight that can reduce the Mean Time to Recovery (MTTR). Imagine responding to an out-of-hours alert and seeing that a new version of a service was deployed immediately before the issues started occurring. Even better, imagine not having to wake up because the deployment process could check for failures and roll back automatically. Events differ from logs only in that an event represents a whole action. In our earlier example in the Logs section, we created five logs, but all of these referred to stages of the same event (opening the lock gate). As a relatively generic term, event gets used with other meanings.

Now that we’ve introduced the fundamental concepts of the technology, let’s talk about the customers who will use observability data.

Introducing the user personas of observers

Observability deals with understanding a system, identifying whether something is wrong with that system, and understanding why it is wrong. But what do we mean by understanding a system? The simple answer would be knowing the state of a single application or infrastructure component.

In this section, we will introduce the user personas that we will use throughout this book. These personas will help to distinguish the different types of questions that people use observability systems to answer.

Let’s take a quick look at the user personas that will be used throughout the book as examples, and their roles:

	Name and role	Description
	Diego Developer	Frontend, backend, full stack, and so on
	Ophelia Operator	SRE, DevOps, DevSecOps, customer success, and so on
	Steven Service	Service manager and other tasks
	Pelé Product	Product manager, product owner, and so on
	Masha Manager	Manager, senior leadership, and so on

Table 1.1 – User persona introductions

Now let’s look at each of these users in greater detail.

Diego Developer

Diego Developer works on many types of systems, from frontend applications that customers directly interact with, to backend systems that let his organization store data in ways that delight its customers. You might even find him working on platforms that other developers use to get their applications integrated, built, delivered, and deployed safely and speedily.

Goals

He writes great software that is well tested and addresses customers’ actual needs.

Interactions

When he is not writing code, he works with Ophelia Operator to address any questions and issues that occur.

Pelé Product works in his team and provides insight into the customer’s needs. They work together closely, taking those needs and turning them into detailed plans on how to deliver software that addresses them.

Steven Service is keen to ensure that the changes Diego makes are not impacting customer commitments. He’s also the one who wakes Diego up if there is an incident that needs attention. The data provided to Masha Manager gives her a breakdown of costs. When Diego is working on developer platforms, he also collects data that helps her get investment from the business into teams that are not performing as expected.

Needs

Diego really needs easy-to-use libraries for the languages he uses to instrument the code he produces. He does not have time to become an expert. He wants to be able to add a few lines of code and get results quickly.

Having a clear standard for acceptable performance measures makes it easy for him to get the right results.

Pain points

When Diego’s systems produce too much data, he finds it difficult to sort signal from noise. He also gets frustrated having to change his code because of an upstream decision to change tooling.

Ophelia Operator

Ophelia Operator works in an operations-focused environment. You might find her in a customer-facing role or as part of a development team as a DevOps engineer. She could be part of a group dedicated to the reliability of an organization’s systems, or she could be working in security or finance to ensure the business runs securely and smoothly.

Goals

Ophelia wants to make sure a product is functioning as expected. She also likes it when she is not woken up early in the morning by an incident.

Interactions

Ophelia will work a lot with Diego Developer; sometimes it’s escalating customer tickets when she doesn’t have the data available to understand the problem; at other times it’s developing runbooks to keep the systems running. Sometimes she will need to give Diego clear information on acceptable performance measures so that her team can make sure systems perform well for customers.

Steven Service works closely with Ophelia. They work together to ensure there are not many incidents, and that they are quickly resolved. Steven makes sure that business data on changes and incidents is tracked, and tweaks processes when things aren’t working.

Pelé Product likes to have data showing the problematic areas of his products.

Needs

Good data is necessary to do the job effectively. Being able to see that a customer has encountered an error can make the difference between resolving a problem straight away or having them wait maybe weeks for a response.

During an incident seeing that a new version of a service was deployed at the time a problem started can change an hours-long incident into a brief blip, and keep customers happy.

Pain points

Getting continuous alerts but not being empowered to fix the underlying issue is a big problem. Ophelia has seen colleagues burn out, and it makes her want to leave the organization when this happens.

Steven Service

Steven Service works in service delivery. He is interested in making sure the organization’s services are delivered smoothly. Jumping in on critical incidents and coordinating actions to get them resolved as quickly as possible is part of the job. So is ensuring that changes are made using processes that help others do it as safely as possible. Steven also works with third parties who provide services that are critical to the running of the organization.

Goals

He wants services to run as smoothly as possible so that the organization can spend more time focused on customers.

Interactions

Diego Developer and Ophelia Operator work a lot with the change management processes created by Steven and the support processes he manages. Having accurate data to hand during change management really helps to make the process as smooth as possible.

Steven works very closely with Masha Manager to make sure she has access to data showing where processes are working smoothly and where they need to spend time improving them.

Needs

He needs to be able to compare the delivery of different products and provide that data to Masha and the business.

During incidents, he needs to be able to get the right people on the call as quickly as possible and keep a record of what happened for the incident post-mortem.

Pain points

Being able to identify the right person to get on a call during an incident is a common problem he faces. Seeing incidents drag on while different systems are compared and who can fix the problem is argued about is also a big concern to him.

Pelé Product

Pelé Product works in the product team. You’ll find him working with customers to understand their needs, keeping product roadmaps in order, and communicating requirements back to developers such as Diego Developer so they can build them. You might also find him understanding and shaping the product backlog for the internal platforms used by developers in the organization.

Goal

Pelé wants to understand customers, give them products that delight them, and keep them coming back.

Interactions

He spends a lot of time working with Diego when they can look at the same information to really understand what customers are doing and how they can help them do it better.

Ophelia Operator and Steven Service help Pelé keep products on track. If too many incidents occur, they ask everyone to refocus on getting stability right. There is no point in providing customers with lots of features on a system that they can’t trust.

Pelé works closely with Masha Manager to ensure the organization has the right skills in the teams that build products. The business depends on her leadership to make sure that these developers have the best tools to help them get their code live in front of customers where it can be used.

Needs

Pelé needs to be able to understand customers’ pain points even when they do not articulate them clearly during user research.

He needs data that gives him a common language with Diego and Ophelia. Sometimes they can get too focused on specific numbers such as shaving off a couple of milliseconds from a request, when improving a poor workflow would improve the customer experience more significantly.

Pain points

Pelé hates not being able to see at a high level what customers are doing. Understanding which bits of an application have the most usage, and which bits are not used at all, lets him know where to focus time and resources.

While customers never tell him they want stability, if it’s not there they will lose trust very quickly and start to look at alternatives.

Masha Manager

Masha works in management. You might find her leading a team and working closely with them daily. She also represents middle management, setting strategy and making tactical choices, and she is involved, to some extent, in senior leadership. Much of her role involves managing budgets and people. If something can make that process easier, then she is usually interested in hearing about it. What Masha does not want to do is waste the organization’s money, because that can directly impact jobs.

Goals

Her primary goals are to keep the organization running smoothly and ensure the budget is balanced.

Interactions

As a leader, Masha needs accurate data and needs to be able to trust the teams who provide that data. The data could be the end-to-end cycle time of feature concept to delivery from Pelé Product, the lead time for changes from Diego Developer, or even the MTTR from Steven Service. Having that data helps her to understand where focus and resources can have the biggest impact.

Masha works regularly with the financial operations staff and needs to make sure they have accurate information on the organization’s expenditure and the value that expenditure provides.

Needs

She needs good data in a place where she can view it and make good decisions. This usually means she consumes information from a business intelligence system. To use such tools effectively, she needs to be clear on what the organization’s goals are, so that the correct data can be collected to help her understand how her teams are tracking to that goal.

She also needs to know that the teams she is responsible for have the correct data and tools to excel in their given areas.

Pain points

High failure rates and long recovery time usually result in her having to speak with customers to apologize. Masha really hates these calls!

Poor visibility of cloud systems is a particular concern. Masha has too many horror stories of huge overspending caused by a lack of monitoring; she would rather spend that budget on something more useful.

You now know about the customers who use observability data, and the types of data you will be using to meet their needs. As the main focus of this book is on Grafana as the underlying technology, let’s now introduce the tools that make up the Grafana stack.

Introducing the Grafana stack

Grafana was born in 2013 when a developer was looking for a new user interface to display metrics from Graphite. Initially forked from Kibana, the Grafana project was developed to make it easy to build quick, interactive dashboards that were valuable to organizations. In 2014, Grafana Labs was formed with the core value of building a sustainable business with a strong commitment to open source projects. From that foundation, Grafana has grown into a strong company supporting more than 1 million active installations. Grafana Labs is a huge contributor to open source projects, from their own tools to widely adopted technologies such as Prometheus, and recent initiatives with a lot of traction such as OpenTelemetry.

Grafana offers many tools, which we’ve grouped into the following categories:

The core Grafana stack: LGTM and the Grafana Agent
Grafana enterprise plugins
Incident response tools
Other Grafana tools

Let’s explore these tools in the following sections.

The core Grafana stack

The core Grafana stack consists of Mimir, Loki, Tempo, and Grafana; the acronym LGTM is often used to refer to this tech stack.

Mimir

Mimir is a Time Series Database (TSDB) for the storage of metric data. It uses low-cost object storage such as S3, GCS, or Azure Blob Storage. First announced for general availability in March 2022, Mimir is the newest of the four products we’ll discuss here, although it’s worth highlighting that Mimir initially forked from another project, Cortex, which was started in 2016. Parts of Cortex also form the core of Loki and Tempo.

Mimir is a fully Prometheus-compatible solution that addresses the common scalability problems encountered with storing and searching huge quantities of metric data. In 2021 Mimir was load tested to 1 billion active time series. An active time series is a metric with a value and unique labels that has reported a sample in the last 20 minutes. We will explore Mimir and Prometheus in much greater detail in Chapter 5.

Loki

Loki is a set of components that offer a full feature logging stack. Loki uses lower-cost object storage such as S3 or GCS, and only indexes label metadata. Loki entered general availability in November 2019.

Log aggregation tools typically use two data structures to store log data. An index that contains references to the location of the raw data paired with searchable metadata, and the raw data itself stored in a compressed form. Loki differs from a lot of other log aggregation tools by keeping the index data relatively small and scaling the search functionality by using horizontal scaling of the querying component. The process of selecting the best index fields is one we will cover in Chapter 4.

Tempo

Tempo is a storage backend for high-scale distributed trace telemetry, with the aim of sampling 100% of the read path. Like Loki and Mimir, it leverages lower-cost object storage such as S3, GCS, or Azure Blob Storage. Tempo went into general availability in June 2021.

When Tempo released 1.0, it was tested at a sustained ingestion of >2 million spans per second (about 350 MB per second). Tempo also offers the ability to generate metrics from spans as they are ingested; these metrics can be written to any backend that supports Prometheus remote write. Tempo is explored in detail in Chapter 6.

Grafana

Grafana has been a staple for fantastic visualization of data since 2014. It has targeted the ability to connect to a huge variety of data sources from TSDBs to relational databases and even other observability tools. Grafana has over 150 data source plugins available. Grafana has a huge community using it for many different purposes. This community supports over 6,000 dashboards, which means there is a starting place for most available technologies with minimal time to value.

Grafana Agent

Collecting telemetry from many places is one of the fundamental aspects of observability. Grafana Agent is a collection of tools for collecting logs, metrics, and traces. There are many other collection tools that Grafana integrates well with. Different collection tools offer different advantages and disadvantages, which is not a topic we will explore in this book. We will highlight other tools in the space later in this chapter and in Chapter 2 to give you a starting point for learning more about this topic. We will also briefly discuss architecting a collection infrastructure in Chapter 11.

The Grafana stack is a fantastic group of open source software for observability. The commitment of Grafana Labs to open source is supported by great enterprise plugins. Let’s explore them now.

Grafana Enterprise plugins

As part of their Cloud Pro, Cloud Advanced, and Enterprise license offerings, Grafana offers Enterprise plugins. These are part of any paid subscription to Grafana.

The Enterprise data source plugins allow organizations to read data from many other storage tools they may use, from software development tools such as GitLab and Azure DevOps to business intelligence tools such as Snowflake, Databricks, and Looker. Grafana also offers tools to read data from many other observability tools, which enables organizations to build comprehensive operational coverage while offering individual teams a choice of the tools they use.

Alongside the data source plugins, Grafana offers premium tools for logs, metrics, and traces. These include access policies and tokens for log data to secure sensitive information, in-depth health monitoring for the ingest and storage of cloud stacks, and management of tenants.

Grafana incident response and management

Grafana offers three products in the incident response and management (IRM) space:

At the foundation of IRM are alerting rules, which can notify via messaging apps, email, or Grafana OnCall
Grafana OnCall offers an on-call schedule management system that centralizes alert grouping and escalation routing
Finally, Grafana Incident offers a chatbot functionality that can set up necessary incident spaces, collect timelines for a post-incident review process, and even manage the incident directly from a messaging service

These tools are covered in more detail in Chapter 9. Now let’s take a look at some other important Grafana tools.

Other Grafana tools

Grafana Labs continues to be a leader in observability and has acquired several companies in this space to release new products that complement the tools we’ve already discussed. Let’s discuss some of these tools now.

Faro

Grafana Faro is a JavaScript agent that can be added to frontend web applications. The project allows for real user monitoring (RUM) by collecting telemetry from a browser. By adding RUM into an environment where backend applications and infrastructure are instrumented, observers gain the ability to traverse data from the full application stack. Faro supports the collection of the five core web vitals out of the box, as well as several other signals of interest. Faro entered general availability in November 2022. We cover Faro in more detail in Chapter 12.

k6

k6 is a load testing tool that provides both a packaged tool to run in your own infrastructure and a cloud Software as a Service (SaaS) offering. Load testing, especially as part of a CI/CD pipeline, really enables teams to see how their application will perform under load, and evaluate optimizations and refactoring. Paired with other detailed analysis tools such as Pyroscope, the level of visibility and accessibility to non-technical members of the team can be astounding. The project started back in 2016 and was acquired by Grafana Labs in June 2021. The goal of k6 is to make performance testing easy and repeatable. We’ll explore k6 in Chapter 13.

Pyroscope

Pyroscope is a recent acquisition of Grafana Labs, joining in March 2023. Pyroscope is a tool that enable teams to engage in the continuous profiling of system resource use by applications (CPU, memory, etc.). Pyroscope advertises that with a minimal overhead of ~2-5% of performance, they can collect samples as frequently as every 10 seconds. Phlare is a Grafana Labs project started in 2022, and the two projects have now merged. We discuss Pyroscope in more detail in Chapter 13.

Now that you know the different tools available from Grafana Labs, let’s look at some alternatives that are available.

Alternatives to the Grafana stack

The monitoring and observability space is packed with different open and closed source solutions such as ps and top going back to the 70s and 80s. We will not attempt to list every tool here; we aim to offer a source of inspiration for people who are curious and want to explore, or who need a quick reference of the available tools (as the authors have on a few occasions).

Data collection

These are agent tools that can be used to collect telemetry from the source:

Tool Name	Telemetry Types
OpenTelemetry Collector	Metrics, logs, traces
FluentBit	Metrics, logs, traces
Vector	Metrics, logs, traces
Vendor-specific agents (See the Data storage, processing, and visualization section for an expanded list)	Metrics, logs, traces
Beats family	Metrics, logs
Prometheus	Metrics
Telegraf	Metrics
StatsD	Metrics
Collectd	Metrics
Carbon	Metrics
Syslog-ng	Logs
Rsyslog	Logs
Fluentd	Logs
Flume	Logs
Zipkin Collector	Traces

Table 1.2 – Data collection tools

Data collection is only one piece of the extract transform and load process for observability data. The next section introduces tools to transform and load data.

Data storage, processing, and visualization

We’ve grouped data processing, storage, and visualization together, as there are often a lot of crossovers among them. There are certain tools that also provide security monitoring and are closely related. However, as this topic is outside of the scope of this book, we have chosen to exclude tools that are solely in the security space.

Tool Name	Tool Name	Tool Name
AppDynamics	InfluxDB	Sematext
Aspecto	Instana	Sensu
AWS CloudWatch & CloudTrail	Jaeger	Sentry
Azure Application insights	Kibana	Serverless360
Centreon	Lightstep	SigNoz
ClickHouse	Loggly	SkyWalking
Coralogix	LogicMonitor	Solarwinds
Cortex	Logtail	Sonic
Cyclotron	Logz.io	Splunk
Datadog	Mezmo	Sumo Logic
Dynatrace	Nagios	TelemetryHub
Elastic	NetData	Teletrace
GCP Cloud Operations Suite	New Relic	Thanos
Grafana Labs	OpenSearch	Uptrace
Graphite	OpenTSDB	VictoriaMetrics
Graylog	Prometheus	Zabbix
Honeycomb	Scalyr	Zipkin

Table 1.3 – Data storage processing and visualization tools

With a good understanding of the tools available in this space, let’s now look at the ways we can deploy the tools offered by Grafana.

Deploying the Grafana stack

Grafana Labs fully embraces its history as an open source software provider. The LGTM stack, alongside most other components, is open source. There are a few value-added components that are offered as part of an enterprise subscription.

As a SaaS offering, Grafana Labs provides access to storage for Loki, Mimir, and Tempo, alongside Grafana’s 100+ integrations for external data sources. As a SaaS customer, you also gain ready access to a huge range of other tools you may use and can present them in a consolidated manner, in a single pane of glass. The SaaS offering allows organizations to leverage a full-featured observability platform without the operational overhead of running the platform and obtaining service level agreements for the operation of the platform.

As well as managing the platform for you, you can run Grafana on your organization’s infrastructure. Grafana offers its software packaged in several formats for Linux and Windows deployments, as well as offering containerized versions. Grafana also offers Helm and Tanka configuration wrappers for each of their tools. This book will mainly concentrate on the SaaS offering because it is easy to get started with the free tier. We will explore some areas where a local installation can assist in Chapters 11 and 14, which cover architecting and supporting DevOps processes respectively.

Summary

In this chapter, you have been introduced to monitoring and observability, how they are similar, and how they differ. The Agua Clara locks on the Panama Canal acted as a simplified example of the concepts of observability in practice. The key takeaway should be to understand that even when a system produces alerts for significant problems, the same data can be used to observe and investigate other potential problems.

We also talked about the customers who might use observability systems. These customers will be referenced throughout this book when we explore a concept and how to target its implementation.

Finally, we introduced the full Grafana Labs stack, and you should now have a good understanding of the different purposes that each product serves.

In the next chapter, we will introduce the basics of adding instrumentation to applications or infrastructure components for readers whose roles are similar to those of Diego and Ophelia.

About the Authors

Rob Chapman

Rob Chapman is a creative IT engineer and founder at The Melt Cafe, with two decades of experience in the full application life cycle. Working over the years for companies such as the Environment Agency, BT Global Services, Microsoft, and Grafana, Rob has built a wealth of experience on large complex systems. More than anything, Rob loves saving energy, time, and money and has a track record for bringing production-related concerns forward so that they are addressed earlier in the development cycle, when they are cheaper and easier to solve. In his spare time, Rob is a Scout leader, and he enjoys hiking, climbing, and, most of all, spending time with his family and six children.
Browse publications by this author
Peter Holmes

Peter Holmes is a senior engineer with a deep interest in digital systems and how to use them to solve problems. With over 16 years of experience, he has worked in various roles in operations. Working at organizations such as Boots UK, Fujitsu Services, Anaplan, Thomson Reuters, and the NHS, he has experience in complex transformational projects, site reliability engineering, platform engineering, and leadership. Peter has a history of taking time to understand the customer and ensuring Day-2+ operations are as smooth and cost-effective as possible.
Browse publications by this author