Extending Prometheus Globally with Thanos

The remote storage systems that we covered in the previous chapter are not for everyone. Perhaps you are perfectly happy to run your Prometheus instances globally without any centralized place where you’re aggregating those metrics. However, it sure would be nice to have some way to run queries from a centralized place and fan them out to all of your Prometheus instances… Good news! Thanos can do that, and more!

Thanos is less of a pre-built, comprehensive solution in the way that VictoriaMetrics or Mimir are, and more of a Swiss Army knife of à la carte components that can be mixed and matched to fit your specific use case. At the time of writing, the Thanos project comprises seven different components, and you can run as few or as many of them as you need.

In this chapter, we’ll cover them all in these main topics:

  • Overview of Thanos
  • Thanos Sidecar
  • Thanos Compactor
  • Thanos Query
  • Thanos Query Frontend...

Technical requirements

For this chapter, we’ll continue building on the Prometheus environment we deployed to Kubernetes in Chapter 2. Consequently, you’ll need the following two tools installed:

  • kubectl
  • helm

This chapter’s code is available at https://github.com/PacktPublishing/Mastering-Prometheus.

Overview of Thanos

The Thanos project began at Improbable and was spearheaded by two stalwarts of the Prometheus ecosystem, Bartłomiej Płotka and Fabian Reinartz, both of whom have also been significant contributors to Prometheus itself and to various other Prometheus-related projects. After a short period, the project was donated to the Cloud Native Computing Foundation (CNCF), where it is now designated as an “incubating” project.

The three stated goals of the Thanos project are as follows:

  • Global query view of metrics
  • Unlimited retention of metrics
  • High availability of components, including Prometheus

The core of Thanos (its original components) consists of Thanos Sidecar, Thanos Store, Thanos Compact, and Thanos Query. Each component is a subcommand of the thanos binary, so there’s no need to download and deploy separate executables for each component. Through each of its various components, Thanos enables distributed querying...
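
Since everything ships in that one binary, you can explore each component directly from your shell. For instance (assuming the thanos binary is on your PATH):

$ thanos --help             # lists every available subcommand
$ thanos sidecar --help     # flags for the Sidecar component
$ thanos query --help       # flags for the Query component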

Thanos Sidecar

Thanos Sidecar is the most fundamental Thanos component, enabling Thanos’s two most popular features: querying multiple Prometheus instances from a centralized location (Thanos Query) and backing up Prometheus TSDB data in an object storage backend.

Thanos Sidecar works by running alongside the Prometheus server as a “sidecar” (clever naming, huh?). This is technically only a strict requirement if you’re using the sidecar to upload Prometheus data to object storage, but you should strive to deploy the Sidecar alongside your Prometheus instance, even if you don’t use that feature. Doing so will help minimize latency between the Sidecar and Prometheus.

The Sidecar fulfills its first job of enabling distributed querying of Prometheus instances by exposing a gRPC API that Thanos Query (or other Thanos components such as Ruler) can communicate with. This gRPC API (henceforth referred to as StoreAPI) is implemented and exposed by the...
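
To make that concrete, here is a minimal sketch of launching the Sidecar next to a Prometheus server. The paths, ports, and filenames are illustrative placeholders rather than values from this chapter’s manifests (10901 and 10902 are simply the conventional Thanos gRPC and HTTP ports):

$ thanos sidecar \
    --tsdb.path=/prometheus \
    --prometheus.url=http://localhost:9090 \
    --objstore.config-file=/etc/thanos/objstore.yaml \
    --grpc-address=0.0.0.0:10901 \
    --http-address=0.0.0.0:10902

The referenced objstore.yaml holds the bucket configuration. An S3-style example with placeholder values might look like this:

type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  access_key: <your access key>
  secret_key: <your secret key>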

Thanos Compactor

The Thanos Compactor component is responsible for compacting and downsampling TSDB blocks stored in our Object Storage provider. Since we’ve disabled local compaction of TSDB blocks on the Prometheus instance, we still need to compact them somehow to ensure efficient storage of our data. Hence, the Thanos project provides a component for compaction.

Thanos Compactor handles compaction in the same way that Prometheus does: it takes several small blocks and compacts their indices and samples to make a larger block with an index that uses less space than if all the constituent blocks still maintained separate indices. This relies on the presupposition that most time series exist across multiple sequential blocks, which should almost always be the case.

There’s not much to note about how Thanos achieves this other than the requisite changes to account for the fact that the Compactor must download the blocks from object storage to compact them...
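
As a rough sketch, running the Compactor as a long-lived service looks something like the following; the data directory and retention values are illustrative assumptions, not recommendations from this chapter:

$ thanos compact \
    --data-dir=/var/thanos/compact \
    --objstore.config-file=/etc/thanos/objstore.yaml \
    --wait \
    --retention.resolution-raw=30d \
    --retention.resolution-5m=90d \
    --retention.resolution-1h=1y

The --wait flag keeps the Compactor running and watching for new blocks instead of exiting after a single pass, while the retention flags control how long raw and downsampled data are kept in the bucket.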

Thanos Query

Thanos Query is another of Thanos’s most fundamental components; without it, there’s not much point to Thanos Sidecar. It provides both a web UI and an API for executing PromQL queries across multiple data sources (for example, Prometheus via Thanos Sidecar, metrics in object storage via Thanos Store, and so on).

The web UI will feel familiar to anyone who has used the Prometheus web UI, since their functionality and UX are essentially equivalent. The query API is also 100% PromQL-compliant, so it can be used as a Prometheus-type data source in Grafana. In practice, at the companies I’ve worked at, we’ve tended to use Thanos Query as our default data source in Grafana.

Thanos Query works by connecting to one or more endpoints that implement Thanos’s gRPC-based StoreAPI. Endpoints can be specified via the repeatable --endpoint flag. This flag supports both static definitions (for example, --endpoint=192.168.1.2:10901)...
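
As a hedged sketch, wiring Query up to a static Sidecar endpoint and a DNS-discovered group of Stores might look like this (the addresses and the replica label name are placeholders):

$ thanos query \
    --http-address=0.0.0.0:10902 \
    --grpc-address=0.0.0.0:10901 \
    --endpoint=192.168.1.2:10901 \
    --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local \
    --query.replica-label=replica

The dnssrv+ prefix tells Query to discover endpoints via DNS SRV lookups, and --query.replica-label lets it deduplicate series from highly available Prometheus pairs that differ only by that label.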

Thanos Query Frontend

Thanos Query Frontend is a service that can be deployed in front of Thanos Query to improve query performance by splitting large-range queries into smaller ones and caching results. It is based on a similar component implemented by Cortex (https://github.com/cortexproject/cortex), the predecessor to Mimir. You can think of it as a pre-processor of queries, where the majority of the actual work is still done by the downstream queriers.

Query sharding and splitting

Presuming you run multiple top-level Thanos Query instances, you can put Query Frontend in front of them to share the load between them more efficiently than simply load balancing across them with something such as NGINX. This can be accomplished through query splitting based on time ranges and/or vertical sharding.

Query splitting

By default, the --query-range.split-interval flag is set to split range queries on a 24h interval. This means that if you query sum(my_metric) over...
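
As a sketch, putting Query Frontend in front of a downstream Query instance with that default interval made explicit might look like this (the downstream URL is a placeholder):

$ thanos query-frontend \
    --http-address=0.0.0.0:9090 \
    --query-frontend.downstream-url=http://thanos-query.monitoring.svc:10902 \
    --query-range.split-interval=24h

With a 24h split interval, a 7-day range query is broken into seven single-day queries that can be executed in parallel downstream and cached individually.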

Thanos Store

Thanos Store is perhaps the simplest Thanos component in terms of usage. There is not too much to tweak or tune and it does not require much in terms of resources. It is effectively stateless, although you can use persistent storage with it to reduce startup time as it populates metadata about the available blocks in object storage. We’ll get into some ways that you can horizontally scale it later, but for now, let’s just focus on what it is and how to deploy it.

Thanos Store is another component that implements the StoreAPI, so you can use Thanos Query to pull data from it. Thanos Store’s purpose is to function as a gateway to object storage, which is why you may sometimes see it referred to as the “Store Gateway.”

When Thanos Sidecar uploads blocks and when Thanos Compactor operates on them, they update the meta.json file within each block (see Chapter 3 for more information on that file) with a new thanos section. Thanos Store...
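
A minimal sketch of running Store follows; the data directory is an assumed placeholder, and pointing it at persistent storage merely speeds up the initial sync of block metadata:

$ thanos store \
    --data-dir=/var/thanos/store \
    --objstore.config-file=/etc/thanos/objstore.yaml \
    --grpc-address=0.0.0.0:10901 \
    --http-address=0.0.0.0:10902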

Thanos Ruler

Thanos Ruler enables a unique feature for advanced use cases: evaluating Prometheus rules across multiple Prometheus instances. For example, consider a service you have deployed across multiple regions with a Prometheus deployment in each region responsible for monitoring its corresponding instance of the service. Thanos Ruler would enable you to evaluate Prometheus rules across all of those Prometheus instances to obtain a more holistic view of your service. This is great for measuring things such as service-level objectives (SLOs).

Thanos Ruler accomplishes this by connecting to one or more Thanos Query endpoints to run queries against. If more than one is specified, it performs round-robin balancing of queries. In other words, a rule’s query is not evaluated by every specified Query instance – only one is chosen and used per query.

Data produced by evaluating recording rules is stored in a local TSDB in the same manner that it would be on Prometheus...
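
A hedged sketch of launching Ruler against two Query endpoints follows; the addresses, file paths, Alertmanager URL, and external label are illustrative assumptions:

$ thanos rule \
    --data-dir=/var/thanos/rule \
    --rule-file=/etc/thanos/rules/*.yaml \
    --query=thanos-query-0.monitoring.svc:10902 \
    --query=thanos-query-1.monitoring.svc:10902 \
    --alertmanagers.url=http://alertmanager.monitoring.svc:9093 \
    --objstore.config-file=/etc/thanos/objstore.yaml \
    --label='rule_replica="0"'

The repeated --query flags define the pool that Ruler round-robins its rule evaluations across, --label attaches an external label to the results, and --objstore.config-file lets Ruler upload the TSDB blocks produced by recording rules to object storage.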

Thanos Receiver

Thanos Receiver (or “Receive”) is the last of our components to deploy, and it is arguably the most complex Thanos component in your stack, depending on how you configure it. This is mostly because it is extremely configurable for multi-tenant and/or large-scale use cases. However, since it focuses primarily on receiving remote write data, we’ll skip diving too deep into the details of remote write, as you’re already familiar with it from the previous chapter.

Like other Thanos components, Thanos Receive is also intended to connect to object storage to upload TSDB blocks. It maintains a local TSDB while receiving data but will upload blocks to object storage when they are flushed to disk every 2 hours. Unlike Thanos Ruler, it is also intended to maintain a local copy of data for a longer period – by default, 15 days (configurable via the --tsdb.retention flag).
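
As a rough sketch (the ports, paths, and external label are conventional placeholders):

$ thanos receive \
    --tsdb.path=/var/thanos/receive \
    --tsdb.retention=15d \
    --remote-write.address=0.0.0.0:19291 \
    --grpc-address=0.0.0.0:10901 \
    --objstore.config-file=/etc/thanos/objstore.yaml \
    --label='replica="0"'

A Prometheus server would then point its remote_write configuration at http://<receive-address>:19291/api/v1/receive.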

A note on deduplication

As opposed to other...

Thanos tools

In addition to all of the Thanos components we have reviewed thus far, the Thanos CLI also has a thanos tools subcommand. This contains a variety of helpful tools, primarily for interacting with object storage buckets and the TSDB blocks within them. It also contains a command for validating recording and alerting rules used by Thanos Ruler.

Since these tools are primarily for use in existing, established environments, we won’t cover them individually in this book. Nevertheless, they may be worth experimenting with before cleaning up the Thanos components you’ve deployed in this chapter. Within any of the Thanos pods we deployed in this chapter, you can run thanos tools --help to see all of the available options.
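
For instance, you could list and inspect the blocks in your bucket or validate your Ruler rule files like so (assuming the object storage configuration is mounted at the path shown):

$ thanos tools bucket ls --objstore.config-file=/etc/thanos/objstore.yaml
$ thanos tools bucket inspect --objstore.config-file=/etc/thanos/objstore.yaml
$ thanos tools rules-check --rules=/etc/thanos/rules/*.yaml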

Cleanup

Now that we’re done experimenting with the suite of Thanos components, you can clean up your environment by reverting to our simple Prometheus deployment via Helm, like so:

$ helm upgrade --namespace prometheus \
    --version 47.0.0 \
    --values mastering-prometheus/ch10/prom-values.yaml \
    mastering-prometheus \
    prometheus-community/kube-prometheus-stack

Additionally, you can delete all of the Thanos components and our object storage configuration via kubectl:

$ kubectl delete secret thanos-objstore-config
$ kubectl delete -f https://raw.githubusercontent.com/PacktPublishing/Mastering-Prometheus/main/ch10/manifests/thanos-compact.yaml
$ kubectl delete -f https://raw.githubusercontent.com/PacktPublishing/Mastering-Prometheus/main/ch10/manifests/thanos-query.yaml
$ kubectl delete -f https://raw.githubusercontent.com/PacktPublishing/Mastering-Prometheus/main/ch10/manifests...

Summary

In this chapter, we went through all of the Thanos components that are available to gain a greater understanding of the comprehensive suite of features offered by the Thanos project.

We learned how Thanos Sidecar enables long-term storage of metrics in object storage and distributed querying through Thanos Query, how Thanos Compactor operates on those uploaded TSDB blocks in object storage to compact and downsample them, and how Thanos Store retrieves them from object storage on-demand for queries.

We saw how Thanos Query enables distributed querying of metrics from the various components that implement Thanos’s gRPC StoreAPI, and how Thanos Query Frontend enables more efficient use of Thanos Query instances through caching, query sharding, and splitting.

We utilized Thanos Ruler so that we could evaluate Prometheus alerts and rules across all endpoints connected to a Thanos Query instance.

Finally, we learned about and deployed Thanos Receiver so that we...
