Advancing Prometheus: Sharding, Federation, and High Availability

If you’re reading this book, chances are that you already had some experience with Prometheus before picking it up. If so, at some point you’ve probably also run into the need to scale Prometheus beyond a single instance managing everything. There are a variety of solutions to this problem, and in this chapter, we’ll cover a few of the built-in ones. As a bonus, we’ll look at how to make your Prometheus metrics highly available.

In this chapter, we’re going to cover the following main topics:

  • Prometheus’ limitations
  • Sharding Prometheus
  • Federating Prometheus
  • Achieving high availability (HA) in Prometheus

Let’s get started!

Technical requirements

For this chapter, we’ll be using the Kubernetes cluster and Prometheus environment we created in Chapter 2. Consequently, we’ll need the following tools installed to interact with them:

The code that’s used in this chapter is available at https://github.com/PacktPublishing/Mastering-Prometheus.

Prometheus’ limitations

When you first start using it, Prometheus may seem like it can do anything. It’s a hammer and everything looks like a nail. After a while, though, reality sets in and the cracks start to show. Queries start to get slower. Memory usage begins to creep up. You’re waiting longer and longer for WAL replays to finish when Prometheus starts up. Where did I go wrong? What do I need to do?

Rest assured, you did nothing wrong and you’re not alone. Prometheus, like any other technology, has limitations. Some of them are specific to Prometheus and others are limitations of time series databases in general. So, what are they? Well, the two we need to care about the most are cardinality and long-term storage.

Cardinality

Cardinality refers to the number of unique values in a dataset. High cardinality indicates a large number of unique values, whereas low cardinality indicates a small number.
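
To get a sense of the cardinality of your own instance, you can ask the TSDB directly. Here is a minimal sketch in PromQL (run these in the expression browser; the limit of 10 is an arbitrary choice):

    # Total number of active series in the head block
    count({__name__=~".+"})

    # The ten metric names contributing the most series
    topk(10, count by (__name__) ({__name__=~".+"}))

Prometheus also surfaces similar statistics in its web UI under Status > TSDB Status.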

Examples of high...

Sharding Prometheus

Chances are that if you’re looking to improve your Prometheus architecture through sharding, you’re hitting one of the limitations we talked about and it’s probably cardinality. You have a Prometheus instance that’s just got too much data in it, but… you don’t want to get rid of any data. So, the logical answer is… run another Prometheus instance!

When you split data across Prometheus instances like this, it’s referred to as sharding. If you’re familiar with other database designs, it probably isn’t sharding in the traditional sense. As previously established, Prometheus TSDBs do not talk to each other, so it’s not as if they’re coordinating to shard data across instances. Instead, you predetermine where data will be placed by how you configure the scrape jobs on each instance. So, it’s more like sharding scrape jobs than sharding the data. There are two main ways to accomplish...
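
As an early illustration of the general idea, here is a minimal sketch of one common pattern: hashmod relabeling, where every instance loads the same target list but keeps only its assigned share. The job name and file path below are hypothetical; between instances, only the shard number in the regex would change:

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - 'targets/nodes/*.json'   # hypothetical shared target list
        relabel_configs:
          # Hash each target's address into one of two buckets...
          - source_labels: [__address__]
            modulus: 2                   # total number of shards
            target_label: __tmp_shard
            action: hashmod
          # ...and keep only this instance's bucket.
          # The second instance would use regex: '1'.
          - source_labels: [__tmp_shard]
            regex: '0'
            action: keep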

Federating Prometheus

The elusive “single pane of glass.” Everybody wants it. Every software vendor purports to sell it. The dream is to have a single place where you can see all of your monitoring data. Sharding Prometheus may seem antithetical to that, but through federation, we can still achieve it.

What is federation? Federation is the process of joining together metrics from multiple sources in a central location. It is useful for aggregating your metrics into a centralized Prometheus instance. The federated metrics may be a 1:1 match of those present in the “lower” Prometheus instances, but they can also have PromQL functions applied to them to perform series aggregation and consolidation before being stored in the “higher” Prometheus instance(s).
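
As a concrete sketch, here is how a “higher” Prometheus instance might scrape the /federate endpoint of two sharded instances. The shard addresses and match[] selectors are hypothetical; adjust them to your own jobs and recording rules:

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true               # keep the original series labels
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="node"}'             # raw series from one job
            - '{__name__=~"job:.*"}'     # pre-aggregated recording rules
        static_configs:
          - targets:
              - 'prometheus-shard-0:9090'   # hypothetical shard addresses
              - 'prometheus-shard-1:9090'

Federating only aggregated recording rule results, rather than every raw series, is what keeps the central instance’s cardinality manageable.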

Should I federate?

If you’re federating because you’ve sharded your Prometheus instances, you’ve probably already run into the issue where one Prometheus...

Achieving high availability (HA) in Prometheus

Your monitoring environment needs to be one of your most resilient services. It may be a running joke that there’s no such thing as 100% uptime, but your monitoring environment should come pretty darn close. After all, it’s what you depend on to let you know when your other services aren’t meeting their 99.9% uptime goals.

Thus far, we’ve only run Prometheus as a single point of failure: if Prometheus goes down, all of its metrics and alerts go down with it. That gap in visibility and alerting is unacceptable. So, what can we do, given that Prometheus, unlike Alertmanager, has no built-in HA? The answer? Duplicate it.
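
A minimal sketch of what that duplication can look like, assuming two instances that share an identical scrape configuration: each carries an external label naming its replica so that downstream consumers (remote-write backends, Thanos, and so on) can tell the two streams apart and deduplicate them. The label names and addresses here are assumptions, not requirements:

    # prometheus-replica-0.yml (replica-1 differs only in the label value)
    global:
      external_labels:
        cluster: 'prod'              # hypothetical cluster identifier
        replica: 'replica-0'

    alerting:
      alert_relabel_configs:
        # Drop the replica label before sending alerts so that
        # Alertmanager deduplicates the two replicas' identical alerts.
        - regex: 'replica'
          action: labeldrop
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager-0:9093']   # hypothetical address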

Who watches the watchmen?

With an HA Prometheus setup, you can (and should) configure your Prometheus instances so that they monitor each other. Presuming they’re not running on the same physical hardware, unexpected failures should be isolated and you can be alerted to...
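
A hypothetical sketch of such cross-monitoring: each replica scrapes its peer and raises an alert if the peer disappears. Job names, addresses, and thresholds are illustrative:

    # In replica-0's scrape configuration: watch the other replica
    scrape_configs:
      - job_name: 'prometheus-peer'
        static_configs:
          - targets: ['prometheus-replica-1:9090']   # the *other* replica

    # In a rules file loaded by both replicas
    groups:
      - name: meta-monitoring
        rules:
          - alert: PeerPrometheusDown
            expr: up{job="prometheus-peer"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Peer Prometheus replica has been unreachable for 5 minutes'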

Summary

In this chapter, we learned how to advance our Prometheus architecture through the use of powerful patterns such as sharding, federation, and highly available replicas. At this point, we have an idea of how to scale our Prometheus instance as our usage grows. But what about when things go wrong? In the next chapter, we’ll talk about how we can optimize and debug Prometheus in production.

Further reading

To learn more about the topics that were covered in this chapter, take a look at the following resources:

You have been reading a chapter from Mastering Prometheus (Packt, April 2024, ISBN-13: 9781805125662).

About the author

William Hegedus

William Hegedus has worked in tech for over a decade in a variety of roles, culminating in site reliability engineering. He developed a keen interest in Prometheus and observability technologies during his time managing a 24/7 NOC environment and eventually became the first SRE at Linode, one of the foremost independent cloud providers. Linode was acquired by Akamai Technologies in 2022, and now Will manages a team of SREs focused on building the internal observability platform for Akamai's Connected Cloud. His team is responsible for a global fleet of Prometheus servers spanning over two dozen data centers and ingesting millions of data points every second, in addition to operating a suite of other observability tools. Will is an open source advocate and contributor who has contributed code to Prometheus, Thanos, and many other CNCF projects related to Kubernetes and observability. He lives in central Virginia with his wonderful wife, four kids, three cats, two dogs, and a bearded dragon.