Advancing Prometheus: Sharding, Federation, and High Availability

If you’re reading this book, chances are that you already had some experience with Prometheus before picking it up. If so, at some point you’ve probably also run into the need to scale Prometheus beyond a single instance managing everything. There are a variety of solutions to this problem, and in this chapter, we’ll cover a few of the built-in ones. As a bonus, we’ll look at how to make your Prometheus metrics highly available.

In this chapter, we’re going to cover the following main topics:

  • Prometheus’ limitations
  • Sharding Prometheus
  • Federating Prometheus
  • Achieving high availability (HA) in Prometheus

Let’s get started!

Technical requirements

For this chapter, we’ll be using the Kubernetes cluster and Prometheus environment we created in Chapter 2. Consequently, we’ll need the following tools installed to interact with them:

The code that’s used in this chapter is available at https://github.com/PacktPublishing/Mastering-Prometheus.

Prometheus’ limitations

When you first start using it, Prometheus may seem like it can do anything. It’s a hammer and everything looks like a nail. After a while, though, reality sets in and the cracks start to show. Queries start to get slower. Memory usage begins to creep up. You’re waiting longer and longer for WAL replays to finish when Prometheus starts up. Where did I go wrong? What do I need to do?

Rest assured, you did nothing wrong and you’re not alone. Prometheus, like any other technology, has limitations. Some of them are specific to Prometheus and others are limitations of time series databases in general. So, what are they? Well, the two we need to care about the most are cardinality and long-term storage.

Cardinality

Cardinality refers to the number of unique values in a dataset. High cardinality indicates a large number of unique values, whereas low cardinality indicates a small number.
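
To get a sense of the cardinality of your own instance, you can ask the TSDB directly. Here is a minimal sketch in PromQL (run these in the expression browser; the limit of 10 is an arbitrary choice):

    # Total number of active series in the head block
    count({__name__=~".+"})

    # The ten metric names contributing the most series
    topk(10, count by (__name__) ({__name__=~".+"}))

Prometheus also surfaces similar statistics in its web UI under Status > TSDB Status.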

Examples of high...

Sharding Prometheus

Chances are that if you’re looking to improve your Prometheus architecture through sharding, you’re hitting one of the limitations we talked about and it’s probably cardinality. You have a Prometheus instance that’s just got too much data in it, but… you don’t want to get rid of any data. So, the logical answer is… run another Prometheus instance!

When you split data across Prometheus instances like this, it’s referred to as sharding. If you’re familiar with other database designs, it probably isn’t sharding in the traditional sense. As previously established, Prometheus TSDBs do not talk to each other, so it’s not as if they’re coordinating to shard data across instances. Instead, you predetermine where data will be placed by how you configure the scrape jobs on each instance. So, it’s more like sharding scrape jobs than sharding the data. There are two main ways to accomplish...
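
As an early illustration of the general idea, here is a minimal sketch of one common pattern: hashmod relabeling, where every instance loads the same target list but keeps only its assigned share. The job name and file path below are hypothetical; between instances, only the shard number in the regex would change:

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - 'targets/nodes/*.json'   # hypothetical shared target list
        relabel_configs:
          # Hash each target's address into one of two buckets...
          - source_labels: [__address__]
            modulus: 2                   # total number of shards
            target_label: __tmp_shard
            action: hashmod
          # ...and keep only this instance's bucket.
          # The second instance would use regex: '1'.
          - source_labels: [__tmp_shard]
            regex: '0'
            action: keep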

Federating Prometheus

The elusive “single pane of glass.” Everybody wants it. Every software vendor purports to sell it. The dream is to have a single place where you can see all of your monitoring data. Sharding Prometheus may seem antithetical to that, but through federation, we can still achieve it.

What is federation? Federation is the process of joining together metrics from multiple sources in a central location. It is useful for aggregating your metrics into a centralized Prometheus instance. The federated metrics may be a 1:1 match of those present in the “lower” Prometheus instances, but they can also have PromQL functions applied to them to perform series aggregation and consolidation before being stored in the “higher” Prometheus instance(s).
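
As a concrete sketch, here is how a “higher” Prometheus instance might scrape the /federate endpoint of two sharded instances. The shard addresses and match[] selectors are hypothetical; adjust them to your own jobs and recording rules:

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true               # keep the original series labels
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="node"}'             # raw series from one job
            - '{__name__=~"job:.*"}'     # pre-aggregated recording rules
        static_configs:
          - targets:
              - 'prometheus-shard-0:9090'   # hypothetical shard addresses
              - 'prometheus-shard-1:9090'

Federating only aggregated recording rule results, rather than every raw series, is what keeps the central instance’s cardinality manageable.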

Should I federate?

If you’re federating because you’ve sharded your Prometheus instances, you’ve probably already run into the issue where one Prometheus...

Achieving high availability (HA) in Prometheus

Your monitoring environment needs to be one of your most resilient services. It may be a running joke that there’s no such thing as 100% uptime, but your monitoring environment should come pretty darn close. After all, it’s what you depend on to let you know when your other services aren’t meeting their 99.9% uptime goals.

Thus far, we’ve only run Prometheus as a single point of failure: if Prometheus goes down, all of its metrics and alerts go down with it. That gap in visibility and alerting is unacceptable. So, what can we do, given that Prometheus, unlike Alertmanager, has no built-in HA? The answer? Duplicate it.
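
A minimal sketch of what that duplication can look like, assuming two instances that share an identical scrape configuration: each carries an external label naming its replica so that downstream consumers (remote-write backends, Thanos, and so on) can tell the two streams apart and deduplicate them. The label names and addresses here are assumptions, not requirements:

    # prometheus-replica-0.yml (replica-1 differs only in the label value)
    global:
      external_labels:
        cluster: 'prod'              # hypothetical cluster identifier
        replica: 'replica-0'

    alerting:
      alert_relabel_configs:
        # Drop the replica label before sending alerts so that
        # Alertmanager deduplicates the two replicas' identical alerts.
        - regex: 'replica'
          action: labeldrop
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager-0:9093']   # hypothetical address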

Who watches the watchmen?

With an HA Prometheus setup, you can (and should) configure your Prometheus instances so that they monitor each other. Presuming they’re not running on the same physical hardware, unexpected failures should be isolated and you can be alerted to...
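
A hypothetical sketch of such cross-monitoring: each replica scrapes its peer and raises an alert if the peer disappears. Job names, addresses, and thresholds are illustrative:

    # In replica-0's scrape configuration: watch the other replica
    scrape_configs:
      - job_name: 'prometheus-peer'
        static_configs:
          - targets: ['prometheus-replica-1:9090']   # the *other* replica

    # In a rules file loaded by both replicas
    groups:
      - name: meta-monitoring
        rules:
          - alert: PeerPrometheusDown
            expr: up{job="prometheus-peer"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Peer Prometheus replica has been unreachable for 5 minutes'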

Summary

In this chapter, we learned how to advance our Prometheus architecture through the use of powerful patterns such as sharding, federation, and highly available replicas. At this point, we have an idea of how to scale our Prometheus instance as our usage grows. But what about when things go wrong? In the next chapter, we’ll talk about how we can optimize and debug Prometheus in production.

Further reading

To learn more about the topics that were covered in this chapter, take a look at the following resources:

You have been reading a chapter from Mastering Prometheus (Packt, April 2024, ISBN-13: 9781805125662).

About the author

William Hegedus

William Hegedus has worked in tech for over a decade in a variety of roles, culminating in site reliability engineering. He developed a keen interest in Prometheus and observability technologies during his time managing a 24/7 NOC environment and eventually became the first SRE at Linode, one of the foremost independent cloud providers. Linode was acquired by Akamai Technologies in 2022, and now Will manages a team of SREs focused on building the internal observability platform for Akamai's Connected Cloud. His team is responsible for a global fleet of Prometheus servers spanning over two dozen data centers and ingesting millions of data points every second, in addition to operating a suite of other observability tools. Will is an open source advocate and contributor who has contributed code to Prometheus, Thanos, and many other CNCF projects related to Kubernetes and observability. He lives in central Virginia with his wonderful wife, four kids, three cats, two dogs, and a bearded dragon.