Hands-On Infrastructure Monitoring with Prometheus
Published in May 2019 by Packt, 1st Edition, ISBN-13: 9781789612349
Authors (2):

Joel Bastos

Joel Bastos is an open source supporter and contributor, with a background in infrastructure security and automation. He is always striving for the standardization of processes, code maintainability, and code reusability. He has defined, led, and implemented critical, highly available, and fault-tolerant enterprise and web-scale infrastructures in several organizations, with Prometheus as the cornerstone. He has worked at two unicorn companies in Portugal and at one of the largest transaction-oriented gaming companies in the world. Previously, he supported several governmental entities with projects such as the Public Key Infrastructure for the Portuguese citizen card. You can find his blogs at kintoandar and on Twitter with the handle @kintoandar.

Pedro Araújo

Pedro Araújo is a site reliability and automation engineer who has defined and implemented several standards for monitoring at scale. His contributions have been fundamental in connecting development teams to infrastructure. He is highly knowledgeable about infrastructure, but his passion lies in the automation and management of large-scale, highly transactional systems. Pedro has contributed to several open source projects, such as Riemann, OpenTSDB, Sensu, Prometheus, and Thanos. You can find him on Twitter with the handle @phcrva.


Scaling and Federating Prometheus

Prometheus was designed to run as a single server. This approach can handle thousands of targets and millions of time series but, as you scale, you might find that a single instance is no longer enough. This chapter tackles that problem and explains how to scale Prometheus through sharding. Sharding, however, makes it harder to keep a global view of the infrastructure. To address this, we will go through the advantages and disadvantages of sharding, see how federation comes into the picture, and, lastly, introduce Thanos, a project created by the Prometheus community to address some of the issues presented.

In brief, the following topics will be covered in this chapter:

  • Test environment for this chapter
  • Scaling with the help of sharding
  • Having a global view using federation
  • Using Thanos to mitigate Prometheus shortcomings at scale

Test environment for this chapter

In this chapter, we'll be focusing on scaling and federating Prometheus. For this, we'll be deploying three instances to simulate a scenario where a global Prometheus instance gathers metrics from two others. This approach will allow us not only to explore the required configurations, but also to understand how everything works together.

The setup we'll be using is illustrated in the following diagram:

Figure 13.1: Test environment for this chapter

In the next section, we will explain how to get the test environment up and running.

Deployment

To launch a new test environment, move into the following chapter path, relative to the code repository root:

cd ./chapter13/

Ensure...

Scaling with the help of sharding

With growth come more teams, more infrastructure, and more applications. With time, running a single Prometheus server can become infeasible: changes to recording/alerting rules and scrape jobs become more frequent (requiring reloads that, depending on the configured scrape intervals, can take up to a couple of minutes to take effect), missed scrapes can start to happen as Prometheus becomes overwhelmed, or the person or team responsible for that instance may simply become a bottleneck in the organizational process. When this happens, we need to rethink the architecture of our solution so that it scales accordingly. Thankfully, the community has tackled this problem time and time again, and there are some recommendations on how to approach it. These recommendations revolve around sharding.

In this context, sharding means splitting...
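In broad terms, sharding comes in two flavors: vertical sharding assigns different scrape jobs (for example, per team or per service) to different Prometheus instances, while horizontal sharding splits the targets of a single oversized job across several instances. The following is a minimal sketch of horizontal sharding using the hashmod relabel action; the job name, target addresses, and modulus of 2 (two shards) are illustrative assumptions, with each shard keeping a different value in the regex (0 or 1):

scrape_configs:
  - job_name: 'node'          # illustrative job name
    static_configs:
      - targets: ['host01:9100', 'host02:9100', 'host03:9100']
    relabel_configs:
      - source_labels: [__address__]   # hash each target's address...
        modulus: 2                     # ...modulo the number of shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: '0'                     # this shard's number (0 or 1)
        action: keep                   # drop targets belonging to other shards

Each target address is hashed and taken modulo the number of shards, and the keep action then drops every target that doesn't belong to this shard, spreading the scrape load without any coordination between instances.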

Having a global view using federation

When you have multiple Prometheus servers, it can become quite cumbersome to know which one to query for a certain metric. Another problem that quickly comes up is how to aggregate data from multiple instances, possibly across multiple datacenters. This is where federation comes into the picture. Federation allows a Prometheus instance to scrape selected time series from other instances, effectively serving as a higher-level aggregating instance. This can happen in a hierarchical fashion, with each layer aggregating metrics from lower-level instances into broader time series, or in a cross-service pattern, where a few metrics are selected from instances at the same level so that certain recording and alerting rules become possible. For example, you could collect data for service throughput or latency in each...
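To make the hierarchical pattern concrete, a global Prometheus instance federates from the lower-level ones by scraping their /federate endpoint, pulling only the series that match the given match[] parameters. The following is a minimal sketch, where the shard hostnames and the job:-prefixed selector (a common convention for series produced by recording rules) are assumptions for illustration, not taken from this chapter's configuration:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true         # keep the original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':               # only pull aggregated (recording rule) series
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:               # hypothetical shard instances
        - 'shard01:9090'
        - 'shard02:9090'

Setting honor_labels: true matters here: it keeps the job and instance labels of the federated series intact instead of letting the global instance overwrite them with its own.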

Using Thanos to mitigate Prometheus shortcomings at scale

When you start to scale Prometheus, you quickly bump into the problem of cross-shard visibility. Grafana can help, as you can use multiple data sources in the same dashboard panel, but this becomes harder to maintain, especially when multiple teams have different needs. Keeping track of which shard has which metric might not be trivial when there are no clearly defined boundaries. This might not be a problem when you have a shard per team, since each team might only care about its own metrics, but issues arise when several shards are maintained by a single team and exposed as a service to the rest of the organization.

Additionally, it is common practice to run two identical Prometheus instances to prevent a single point of failure (SPOF) in the alerting path, known as high-availability (HA) pairs. This complicates...
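In broad strokes, Thanos addresses both problems: a sidecar runs alongside each Prometheus server and exposes its data over gRPC, while a querier fans queries out to every sidecar, merging the results and deduplicating HA pairs by a replica label. The commands below are a minimal sketch, assuming sidecars reachable at shard01 and shard02 and an external label named replica that distinguishes the members of each HA pair:

# sidecar next to each Prometheus server, exposing its data over gRPC
thanos sidecar \
  --prometheus.url http://localhost:9090 \
  --tsdb.path /var/lib/prometheus \
  --grpc-address 0.0.0.0:10901

# querier fanning out to both sidecars and deduplicating HA replicas
thanos query \
  --http-address 0.0.0.0:10902 \
  --store shard01:10901 \
  --store shard02:10901 \
  --query.replica-label replica

Without the replica label, the querier treats each member of an HA pair as a distinct series, so queries return duplicated results.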

Summary

In this chapter, we tackled issues concerning running Prometheus at scale. Even though a single Prometheus instance can get you a long way, it's a good idea to know how to grow if required. We learned how vertical and horizontal sharding work, when to use sharding, and what benefits and concerns it brings. We were introduced to the common patterns for federating Prometheus (hierarchical and cross-service), and how to choose between them depending on our requirements. Since we sometimes want more than out-of-the-box federation offers, we were introduced to the Thanos project and how it solves the global view problem.

In the next chapter, we'll be tackling another common requirement and one that isn't a core concern of the Prometheus project, which is the long-term storage of time series.

Questions

  1. When should you consider sharding Prometheus?
  2. What's the difference between sharding vertically and horizontally?
  3. Is there anything you can do before opting for a sharding strategy?
  4. What type of metrics are best suited for being federated in a hierarchical pattern?
  5. Why might you require cross-service federation?
  6. What protocol is used between Thanos querier and sidecar?
  7. If a replica label is not set in a Thanos querier that is configured with sidecars running alongside Prometheus HA pairs, what happens to the results of queries that are executed there?
