Optimizing and Debugging Prometheus

Even as we scale Prometheus out to multiple replicas and possibly even shards, there will still be performance issues that we need to know how to identify and address. Consequently, this chapter focuses on how to optimize Prometheus to make the most of the resources it has and how to debug issues when they arise.

In this chapter, we’re going to cover the following main topics:

  • Controlling cardinality
  • Recording rules
  • Scrape jitter
  • Using pprof
  • Query logging
  • Tuning garbage collection

Let’s get started!

Technical requirements

For this chapter, we’ll be using the Kubernetes cluster and Prometheus environment we created in Chapter 2, so we’ll need the same tools we installed there to interact with it.

This chapter’s code is available at https://github.com/PacktPublishing/Mastering-Prometheus.
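For example, you can grab a local copy of the code like this (assuming you have git installed):

git clone https://github.com/PacktPublishing/Mastering-Prometheus.git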

Controlling cardinality

In the previous chapter, we discussed the concept of data cardinality. To refresh your memory, cardinality is a measure of the number of unique values in a dataset. We know that time series databases in general have difficulty managing high-cardinality datasets without severe impacts on query performance.

This cardinality issue isn’t even necessarily specific to time series databases – relational databases such as MySQL or PostgreSQL also need to take cardinality into account when selecting data. In relational databases, large tables are often partitioned to improve query performance. That’s not an option with Prometheus’s TSDB – though you could argue the data is already partitioned by time, since each block functions as an independent database. Consequently, the closest we can get to partitioning data on something other than time is sharding, as we discussed in the previous chapter.

However, even...
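As a quick sketch (a generic query, not one specific to this book’s environment), you can ask Prometheus which metric names contribute the most series in the current head block:

topk(10, count by (__name__) ({__name__!=""}))

The TSDB Status page in the Prometheus web UI (under Status | TSDB Status) surfaces similar statistics, such as the metric names and label pairs with the highest series counts, without running a query.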

Recording rules

A simple way to improve the performance of your Prometheus queries is to use recording rules.

When performing complex queries over long time ranges, your query execution times can get quite long. If you’re putting these queries in dashboards in Grafana – or otherwise frequently using them – it can negatively impact user experience. Fortunately, Prometheus enables the pre-computation of these expensive queries via recording rules.

Recording rules differ from alerting rules because they produce new time series based on the results of the PromQL expression they evaluate. They can be defined alongside alerting rules, including within the same rule groups.
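As a minimal sketch (the group, metric, and label names here are illustrative rather than taken from this environment), a rule file containing a recording rule and an alerting rule in the same group might look like this:

groups:
  - name: example-rules
    interval: 1m
    rules:
      # Recording rule: pre-compute the per-job HTTP request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rules can live alongside it in the same group
      - alert: HighRequestRate
        expr: job:http_requests:rate5m > 1000
        for: 10m

Dashboards and alerts can then reference the cheap, pre-computed job:http_requests:rate5m series instead of re-evaluating the expensive expression on every refresh. The record name here follows the common level:metric:operations convention (a job-level aggregation of http_requests computed with rate over 5m), which also signals at a glance that the series comes from a recording rule.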

Recording rule conventions

Since recording rules produce new time series, they need to have unique names. Additionally, it should be clear from a metric’s name that it was created by a recording rule. Prometheus does not enforce naming restrictions for recording rules differently than...

Scrape jitter

Scrape jitter is the most common cause of oversized TSDB blocks that I have observed. Recall from Chapter 3 how sample timestamps are stored in the TSDB: from the third scrape onward, only the delta of the delta of the sample timestamp is stored. So long as this delta of the delta is 0, the TSDB’s compaction process can save a lot of space by effectively dropping the timestamp value from samples that all occur at a consistent interval. Across millions of samples, this can add up to gigabytes of storage space in every TSDB block. When the delta of the delta is not consistent, we refer to it as scrape jitter.

Scrape jitter simply means that scrapes do not occur at consistent intervals. In Prometheus, this often means that they are off by just a few milliseconds. By default, Prometheus will automatically adjust timestamps that are within a 2ms tolerance of the expected scrape time.
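To make the arithmetic concrete, consider a hypothetical target scraped every 15 seconds (timestamps in milliseconds, values invented for illustration):

Without jitter:  t1 = 1700000000000, t2 = 1700000015000, t3 = 1700000030000
                 deltas = 15000, 15000   ->  delta of delta = 0  (timestamp costs almost nothing to store)
With jitter:     t1 = 1700000000000, t2 = 1700000015000, t3 = 1700000030002
                 deltas = 15000, 15002   ->  delta of delta = 2  (extra bits must be stored for the sample)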

Configuring timestamp adjustments

Whether or not timestamps are...

Using pprof

On occasion, you may want insight into how Prometheus is performing under the hood at a much lower level. For example, what specifically in the Prometheus code is driving its memory usage? What is using so much CPU? Since Prometheus is written in Go, it can leverage Go’s native pprof profiling support.

pprof is a tool from Google (https://github.com/google/pprof) for producing, visualizing, and analyzing profiling data from applications. Prometheus exposes pprof endpoints via HTTP under the /debug/pprof path (for example, http://localhost:9090/debug/pprof):

Figure 7.2 – The /debug/pprof page on a Prometheus server

Although you can view all of these endpoints, such as /debug/pprof/heap and /debug/pprof/goroutine, in the browser, they are best used with the go command-line program. Go has the built-in ability to retrieve and visualize pprof data. For example, you can retrieve heap data like so...
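As a rough sketch (the host, port, and profile duration are illustrative), fetching a heap profile and a 30-second CPU profile looks like this:

# Heap (memory) profile, explored in pprof's interactive shell
go tool pprof http://localhost:9090/debug/pprof/heap

# 30-second CPU profile, rendered in pprof's web UI on port 8080
go tool pprof -http=:8080 "http://localhost:9090/debug/pprof/profile?seconds=30"

Within the interactive shell, commands such as top and web show which functions are consuming the most resources (web requires Graphviz to be installed).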

Query logging and limits

Depending on your environment and use case, you may run into performance problems caused by expensive queries or simply a high volume of queries. Prometheus provides built-in ways to log completed queries, maintain a log of active queries, and set limits to prevent overly broad queries from pulling in too much data. Using these, we can both determine the source of issues and put safeguards in place to prevent them from recurring.
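As a point of reference (a sketch of the relevant knobs rather than recommended settings – the values shown are simply Prometheus’s defaults), query limits are controlled by command-line flags on the Prometheus server:

# --query.timeout          abort any query running longer than this
# --query.max-samples      abort queries that would load more than this many samples into memory
# --query.max-concurrency  maximum number of queries executed concurrently
prometheus \
  --query.timeout=2m \
  --query.max-samples=50000000 \
  --query.max-concurrency=20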

Query logging

Query logging can be configured via the Prometheus configuration file under the global key with the query_log_file setting. This setting accepts a file path to which Prometheus will write all queries that it completes:

global:
    query_log_file: /var/log/prometheus/my_prometheus_queries.log

Each entry in the query log file includes helpful information such as the timestamp of the query, the query itself, and other useful statistics about the query...

Tuning garbage collection

The Go garbage collector (GC) is surprisingly simple if you’re coming from other garbage-collected languages such as Java, which has a seemingly unlimited number of ways to tune garbage collection. Go, by contrast, offers very few tuning knobs, so we'll focus on the two primary ones: the GOGC and GOMEMLIMIT environment variables.

Until recently (Go 1.19), the GOGC environment variable was the only supported way to control garbage collection behavior when running a Go program. Effectively, it sets the percentage by which the live heap size (memory) may grow relative to its size after the previous garbage collection before the GC kicks in to mark and reclaim memory.

Note

This is an intentional oversimplification: other things, such as the memory used by goroutine stacks and global pointers, also contribute to the overall memory size and are included in the percentage by which memory can increase...
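As a rough sketch of how both variables might be set (assuming Prometheus runs as a container in the Kubernetes environment from Chapter 2; the values are purely illustrative, not recommendations), the container spec could include:

containers:
  - name: prometheus
    image: quay.io/prometheus/prometheus
    env:
      - name: GOGC
        value: "75"      # run a GC cycle once the live heap has grown 75% since the last one (default: 100)
      - name: GOMEMLIMIT
        value: "7GiB"    # soft memory limit for the Go runtime, available since Go 1.19

Lowering GOGC trades extra CPU (more frequent GC cycles) for lower peak memory, while GOMEMLIMIT caps the runtime’s total memory use so that the GC runs more aggressively as that limit is approached.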

Summary

In this chapter, we learned how to make our Prometheus instance more reliable and efficient by applying different tweaks and tunings. We also learned how to debug Prometheus when things go wrong. Now that we have a good handle on running our Prometheus instance, in the next chapter, we’ll take an in-depth look at one of the primary sources of metrics in the Prometheus ecosystem: Node Exporter.
