Optimizing and Debugging Prometheus

Even as we scale Prometheus out to multiple replicas and possibly even shards, there will still be performance issues that we need to know how to identify and address. Consequently, this chapter focuses on how to optimize Prometheus to make the most of the resources it has and how to debug issues when they arise.

In this chapter, we’re going to cover the following main topics:

  • Controlling cardinality
  • Recording rules
  • Scrape jitter
  • Using pprof
  • Query logging
  • Tuning garbage collection

Let’s get started!

Technical requirements

For this chapter, we’ll be using the Kubernetes cluster and Prometheus environment we created in Chapter 2, so we’ll need the same tools we installed there to interact with it.

This chapter’s code is available at https://github.com/PacktPublishing/Mastering-Prometheus.
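For example, you can grab a local copy of the code like this (assuming you have git installed):

git clone https://github.com/PacktPublishing/Mastering-Prometheus.git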

Controlling cardinality

In the previous chapter, we discussed the concept of data cardinality. To refresh your memory, cardinality is a measure of the number of unique values in a dataset. We know that time series databases in general have difficulty managing high-cardinality datasets without severe impacts on query performance.

This cardinality issue isn’t even necessarily specific to time series databases – relational databases such as MySQL or PostgreSQL also need to take cardinality into account when selecting data. In relational databases, large tables are often partitioned to improve query performance. That’s not an option with Prometheus’s TSDB – though you could argue the data is already partitioned by time, since each block functions as an independent database. Consequently, the closest we can get to partitioning data on something other than time is sharding, as we discussed in the previous chapter.

However, even...
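As a quick sketch (a generic query, not one specific to this book’s environment), you can ask Prometheus which metric names contribute the most series in the current head block:

topk(10, count by (__name__) ({__name__!=""}))

The TSDB Status page in the Prometheus web UI (under Status | TSDB Status) surfaces similar statistics, such as the metric names and label pairs with the highest series counts, without running a query.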

Recording rules

A simple way to improve the performance of your Prometheus queries is to use recording rules.

When performing complex queries over long time ranges, your query execution times can get quite long. If you’re putting these queries in dashboards in Grafana – or otherwise frequently using them – it can negatively impact user experience. Fortunately, Prometheus enables the pre-computation of these expensive queries via recording rules.

Recording rules differ from alerting rules because they produce new time series based on the results of the PromQL expression they evaluate. They can be defined alongside alerting rules, including within the same rule groups.
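As a minimal sketch (the group, metric, and label names here are illustrative rather than taken from this environment), a rule file containing a recording rule and an alerting rule in the same group might look like this:

groups:
  - name: example-rules
    interval: 1m
    rules:
      # Recording rule: pre-compute the per-job HTTP request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rules can live alongside it in the same group
      - alert: HighRequestRate
        expr: job:http_requests:rate5m > 1000
        for: 10m

Dashboards and alerts can then reference the cheap, pre-computed job:http_requests:rate5m series instead of re-evaluating the expensive expression on every refresh. The record name here follows the common level:metric:operations convention (a job-level aggregation of http_requests computed with rate over 5m), which also signals at a glance that the series comes from a recording rule.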

Recording rule conventions

Since recording rules produce new time series, they need to have unique names. Additionally, it should be clear from a metric’s name that it was created by a recording rule. Prometheus does not enforce naming restrictions for recording rules differently than...

Scrape jitter

Scrape jitter is the most common cause of oversized TSDB blocks that I have observed. Recall from Chapter 3 how sample timestamps are stored in the TSDB: from the third scrape onward, only the delta of the delta of the sample timestamp is stored. So long as this delta of the delta is 0, the TSDB’s compaction process can save a lot of space by effectively dropping the timestamp value from samples that all occur at a consistent interval. Across millions of samples, this can add up to gigabytes of storage space in every TSDB block. When the delta of the delta is not consistent, we refer to it as scrape jitter.

Scrape jitter simply means that scrapes do not occur at consistent intervals. In Prometheus, this often means that they are off by just a few milliseconds. By default, Prometheus will automatically adjust timestamps that are within a 2ms tolerance of the expected scrape time.
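To make the arithmetic concrete, consider a hypothetical target scraped every 15 seconds (timestamps in milliseconds, values invented for illustration):

Without jitter:  t1 = 1700000000000, t2 = 1700000015000, t3 = 1700000030000
                 deltas = 15000, 15000   ->  delta of delta = 0  (timestamp costs almost nothing to store)
With jitter:     t1 = 1700000000000, t2 = 1700000015000, t3 = 1700000030002
                 deltas = 15000, 15002   ->  delta of delta = 2  (extra bits must be stored for the sample)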

Configuring timestamp adjustments

Whether or not timestamps are...

Using pprof

On occasion, you may want insight into how Prometheus is performing under the hood at a much lower level. For example, what specifically in the Prometheus code is driving its memory usage? What is using so much CPU? Since Prometheus is written in Go, it can leverage Go’s native pprof profiling support.

pprof is a tool from Google (https://github.com/google/pprof) for producing, visualizing, and analyzing profiling data from applications. Prometheus exposes pprof endpoints via HTTP under the /debug/pprof path (for example, http://localhost:9090/debug/pprof):

Figure 7.2 – The /debug/pprof page on a Prometheus server

Although you can view all of these endpoints, such as /debug/pprof/heap and /debug/pprof/goroutine, in the browser, they are best used with the go command-line program. Go has the built-in ability to retrieve and visualize pprof data. For example, you can retrieve heap data like so...
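As a rough sketch (the host, port, and profile duration are illustrative), fetching a heap profile and a 30-second CPU profile looks like this:

# Heap (memory) profile, explored in pprof's interactive shell
go tool pprof http://localhost:9090/debug/pprof/heap

# 30-second CPU profile, rendered in pprof's web UI on port 8080
go tool pprof -http=:8080 "http://localhost:9090/debug/pprof/profile?seconds=30"

Within the interactive shell, commands such as top and web show which functions are consuming the most resources (web requires Graphviz to be installed).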

Query logging and limits

Depending on your environment and use case, you may run into performance problems caused by expensive queries or simply a high volume of queries. Prometheus provides built-in ways to log completed queries, maintain a log of active queries, and set limits to prevent overly broad queries from pulling in too much data. Using these, we can both determine the source of issues and put safeguards in place to prevent them from recurring.
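As a point of reference (a sketch of the relevant knobs rather than recommended settings – the values shown are simply Prometheus’s defaults), query limits are controlled by command-line flags on the Prometheus server:

# --query.timeout          abort any query running longer than this
# --query.max-samples      abort queries that would load more than this many samples into memory
# --query.max-concurrency  maximum number of queries executed concurrently
prometheus \
  --query.timeout=2m \
  --query.max-samples=50000000 \
  --query.max-concurrency=20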

Query logging

Query logging can be configured via the Prometheus configuration file under the global key with the query_log_file setting. This setting accepts a file path to which Prometheus will write all queries that it completes:

global:
    query_log_file: /var/log/prometheus/my_prometheus_queries.log

Each entry in the query log file includes helpful information such as the timestamp of the query, the query itself, and other useful statistics about the query...

Tuning garbage collection

The Go garbage collector (GC) is surprisingly simple if you’re coming from other garbage-collected languages such as Java, which has a seemingly unlimited number of ways to tune garbage collection. Go, by contrast, offers very few tuning knobs, so we'll focus on the two primary ones: the GOGC and GOMEMLIMIT environment variables.

Until recently (Go 1.19), the GOGC environment variable was the only supported way to control garbage collection behavior when running a Go program. Effectively, it sets the percentage by which the live heap size (memory) may grow relative to its size after the previous garbage collection before the GC kicks in to mark and reclaim memory.

Note

This is an intentional oversimplification: other things, such as the memory used by goroutine stacks and global pointers, also contribute to the overall memory size and are included in the percentage by which memory can increase...
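As a rough sketch of how both variables might be set (assuming Prometheus runs as a container in the Kubernetes environment from Chapter 2; the values are purely illustrative, not recommendations), the container spec could include:

containers:
  - name: prometheus
    image: quay.io/prometheus/prometheus
    env:
      - name: GOGC
        value: "75"      # run a GC cycle once the live heap has grown 75% since the last one (default: 100)
      - name: GOMEMLIMIT
        value: "7GiB"    # soft memory limit for the Go runtime, available since Go 1.19

Lowering GOGC trades extra CPU (more frequent GC cycles) for lower peak memory, while GOMEMLIMIT caps the runtime’s total memory use so that the GC runs more aggressively as that limit is approached.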

Summary

In this chapter, we learned how to make our Prometheus instance more reliable and efficient by applying different tweaks and tunings. We also learned how to debug Prometheus when things go wrong. Now that we have a good handle on running our Prometheus instance, in the next chapter, we’ll take an in-depth look at one of the primary sources of metrics in the Prometheus ecosystem: Node Exporter.
