
You're reading from Hands-On Infrastructure Monitoring with Prometheus

Product type: Book
Published in: May 2019
Publisher: Packt
ISBN-13: 9781789612349
Edition: 1st Edition
Authors (2):
Joel Bastos

Joel Bastos is an open source supporter and contributor, with a background in infrastructure security and automation. He is always striving for the standardization of processes, code maintainability, and code reusability. He has defined, led, and implemented critical, highly available, and fault-tolerant enterprise and web-scale infrastructures in several organizations, with Prometheus as the cornerstone. He has worked at two unicorn companies in Portugal and at one of the largest transaction-oriented gaming companies in the world. Previously, he supported several governmental entities with projects such as the Public Key Infrastructure for the Portuguese citizen card. You can find his blogs at kintoandar and on Twitter with the handle @kintoandar.

Pedro Araújo

Pedro Araújo is a site reliability and automation engineer and has defined and implemented several standards for monitoring at scale. His contributions have been fundamental in connecting development teams to infrastructure. He is highly knowledgeable about infrastructure, but his passion is in the automation and management of large-scale, highly-transactional systems. Pedro has contributed to several open source projects, such as Riemann, OpenTSDB, Sensu, Prometheus, and Thanos. You can find him on Twitter with the handle @phcrva.

Defining Alerting and Recording Rules

Recording rules are a useful Prometheus feature: they speed up heavy queries and make PromQL subqueries practical that would otherwise be very expensive. Alerting rules are similar to recording rules, but with alert-specific semantics. As testing is a fundamental part of any system, you'll have the opportunity in this chapter to learn how to ensure that recording and alerting rules behave as expected before they are deployed. Understanding these constructs will help you make Prometheus faster and more robust, and will enable its alerting capabilities.

The following topics will be covered in this chapter:

  • Creating the test environment
  • Understanding how rule evaluation works
  • Setting up alerting in Prometheus
  • Testing your rules

Creating the test environment

In this chapter, we'll be focusing on the Prometheus server and we'll be deploying a new instance so that we can apply the concepts covered.

Deployment

Let's begin by creating a new instance of Prometheus and deploying it to the server:

  1. To create a new instance of Prometheus, move into the correct repository path:
cd chapter09/
  2. Ensure that no other test environments are running and spin up this chapter's environment:
vagrant global-status
vagrant up
  3. Validate the successful deployment of the test environment using the following code:
vagrant status

This will output the following:

Current machine states:

prometheus running (virtualbox)

The VM is running...

Understanding how rule evaluation works

Prometheus allows the periodic evaluation of PromQL expressions and the storage of the time series generated by them; these are called rules. There are two types of rules, recording and alerting, as we'll see in this chapter. They share the same evaluation engine but differ in purpose, which we'll go into next.

The evaluation results of recording rules are saved into the Prometheus database as samples for the time series specified in the configuration. This type of rule can take the load off heavy dashboards by pre-computing expensive queries, can aggregate raw data into a time series that can then be exported to external systems (such as higher-level Prometheus instances through federation, as described in Chapter 13, Scaling and Federating Prometheus), and can help to create compound...
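
As an illustration of the preceding description, here is a minimal sketch of a rule file containing a single recording rule. The group name, the http_requests_total metric, and the recorded series name are assumptions made for this example and are not taken from the chapter's test environment:

```yaml
# rules.yml (hypothetical file name)
groups:
  - name: example_recording_rules   # hypothetical group name
    interval: 1m                    # optional; defaults to the global evaluation_interval
    rules:
      # Pre-compute the per-job request rate so that dashboards can query
      # the cheap recorded series instead of the raw expression.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Each time the group is evaluated, the result of expr is stored as new samples of the job:http_requests:rate5m series, which can then be queried like any other metric.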

Setting up alerting in Prometheus

So far, we have covered how PromQL can be invaluable in querying the collected data, but when we require an expression to be continuously evaluated so that an event is triggered when a defined condition is met, we're promptly stepping into alerting. We explained how alerting is one of the components of monitoring in Chapter 1, Monitoring Fundamentals. To be clear, Prometheus is not responsible for issuing email, Slack, or any other forms of notification; that is the responsibility of another service. This service is typically Alertmanager, which we'll go over in Chapter 11, Understanding and Extending Alertmanager. Prometheus leverages the power of alerting rules to push alerts, which we'll be covering next.
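
To make this division of responsibilities concrete, the following is a minimal sketch of the relevant parts of a Prometheus configuration file. The rule file path and the Alertmanager address are assumptions for illustration only (9093 is the default Alertmanager port):

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m        # how often rules are evaluated by default

rule_files:
  - "rules/*.yml"                # hypothetical location of recording and alerting rules

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"   # hypothetical Alertmanager host
```

Prometheus evaluates the rules it loads from rule_files and pushes the resulting alerts to the configured Alertmanager instances, which then take care of the notifications.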

What is an alerting rule...
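
As a rough sketch of what an alerting rule can look like, consider the following rule file. The alert name, the threshold, the severity label, and the annotation texts are assumptions chosen for this example:

```yaml
# alerts.yml (hypothetical file name)
groups:
  - name: example_alerting_rules   # hypothetical group name
    rules:
      - alert: InstanceDown
        # The alert becomes active for every series returned by this expression.
        expr: up == 0
        # The condition must hold for this long before the alert fires;
        # until then, the alert stays in the pending state.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

A file such as this one can be syntax-checked with promtool check rules before being loaded by Prometheus.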

Testing your rules

In Chapter 8, Troubleshooting and Validation, we went through the features that promtool has to offer, with the exception of testing. The test rules subcommand can simulate the periodic ingestion of samples for several time series, use those series to evaluate recording and alerting rules, and then test whether the recorded series match what was configured as the expected results. Now that we understand recording and alerting rules, we'll look at how to ensure that they behave as expected, by creating unit tests and using promtool to validate our rules.

Recording rules tests

The promtool tool included in the Prometheus binary distribution allows us to define test cases to validate that the rules we...
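
A minimal sketch of such a test file could look as follows. It assumes a hypothetical rules.yml containing a recording rule named job:http_requests:rate5m, defined as sum by (job) (rate(http_requests_total[5m])); the metric, labels, and file names are illustrative and not taken from the chapter's test environment:

```yaml
# tests.yml (hypothetical file name)
rule_files:
  - rules.yml            # assumed to define job:http_requests:rate5m

evaluation_interval: 1m

tests:
  - interval: 1m         # spacing between the simulated input samples
    input_series:
      # A counter that starts at 0 and grows by 60 every minute,
      # that is, one request per second.
      - series: 'http_requests_total{job="app", instance="a"}'
        values: '0+60x10'
    promql_expr_test:
      # After 10 minutes, the recorded series should report 1 request/second.
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{job="app"}'
            value: 1
```

Running promtool test rules tests.yml evaluates the rules against the simulated series and reports whether the recorded samples match the expected ones.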

Summary

In this chapter, we had the opportunity to observe a different way to produce a derivative time series. When recurrent heavy queries are required, recording rules help improve the stability and performance of the monitoring system by pre-computing them into new time series that are comparatively cheap to consult. Alerting rules bring the power and flexibility of PromQL to alerts; they enable triggering alerts for complex and dynamic thresholds as well as targeting multiple instances or even different applications using a single alert rule. Having a good grasp of how delays are introduced in alerts will now help you tailor them to your needs, but remember, a little delay is better than noisy alerts. Finally, we explored how to create unit tests for our rules and validate them even before a Prometheus server is running.

The next chapter will step into another component of monitoring...

Questions

  1. What are the primary uses for recording rules?
  2. Why should you avoid setting different evaluation intervals in rule groups?
  3. If you were presented with the instance_job:latency_seconds_bucket:rate30s metric, what labels would you expect to find and what would be the expression used to record it?
  4. Why is using the sample value of an alert in the alert labels a bad idea?
  5. What is the pending state of an alert?
  6. How long would an alert wait between being triggered and transitioning to the firing state when the for clause is not specified?
  7. How can you test your rules without using Prometheus?
