You're reading from  Hands-On Infrastructure Monitoring with Prometheus

Product type: Book
Published in: May 2019
Publisher: Packt
ISBN-13: 9781789612349
Edition: 1st Edition
Authors (2):
Joel Bastos

Joel Bastos is an open source supporter and contributor, with a background in infrastructure security and automation. He is always striving for the standardization of processes, code maintainability, and code reusability. He has defined, led, and implemented critical, highly available, and fault-tolerant enterprise and web-scale infrastructures in several organizations, with Prometheus as the cornerstone. He has worked at two unicorn companies in Portugal and at one of the largest transaction-oriented gaming companies in the world. Previously, he supported several governmental entities with projects such as the Public Key Infrastructure for the Portuguese citizen card. You can find his blog at kintoandar and find him on Twitter with the handle @kintoandar.

Pedro Araújo

Pedro Araújo is a site reliability and automation engineer and has defined and implemented several standards for monitoring at scale. His contributions have been fundamental in connecting development teams to infrastructure. He is highly knowledgeable about infrastructure, but his passion is in the automation and management of large-scale, highly transactional systems. Pedro has contributed to several open source projects, such as Riemann, OpenTSDB, Sensu, Prometheus, and Thanos. You can find him on Twitter with the handle @phcrva.

Troubleshooting and Validation

Troubleshooting is an art in itself. In this chapter, we provide some useful guidelines on how to quickly detect and fix problems. You will discover useful endpoints that expose critical information, learn about promtool (Prometheus' command-line interface and validation tool), and see how to integrate it into your daily workflow. Finally, we'll look into the Prometheus database and collect insightful information regarding its usage.

In brief, the following topics will be covered in this chapter:

  • The test environment for this chapter
  • Exploring promtool
  • Logs and endpoint validation
  • Analyzing the time series database

The test environment for this chapter

In this chapter, we'll focus on the Prometheus server, deploying a new instance in a fresh test environment so that we can apply the concepts covered here.

Deployment

To create a new instance of Prometheus, move into the correct repository path:

cd chapter08/

Ensure that no other test environments are running and spin up this chapter's environment:

vagrant global-status
vagrant up

You can validate the successful deployment of the test environment using the following:

vagrant status

This should output the following:

Current machine states:

prometheus running (virtualbox)

The VM is running. To stop this VM, you can run `vagrant halt` to shut it down forcefully...

Exploring promtool

Prometheus ships with a very useful supporting command-line tool called promtool. This small Golang binary can be used to quickly perform several troubleshooting actions and is packed with helpful subcommands.

The features available can be divided into four categories, which we'll be covering next.

Checks

The subcommands that belong to this category provide the user with the ability to check and validate several configuration aspects of the Prometheus server and metric standards compliance. The following sections depict their usage.

check config
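
As a minimal sketch of how this check might be wrapped for daily use, the helper below runs promtool check config against a configuration file and reports the result. The function name check_prometheus_config and the default path /etc/prometheus/prometheus.yml are assumptions made for illustration; they are not taken from this chapter's test environment.

```shell
# Hypothetical helper: validate a Prometheus configuration file with promtool.
# The default path is an assumption; adjust it to match your deployment.
check_prometheus_config() {
    config_file="${1:-/etc/prometheus/prometheus.yml}"

    # Bail out early if promtool is not available on this machine.
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found in PATH" >&2
        return 127
    fi

    # 'promtool check config' exits with a non-zero status when the file is invalid.
    if promtool check config "$config_file"; then
        echo "configuration OK: $config_file"
    else
        echo "configuration INVALID: $config_file" >&2
        return 1
    fi
}
```

Calling check_prometheus_config with a path (or with no argument to use the assumed default) before reloading the server is one way to catch a broken file early; the non-zero return code also makes the helper usable in CI pipelines or pre-commit hooks.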

...

Logs and endpoint validation

In the next sections, we go through several useful HTTP endpoints and service logs that can be essential when troubleshooting issues with a Prometheus instance.

Endpoints

Checking whether Prometheus is up and running is usually very simple, as it follows the conventions most cloud-native applications use for service health: one endpoint to check whether the service is healthy and another to check whether it is ready to start handling incoming requests. For those who use or have used Kubernetes in the past, these might sound familiar; in fact, Kubernetes also uses these conventions to assess whether a container needs to be restarted (for example, if the application deadlocks and stops responding to...
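
The two checks described above could be probed with a sketch like the following. It assumes the instance is reachable at http://localhost:9090 (Prometheus' default port, not necessarily the address used in this chapter's test environment) and queries the /-/healthy and /-/ready endpoints; the function name probe_prometheus is hypothetical.

```shell
# Hypothetical helper: probe the Prometheus health and readiness endpoints.
# Assumes Prometheus listens on localhost:9090 unless a base URL is passed in.
probe_prometheus() {
    base_url="${1:-http://localhost:9090}"
    status=0

    for endpoint in healthy ready; do
        # -s hides progress output, -o discards the body,
        # and -w prints only the HTTP status code.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$base_url/-/$endpoint") || code="${code:-000}"
        if [ "$code" = "200" ]; then
            echo "$endpoint: OK"
        else
            echo "$endpoint: FAILED (HTTP $code)"
            status=1
        fi
    done

    # Non-zero when at least one endpoint did not answer with HTTP 200.
    return "$status"
}
```

Running probe_prometheus with no arguments checks the assumed default address. A healthy but not yet ready instance would typically show the first line as OK and the second as FAILED.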

Analyzing the time series database

A critical component of the Prometheus server is its time series database. Being able to analyze the usage of this database is essential to detect series churn and cardinality problems. Churn, in this context, refers to time series that become stale (for example, because the originating target stops being scraped or the series disappears from one scrape to the next) while a new series with a slightly different identity starts being collected in their place. A common example of churn comes from Kubernetes application deployments, where the pod instance IP address changes, making the previous time series obsolete and replacing it with a new one. This hurts query performance, as samples that may no longer be relevant are returned.
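
To make the idea of churn concrete, here is a small, self-contained sketch that compares two scrape snapshots and lists the series identities that disappeared and the ones that replaced them. It does not use any Prometheus tooling; the metric name, labels, and IP addresses are invented for the example, and the snapshots are assumed to contain no timestamps or spaces inside label values.

```shell
# Illustrative sketch: spot series churn between two scrape snapshots.
# Metric names, labels, and addresses below are made up for the example.
series_ids() {
    # Drop comment lines (# HELP / # TYPE) and the trailing sample value,
    # keeping only the series identity (metric name plus labels).
    grep -v '^#' "$1" | sed 's/ [^ ]*$//' | sort -u
}

workdir=$(mktemp -d)

cat > "$workdir/scrape1.txt" <<'EOF'
# TYPE http_requests_total counter
http_requests_total{app="shop",instance="10.0.0.5:8080"} 120
http_requests_total{app="cart",instance="10.0.0.7:8080"} 45
EOF

# After a redeploy, the "shop" pod comes back with a new IP address.
cat > "$workdir/scrape2.txt" <<'EOF'
# TYPE http_requests_total counter
http_requests_total{app="shop",instance="10.0.0.9:8080"} 3
http_requests_total{app="cart",instance="10.0.0.7:8080"} 52
EOF

series_ids "$workdir/scrape1.txt" > "$workdir/before.txt"
series_ids "$workdir/scrape2.txt" > "$workdir/after.txt"

# comm -23 keeps lines only in the first file; comm -13 keeps lines only in the second.
gone=$(comm -23 "$workdir/before.txt" "$workdir/after.txt")
new=$(comm -13 "$workdir/before.txt" "$workdir/after.txt")

echo "series that went stale:"
echo "$gone"
echo "series that appeared:"
echo "$new"

rm -rf "$workdir"
```

Only the instance label changed, yet the database now holds two distinct series for the same application. Repeated over many deployments, this is how churn inflates the number of series a query has to touch.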

Thankfully, there's an obscure tool within the source code for the Prometheus database that allows analyzing...

Summary

In this chapter, we had the opportunity to experiment with a couple of useful tools to troubleshoot and analyze Prometheus configuration issues and performance. We started with promtool and went through all its available options; then, we used several endpoints and logs to ensure everything was working as expected. Finally, we described the tsdb tool and how it can be used to troubleshoot and pinpoint problems with cardinality and the churn of metrics and labels in our Prometheus database.

We can now step into recording and alerting rules, which will be covered in the next chapter.

Questions

  1. How can you validate whether the main Prometheus configuration file has an issue?
  2. How can you assess whether metrics exposed by a target are up to Prometheus standards?
  3. Using promtool, how would you perform an instant query?
  4. How can you find all the label values being used?
  5. How do you enable debug logs on the Prometheus server?
  6. What's the difference between ready and healthy endpoints?
  7. How can you find the churn of labels on an old block of Prometheus data?

Further reading
