Assessments

Chapter 1, Monitoring Fundamentals

  1. A consensual definition of monitoring is hard to come by because it quickly shifts from industry to industry, and even between job-specific contexts. The diversity of viewpoints, the components that make up a monitoring system, and even how the data is collected or used all contribute to the difficulty of reaching a clear definition.
  2. System administrators are interested in high-resolution, low-latency, high-diversity data. Within this scope, the primary objective of monitoring is to discover problems quickly and identify their root causes as soon as possible.
  3. Low-resolution, high-latency, and high-diversity data.
  4. It depends on how broad you want to make the monitoring definition. Within the scope of this book, logging is not considered monitoring.
  5. The monitoring service's location needs to be propagated to all targets. Staleness is...

Chapter 2, An Overview of the Prometheus Ecosystem

  1. The main components are Prometheus, Alertmanager, Pushgateway, Native Instrumented Applications, Exporters, and Visualization solutions.
  2. Only Prometheus and scrape targets (whether natively instrumented or exposed through exporters) are essential for a Prometheus deployment. However, to have alert routing and management you also need Alertmanager; the Pushgateway is only required in very specific use cases, such as batch jobs; and while Prometheus does have basic dashboarding functionality built in, Grafana can be added to the stack as the visualization option.
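A minimal sketch of what such a bare-bones deployment could look like in prometheus.yml, assuming Prometheus scrapes only itself on the default port:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']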
  3. Not all applications are built with Prometheus-compatible instrumentation. Sometimes, no metrics at all are exposed. In these cases, we can rely on exporters.
  4. The information should be quickly gathered and exposed in a synchronous operation.
  5. Alerts will be sent from both sides of...

Chapter 3, Setting Up a Test Environment

  1. While the Prometheus stack can be deployed on almost every mainstream operating system, and will therefore most certainly run in your desktop environment, it is more reproducible to use a Vagrant-based test environment to simulate virtual machine deployments, and minikube to do the same for Kubernetes-based production environments.
  2. The defaults.sh file located in the utils directory allows the software versions to be changed for the virtual machine-based examples.
  3. The default subnet is 192.168.42.0/24 in all virtual machine-based examples.
  4. The steps to get a Prometheus instance up and running are as follows (a command-line sketch follows this list):
    1. Ensure that software versions match the ones recommended.
    2. Clone the code repository provided.
    3. Move into the chapter directory.
    4. Run vagrant up.
    5. When finished, run vagrant destroy -f.
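For example, assuming the repository URL and chapter directory below (both are illustrative), the whole cycle looks like this:

git clone https://github.com/PacktPublishing/Hands-On-Infrastructure-Monitoring-with-Prometheus.git
cd Hands-On-Infrastructure-Monitoring-with-Prometheus/chapter03
vagrant up
vagrant destroy -f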
  5. That information is available in the Prometheus web...

Chapter 4, Prometheus Metrics Fundamentals

  1. Time series data can be defined as a sequence of numerical data points collected chronologically from the same source – usually at a fixed interval. As such, this kind of data, when represented in a graphical form, will plot the evolution of the data through time, with the x-axis being time and the y-axis the data value.
  2. A timestamp, a value, and tags/labels.
  3. The write-ahead log (WAL).
  4. The default is 2h and should not be changed.
  5. A float64 value and a timestamp with millisecond precision.
  6. Histograms are especially useful for tracking bucketed latencies and sizes (for example, request durations or response sizes) as they can be freely aggregated across different dimensions. Another great use is to generate heatmaps (the evolution of histograms over time).
    Summaries without quantiles are quite cheap to generate, collect, and store...
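As an illustration of the kind of cross-dimensional aggregation histograms allow, the following PromQL sketch (the metric name is assumed) estimates the 95th percentile request duration across all label dimensions:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))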

Chapter 5, Running a Prometheus Server

  1. In that case, scrape_timeout will be set to its default of 10 seconds.
  2. Besides restarting, the configuration file can be reloaded by either sending a SIGHUP signal to the Prometheus process or sending an HTTP POST request to the /-/reload endpoint if --web.enable-lifecycle is used at startup.
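For illustration, assuming Prometheus is running locally on the default port, either of the following would trigger a reload:

kill -HUP $(pgrep prometheus)
curl -X POST http://localhost:9090/-/reload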
  3. Prometheus will look back up to five minutes by default, unless it finds a stale marker, in which case it will immediately consider the series stale.
  4. While relabel_configs is used to rewrite the target list before the scrape is performed, metric_relabel_configs is used to rewrite labels or drop samples after the scrape has occurred.
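A configuration sketch showing both stages (the job name, target address, and metric pattern are illustrative):

scrape_configs:
  - job_name: example
    static_configs:
      - targets: ['app:8080']
    relabel_configs:
      # applied to the target list before the scrape
      - source_labels: [__address__]
        target_label: instance
    metric_relabel_configs:
      # applied to the scraped samples, dropping unwanted metrics
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop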
  5. As we're scraping through a Kubernetes service (which is similar in function to a load balancer), the scrapes will hit only a single instance of the Hey application at a time.
  6. Due to the ephemeral nature of Kubernetes...

Chapter 6, Exporters and Integrations

  1. The textfile collector enables the exposition of custom metrics by watching a directory for files with the .prom extension that contain metrics in the Prometheus exposition format.
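For example, a batch job could write a metric file into that directory (the path and metric name are illustrative; in practice, write to a temporary file and rename it to avoid partial reads):

echo 'backup_last_success_timestamp_seconds 1556700000' > /var/lib/node_exporter/textfile/backup.prom
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile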
  2. Data is collected from the container runtime daemon and from Linux cgroups.
  3. You can restrict the number of collectors (--collectors) to enable, or use the metric whitelist (--metric-whitelist) or blacklist (--metric-blacklist) flags.
  4. When debugging probes, you can append &debug=true to the HTTP GET URL to enable debug information.
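For instance, a manual probe against the blackbox exporter with debugging enabled could look like the following (host, target, and module name are assumptions):

curl 'http://localhost:9115/probe?target=example.com&module=http_2xx&debug=true'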
  5. We can use mtail or grok_exporter to extract metrics from the application logs.
  6. One possible problem is the lack of high availability, making it a single point of failure. This also impacts scalability, as the only way to scale is vertically or by sharding. By using Pushgateway, Prometheus does not scrape an instance directly, which...

Chapter 7, Prometheus Query Language - PromQL

  1. The comparison operators are < (less than), > (greater than), == (equals), != (differs), >= (greater than or equal to), and <= (less than or equal to).
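A quick illustration using Node Exporter metrics (metric names assumed), returning only the filesystems with less than 10% of space available:

node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10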
  2. When the time series you want to enrich are on the right-hand side of the PromQL expression.
  3. topk already sorts its results.
  4. While the rate() function provides the per-second average rate of change over the specified interval by using the first and last values in the range scaled to fit the range window, the irate() function uses the last two values in the range for the calculation, which produces the instant rate of change.
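For example, both of the following expressions operate over the same five-minute range, but produce the average rate and the instant rate, respectively (the metric name is illustrative):

rate(http_requests_total[5m])
irate(http_requests_total[5m])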
  5. Metrics of type info have their names ending in _info and are regular gauges with one possible value, 1. This special kind of metric was designed to be a place where labels whose values might change over time are stored, such as versions (for example...

Chapter 8, Troubleshooting and Validation

  1. Prometheus is distributed with promtool, which, among other functions, can check a configuration file for issues:
promtool check config /etc/prometheus/prometheus.yml
  2. The promtool utility can also read metrics in the Prometheus exposition format from stdin and validate them according to the current Prometheus standards:
curl -s http://prometheus:9090/metrics | promtool check metrics
  3. The promtool utility can be used to run instant queries against a Prometheus instance:
promtool query instant 'http://prometheus:9090' 'up == 1'
  4. You can use promtool to find every label value for a given label name. One example is the following:
promtool query labels 'http://prometheus:9090' 'mountpoint'
  5. By adding --log.level=debug to the start-up parameters.
  6. The /-/healthy endpoint will tell you (or the orchestration...

Chapter 9, Defining Alerting and Recording Rules

  1. This type of rule can help take the load off heavy dashboards by pre-computing expensive queries, can aggregate raw data into time series that can then be exported to external systems, and can assist in the creation of compound range vector queries.
  2. For the same reasons as in scrape jobs, queries might produce erroneous results when using series with different sampling rates, and having to keep track of what series have what periodicity becomes unmanageable.
  3. instance_job:latency_seconds_bucket:rate30s needs to have at least the instance and job labels. It was calculated by applying the rate to the latency_seconds_bucket_total metric, using a 30-second range vector. Thus, the originating expression could probably be as follows:
rate(latency_seconds_bucket_total[30s])
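Written as a recording rule, this could look like the following sketch (the rule group name is illustrative):

groups:
  - name: example
    rules:
      - record: instance_job:latency_seconds_bucket:rate30s
        expr: rate(latency_seconds_bucket_total[30s])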
  4. As that label changes its value, so will the identity of the alert.
  5. An...

Chapter 10, Discovering and Creating Grafana Dashboards

  1. Grafana supports automatic provisioning of data sources by reading YAML definitions from a provisioning path at startup.
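A minimal provisioning sketch for a Prometheus data source (the URL is an assumption) could look like this:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true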
  2. Steps to import a dashboard from the Grafana gallery are as follows:
    1. Choose a dashboard ID from the grafana.com gallery.
    2. In the target Grafana instance, click on the plus sign in the main menu on the left-hand side and select Import from the sub-menu.
    3. Paste the chosen ID in the appropriate text field.
  3. Variables allow a dashboard to define placeholders that can be used in expressions and title strings. Those placeholders can be filled with values from either a static or a dynamic list, usually presented to the dashboard user as a drop-down menu. Whenever the selected value changes, Grafana will automatically update the queries in panels and title strings that use that respective...

Chapter 11, Understanding and Extending Alertmanager

  1. In the case of a network partition, each side of the partition will send notifications for the alerts it is aware of: in a clustering failure scenario, it's better to receive duplicate notifications for an issue than not to get any at all.
  2. Setting continue to true on a route makes the matching process keep going through the routing tree after that route matches, thereby allowing multiple receivers to be triggered.
  3. The group_interval configuration defines how long to wait for additional alerts in a given alert group (defined by group_by) before sending an updated notification when a new alert is received; repeat_interval defines how long to wait until resending notifications for a given alert group when there are no changes.
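The following routing sketch illustrates both points (receiver definitions are omitted, and receiver names and matchers are illustrative): the first child route matches critical alerts and notifies the pager receiver, and because continue is set to true, evaluation proceeds to the next route, which also notifies chat; the timing options control how notifications are grouped and resent.

route:
  receiver: default
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pager
      continue: true
    - match:
        severity: critical
      receiver: chat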
  4. The top-level route, also known as the catch-all or fallback route, will trigger a default...

Chapter 12, Choosing the Right Service Discovery

  1. Managing scrape targets in a highly dynamic environment becomes an arduous task without automatic discovery.
  2. A set of access credentials with sufficient permissions to list all the required resources through the cloud provider's API.
  3. It supports A, AAAA, and SRV DNS records.
  4. Due to the large number of API objects available to query, the Kubernetes discovery configuration for Prometheus has the concept of role, which can be either node, service, pod, endpoints, or ingress. Each makes available its corresponding set of objects for target discovery.
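A minimal discovery sketch using the pod role (the job name is illustrative; authentication and relabeling options are omitted):

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod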
  5. The best mechanism for implementing a custom service discovery is to use file-based discovery integration to inject targets into Prometheus.
  6. No. Prometheus will try to use filesystem watches to automatically detect when there are changes and then reload the target list, and will fall back to...
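A sketch of such an integration, with Prometheus reading targets from JSON files written by the custom discovery mechanism (paths, job name, targets, and labels are illustrative):

scrape_configs:
  - job_name: custom_sd
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json

A matching targets file, for example /etc/prometheus/targets/example.json, could contain the following:

[
  {
    "targets": ["10.0.0.1:9100"],
    "labels": {"env": "staging"}
  }
]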

Chapter 13, Scaling and Federating Prometheus

  1. You should consider sharding when you're sure a single instance isn't enough to handle the load, and you can't run it with more resources.
  2. Vertical sharding is used to split scrape workload according to responsibility (for example, by function or team), where each Prometheus shard scrapes different jobs. Horizontal sharding splits loads from a single scrape job into multiple Prometheus instances.
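One way to achieve horizontal sharding is the hashmod relabel action, as in the following sketch where each of two shards keeps only its own portion of the targets (the modulus and shard number are illustrative):

relabel_configs:
  - source_labels: [__address__]
    modulus: 2
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: '0'
    action: keep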
  3. To reduce the ingestion load on a Prometheus instance, you should consider dropping unnecessary metrics through the use of metric_relabel_configs rules, or by increasing the scrape interval so that fewer samples are ingested in total.
  4. From instance-level Prometheus servers, job-level aggregate metrics should be federated; from job-level Prometheus servers, datacenter-level aggregate metrics should be federated.
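A federation scrape job on the higher-level server could look like the following sketch (the target address and match expression are illustrative):

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-shard:9090']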
  5. You might need to use metrics only...

Chapter 14, Integrating Long-Term Storage with Prometheus

  1. The main advantages of basing the remote write feature on the WAL are that it makes streaming of metrics possible, it has a much smaller memory footprint, and it is more resilient to crashes.
  2. You can request Prometheus to produce a snapshot of the TSDB by using the /api/v1/admin/tsdb/snapshot API endpoint (only available when the --web.enable-admin-api flag is enabled), and then back up the snapshot.
  3. You can delete time series from the TSDB by using the /api/v1/admin/tsdb/delete_series API endpoint and then using the /api/v1/admin/tsdb/clean_tombstones endpoint to make Prometheus clean up the deleted series (these endpoints are only available when the --web.enable-admin-api flag is enabled).
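Assuming the admin API is enabled on prometheus:9090, the snapshot, deletion, and cleanup calls could look like this (the series matcher is illustrative):

curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
curl -XPOST -g 'http://prometheus:9090/api/v1/admin/tsdb/delete_series?match[]={job="old_job"}'
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones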
  4. Object storage usually provides 99.999999999% durability and 99.99% availability service-level agreements, and it’s quite cheap...