Assessments

Chapter 1, Monitoring Fundamentals

  1. A consensual definition of monitoring is hard to come by because it quickly shifts from industry to industry, and even between job-specific contexts. The diversity of viewpoints, the components that make up a monitoring system, and even how the data is collected or used all contribute to the difficulty of reaching a clear definition.
  2. System administrators are interested in high-resolution, low-latency, high-diversity data. Within this scope, the primary objective of monitoring is to discover problems quickly and identify their root causes as soon as possible.
  3. Low-resolution, high-latency, and high-diversity data.
  4. It depends on how broad you want to make the monitoring definition. Within the scope of this book, logging is not considered monitoring.
  5. The monitoring service's location needs to be propagated to all targets. Staleness is...

Chapter 2, An Overview of the Prometheus Ecosystem

  1. The main components are Prometheus, Alertmanager, Pushgateway, Native Instrumented Applications, Exporters, and Visualization solutions.
  2. Only Prometheus and scrape targets (whether natively instrumented or exposed through exporters) are essential for a Prometheus deployment. However, to have alert routing and management you also need Alertmanager; the Pushgateway is only required in very specific use cases, such as batch jobs; and while Prometheus does have basic dashboarding functionality built in, Grafana can be added to the stack as the visualization option.
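A minimal sketch of what such a bare-bones deployment could look like in prometheus.yml, assuming Prometheus scrapes only itself on the default port:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']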
  3. Not all applications are built with Prometheus-compatible instrumentation. Sometimes, no metrics at all are exposed. In these cases, we can rely on exporters.
  4. The information should be quickly gathered and exposed in a synchronous operation.
  5. Alerts will be sent from both sides of...

Chapter 3, Setting Up a Test Environment

  1. While the Prometheus stack can be deployed on almost every mainstream operating system, and will therefore most certainly run in your desktop environment, it is more reproducible to use a Vagrant-based test environment to simulate virtual machine deployments, and minikube to do the same for Kubernetes-based production environments.
  2. The defaults.sh file located in the utils directory allows the software versions to be changed for the virtual machine-based examples.
  3. The default subnet is 192.168.42.0/24 in all virtual machine-based examples.
  4. The steps to get a Prometheus instance up and running are as follows (a command-line sketch follows this list):
    1. Ensure that software versions match the ones recommended.
    2. Clone the code repository provided.
    3. Move into the chapter directory.
    4. Run vagrant up.
    5. When finished, run vagrant destroy -f.
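For example, assuming the repository URL and chapter directory below (both are illustrative), the whole cycle looks like this:

git clone https://github.com/PacktPublishing/Hands-On-Infrastructure-Monitoring-with-Prometheus.git
cd Hands-On-Infrastructure-Monitoring-with-Prometheus/chapter03
vagrant up
vagrant destroy -f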
  5. That information is available in the Prometheus web...

Chapter 4, Prometheus Metrics Fundamentals

  1. Time series data can be defined as a sequence of numerical data points collected chronologically from the same source – usually at a fixed interval. As such, this kind of data, when represented in a graphical form, will plot the evolution of the data through time, with the x-axis being time and the y-axis the data value.
  2. A timestamp, a value, and tags/labels.
  3. The write-ahead log (WAL).
  4. The default is 2h and should not be changed.
  5. A float64 value and a timestamp with millisecond precision.
  6. Histograms are especially useful for tracking bucketed latencies and sizes (for example, request durations or response sizes) as they can be freely aggregated across different dimensions. Another great use is to generate heatmaps (the evolution of histograms over time).
    Summaries without quantiles are quite cheap to generate, collect, and store...
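As an illustration of the kind of cross-dimensional aggregation histograms allow, the following PromQL sketch (the metric name is assumed) estimates the 95th percentile request duration across all label dimensions:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))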

Chapter 5, Running a Prometheus Server

  1. In that case, scrape_timeout will be set to its default of 10 seconds.
  2. Besides restarting, the configuration file can be reloaded by either sending a SIGHUP signal to the Prometheus process or sending an HTTP POST request to the /-/reload endpoint if --web.enable-lifecycle is used at startup.
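For illustration, assuming Prometheus is running locally on the default port, either of the following would trigger a reload:

kill -HUP $(pgrep prometheus)
curl -X POST http://localhost:9090/-/reload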
  3. Prometheus will look back up to five minutes by default, unless it finds a stale marker, in which case it will immediately consider the series stale.
  4. While relabel_configs is used to rewrite the target list before the scrape is performed, metric_relabel_configs is used to rewrite labels or drop samples after the scrape has occurred.
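A configuration sketch showing both stages (the job name, target address, and metric pattern are illustrative):

scrape_configs:
  - job_name: example
    static_configs:
      - targets: ['app:8080']
    relabel_configs:
      # applied to the target list before the scrape
      - source_labels: [__address__]
        target_label: instance
    metric_relabel_configs:
      # applied to the scraped samples, dropping unwanted metrics
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop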
  5. As we're scraping through a Kubernetes service (which is similar in function to a load balancer), the scrapes will hit only a single instance of the Hey application at a time.
  6. Due to the ephemeral nature of Kubernetes...

Chapter 6, Exporters and Integrations

  1. The textfile collector enables the exposition of custom metrics by watching a directory for files with the .prom extension that contain metrics in the Prometheus exposition format.
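For example, a batch job could write a metric file into that directory (the path and metric name are illustrative; in practice, write to a temporary file and rename it to avoid partial reads):

echo 'backup_last_success_timestamp_seconds 1556700000' > /var/lib/node_exporter/textfile/backup.prom
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile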
  2. Data is collected from the container runtime daemon and from Linux cgroups.
  3. You can restrict the number of collectors (--collectors) to enable, or use the metric whitelist (--metric-whitelist) or blacklist (--metric-blacklist) flags.
  4. When debugging probes, you can append &debug=true to the HTTP GET URL to enable debug information.
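For instance, a manual probe against the blackbox exporter with debugging enabled could look like the following (host, target, and module name are assumptions):

curl 'http://localhost:9115/probe?target=example.com&module=http_2xx&debug=true'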
  5. We can use mtail or grok_exporter to extract metrics from the application logs.
  6. One possible problem is the lack of high availability, making it a single point of failure. This also impacts scalability, as the only way to scale is vertically or by sharding. By using Pushgateway, Prometheus does not scrape an instance directly, which...

Chapter 7, Prometheus Query Language - PromQL

  1. The comparison operators are < (less than), > (greater than), == (equals), != (differs), >= (greater than or equal to), and <= (less than or equal to).
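A quick illustration using Node Exporter metrics (metric names assumed), returning only the filesystems with less than 10% of space available:

node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10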
  2. When the time series you want to enrich are on the right-hand side of the PromQL expression.
  3. topk already sorts its results.
  4. While the rate() function provides the per-second average rate of change over the specified interval by using the first and last values in the range scaled to fit the range window, the irate() function uses the last two values in the range for the calculation, which produces the instant rate of change.
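For example, both of the following expressions operate over the same five-minute range, but produce the average rate and the instant rate, respectively (the metric name is illustrative):

rate(http_requests_total[5m])
irate(http_requests_total[5m])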
  5. Metrics of type info have their names ending in _info and are regular gauges with one possible value, 1. This special kind of metric was designed to be a place where labels whose values might change over time are stored, such as versions (for example...

Chapter 8, Troubleshooting and Validation

  1. Prometheus is distributed with promtool, which, among other functions, can check a configuration file for issues:
promtool check config /etc/prometheus/prometheus.yml
  2. The promtool utility can also read metrics in the Prometheus exposition format from stdin and validate them according to the current Prometheus standards:
curl -s http://prometheus:9090/metrics | promtool check metrics
  3. The promtool utility can be used to run instant queries against a Prometheus instance:
promtool query instant 'http://prometheus:9090' 'up == 1'
  4. You can use promtool to find every label value for a given label name. One example is the following:
promtool query labels 'http://prometheus:9090' 'mountpoint'
  5. By adding --log.level=debug to the start-up parameters.
  6. The /-/healthy endpoint will tell you (or the orchestration...

Chapter 9, Defining Alerting and Recording Rules

  1. This type of rule can help take the load off heavy dashboards by pre-computing expensive queries, can aggregate raw data into time series that can then be exported to external systems, and can assist in the creation of compound range vector queries.
  2. For the same reasons as in scrape jobs, queries might produce erroneous results when using series with different sampling rates, and having to keep track of what series have what periodicity becomes unmanageable.
  3. instance_job:latency_seconds_bucket:rate30s needs to have at least the instance and job labels. It was calculated by applying the rate to the latency_seconds_bucket_total metric, using a 30-second range vector. Thus, the originating expression could probably be as follows:
rate(latency_seconds_bucket_total[30s])
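Written as a recording rule, this could look like the following sketch (the rule group name is illustrative):

groups:
  - name: example
    rules:
      - record: instance_job:latency_seconds_bucket:rate30s
        expr: rate(latency_seconds_bucket_total[30s])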
  4. As that label changes its value, so will the identity of the alert.
  5. An...

Chapter 10, Discovering and Creating Grafana Dashboards

  1. Grafana supports automatic provisioning of data sources by reading YAML definitions from a provisioning path at startup.
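A minimal provisioning sketch for a Prometheus data source (the URL is an assumption) could look like this:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true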
  2. Steps to import a dashboard from the Grafana gallery are as follows:
    1. Choose a dashboard ID from the grafana.com gallery.
    2. In the target Grafana instance, click on the plus sign in the main menu on the left-hand side and select Import from the sub-menu.
    3. Paste the chosen ID in the appropriate text field.
  3. Variables allow a dashboard to define placeholders that can be used in expressions and title strings. Those placeholders can be filled with values from either a static or a dynamic list, usually presented to the dashboard user as a drop-down menu. Whenever the selected value changes, Grafana will automatically update the queries in panels and title strings that use that respective...

Chapter 11, Understanding and Extending Alertmanager

  1. In the case of a network partition, each side of the partition will send notifications for the alerts it is aware of: in a clustering failure scenario, it's better to receive duplicate notifications for an issue than not to get any at all.
  2. Setting continue to true on a route makes the matching process keep going through the routing tree after that route matches, thereby allowing multiple receivers to be triggered.
  3. The group_interval configuration defines how long to wait for additional alerts in a given alert group (defined by group_by) before sending an updated notification when a new alert is received; repeat_interval defines how long to wait until resending notifications for a given alert group when there are no changes.
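The following routing sketch illustrates both points (receiver definitions are omitted, and receiver names and matchers are illustrative): the first child route matches critical alerts and notifies the pager receiver, and because continue is set to true, evaluation proceeds to the next route, which also notifies chat; the timing options control how notifications are grouped and resent.

route:
  receiver: default
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pager
      continue: true
    - match:
        severity: critical
      receiver: chat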
  4. The top-level route, also known as the catch-all or fallback route, will trigger a default...

Chapter 12, Choosing the Right Service Discovery

  1. Managing scrape targets in a highly dynamic environment becomes an arduous task without automatic discovery.
  2. A set of access credentials with sufficient permissions to list all the required resources through the cloud provider's API.
  3. It supports A, AAAA, and SRV DNS records.
  4. Due to the large number of API objects available to query, the Kubernetes discovery configuration for Prometheus has the concept of role, which can be either node, service, pod, endpoints, or ingress. Each makes available its corresponding set of objects for target discovery.
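A minimal discovery sketch using the pod role (the job name is illustrative; authentication and relabeling options are omitted):

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod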
  5. The best mechanism for implementing a custom service discovery is to use file-based discovery integration to inject targets into Prometheus.
  6. No. Prometheus will try to use filesystem watches to automatically detect when there are changes and then reload the target list, and will fall back to...
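A sketch of such an integration, with Prometheus reading targets from JSON files written by the custom discovery mechanism (paths, job name, targets, and labels are illustrative):

scrape_configs:
  - job_name: custom_sd
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json

A matching targets file, for example /etc/prometheus/targets/example.json, could contain the following:

[
  {
    "targets": ["10.0.0.1:9100"],
    "labels": {"env": "staging"}
  }
]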

Chapter 13, Scaling and Federating Prometheus

  1. You should consider sharding when you're sure a single instance isn't enough to handle the load, and you can't run it with more resources.
  2. Vertical sharding is used to split scrape workload according to responsibility (for example, by function or team), where each Prometheus shard scrapes different jobs. Horizontal sharding splits loads from a single scrape job into multiple Prometheus instances.
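One way to achieve horizontal sharding is the hashmod relabel action, as in the following sketch where each of two shards keeps only its own portion of the targets (the modulus and shard number are illustrative):

relabel_configs:
  - source_labels: [__address__]
    modulus: 2
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: '0'
    action: keep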
  3. To reduce the ingestion load on a Prometheus instance, you should consider dropping unnecessary metrics through the use of metric_relabel_configs rules, or by increasing the scrape interval so that fewer samples are ingested in total.
  4. From instance-level Prometheus servers, job-level aggregate metrics should be federated; from job-level Prometheus servers, datacenter-level aggregate metrics should be federated.
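A federation scrape job on the higher-level server could look like the following sketch (the target address and match expression are illustrative):

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-shard:9090']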
  5. You might need to use metrics only...

Chapter 14, Integrating Long-Term Storage with Prometheus

  1. The main advantages of basing the remote write feature on the WAL are that it makes streaming of metrics possible, it has a much smaller memory footprint, and it is more resilient to crashes.
  2. You can request Prometheus to produce a snapshot of the TSDB by using the /api/v1/admin/tsdb/snapshot API endpoint (only available when the --web.enable-admin-api flag is enabled), and then back up the snapshot.
  3. You can delete time series from the TSDB by using the /api/v1/admin/tsdb/delete_series API endpoint and then using the /api/v1/admin/tsdb/clean_tombstones endpoint to make Prometheus clean up the deleted series (these endpoints are only available when the --web.enable-admin-api flag is enabled).
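Assuming the admin API is enabled on prometheus:9090, the snapshot, deletion, and cleanup calls could look like this (the series matcher is illustrative):

curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
curl -XPOST -g 'http://prometheus:9090/api/v1/admin/tsdb/delete_series?match[]={job="old_job"}'
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones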
  4. Object storage usually provides 99.999999999% durability and 99.99% availability service-level agreements, and it’s quite cheap...