Metrics Collection and Visualization

"What's measured improves."

- Peter Drucker

In the previous chapters, we converted our initial monolithic application into a set of microservices that are now running distributed inside our Kubernetes cluster. This paradigm shift introduced a new item to our list of project requirements: as system operators, we must be able to monitor the health of each individual service and be notified when problems arise.

We will begin this chapter by comparing the strengths and weaknesses of popular systems for capturing and aggregating metrics. Then we will focus our attention on Prometheus, a popular metrics collection system written entirely in Go. We will explore approaches for instrumenting our code to facilitate the efficient collection and export of metrics. In the last part of this chapter, we will investigate the use of Grafana for...

Technical requirements

The full code for the topics that will be discussed in this chapter has been published in this book's GitHub repository under the Chapter13 folder.

You can access this book's GitHub repository, which contains all the code and required resources for the chapters in this book, by pointing your web browser to the following URL: https://github.com/PacktPublishing/Hands-On-Software-Engineering-with-Golang.

To get you up and running as quickly as possible, each example project includes a Makefile that defines the following set of targets:

Makefile target	Description
`deps`	Install any required dependencies
`test`	Run all tests and report coverage
`lint`	Check for lint errors

As with the other chapters in this book, you will need a fairly recent version of Go, which you can download from https://golang.org/dl.

To run some of the code in this chapter...

Monitoring from the perspective of a site reliability engineer

As we saw in Chapter 1, A Bird's-Eye View of Software Engineering, monitoring the state and performance of software systems is one of the key responsibilities associated with the role of a site reliability engineer (SRE). Before we delve deeper into the topic of monitoring and alerting, we should take a few minutes to clarify some of the SRE-related terms that we will be using in the following sections.

Service-level indicators (SLIs)

An SLI is a type of metric that allows us to quantify the perceived quality of a service from the perspective of the end user. Let's take a look at some common types of SLIs that can be applied to cloud-based services...
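To make the idea of an SLI more concrete, here is a minimal sketch that computes an availability-style indicator as the fraction of requests that were served successfully. Availability is used here purely as an illustrative assumption; the `RequestStats` type and its fields are hypothetical names and do not come from the book's code.

```go
package main

import "fmt"

// RequestStats is a hypothetical tally of the requests observed by a
// service over some measurement window.
type RequestStats struct {
	Total  uint64 // all requests received in the window
	Failed uint64 // requests that ended in a server-side error
}

// Availability returns the fraction of requests that succeeded, as a
// value in the range [0, 1]. A window without any traffic is reported
// as fully available.
func (s RequestStats) Availability() float64 {
	if s.Total == 0 {
		return 1.0
	}
	if s.Failed >= s.Total {
		return 0.0
	}
	return float64(s.Total-s.Failed) / float64(s.Total)
}

func main() {
	stats := RequestStats{Total: 10000, Failed: 25}
	fmt.Printf("availability SLI: %.2f%%\n", stats.Availability()*100)
}
```

Expressing the indicator as a ratio rather than a raw error count keeps it comparable across services that receive very different amounts of traffic.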

Exploring options for collecting and aggregating metrics

The sheer complexity and level of customization inherent in modern, microservice-based systems have led to the development of specialized tooling to facilitate the collection and aggregation of metrics.

In this section, we will briefly discuss a few popular pieces of software for this task.

Comparing push versus pull systems

Monitoring and metrics aggregation systems can be classified into two broad categories based on the entity that initiates the data collection:

In a push-based system, the client (for example, the application or a data collection service running on a node) is responsible for transmitting the metrics data to the metrics aggregation...

Visualizing collected metrics using Grafana

By this point, you should have already selected a suitable metrics collection solution for your applications and instrumented your code base to emit the metrics that you are interested in tracking. To make sense of the collected data and reason about it, we need to visualize it.

For this task, we will be using Grafana ^[4] as our tool of choice. Grafana offers a convenient, end-to-end solution that can be used to retrieve metrics from a variety of data sources and construct dashboards for visualizing them. The list of supported data sources includes Prometheus, InfluxDB, Graphite, Google Stackdriver, AWS CloudWatch, Azure Monitor, SQL databases (MySQL, Postgres, and SQL Server), and Elasticsearch.

If you have already set up one of the preceding data sources and want to evaluate Grafana, the easiest way to do so is to spin up...

Using Prometheus as an end-to-end solution for alerting

By instrumenting our applications and deploying the necessary infrastructure for scraping metrics, we now have the means for evaluating the SLIs for each of our services. Once we define a suitable set of SLOs for each of the SLIs, the next item on our checklist is to deploy an alert system so that we are automatically notified whenever our SLOs are no longer being met.

A typical alert specification looks like this:

When the value of metric X exceeds threshold Y for Z time units, then execute actions a_1, a_2, ..., a_n

What is the first thought that springs to mind when you hear a fire alarm going off? Most people will probably answer something along the lines of "there might be a fire nearby." People are naturally conditioned to assume that alerts are always temporally correlated with an issue that must be addressed immediately...

Summary

At the start of this chapter, we talked about the pros and cons of using a metrics collection system such as Prometheus to scrape and aggregate metrics data not only from our deployed applications but also from our infrastructure (for example, Kubernetes master/worker nodes).

Then, we learned how to leverage the official Prometheus client package for Go to instrument our code and export the collected metrics over HTTP so that they can be scraped by Prometheus. Next, we extolled the benefits of using Grafana for building dashboards by pulling in metrics from heterogeneous sources. In the final part of this chapter, we learned how to define alert rules in Prometheus and gained a solid understanding of using the Alertmanager tool to group, deduplicate, and route alert events that are emitted by Prometheus.

By exploiting the knowledge gained from this chapter, you will be...

Questions

What is the difference between an SLI and an SLO?
Explain how SLAs work.
What is the difference between a push- and pull-based metrics collection system?
Would you use a push- or pull-based system to scrape data from a tightly locked down (that is, no ingress) subnet?
What is the difference between a Prometheus counter and a gauge metric?
Why is it important for page notifications to be accompanied by a link to a playbook?