You're reading from Learn Grafana 10.x - Second Edition

Product type: Book
Published in: Dec 2023
Publisher: Packt
ISBN-13: 9781803231082
Edition: 2nd Edition
Author: Eric Salituro

Eric Salituro is currently a Software Engineering Manager with the Enterprise Data and Analytics Platform team at Zendesk. He has an IT career spanning over 30 years, over 20 of which were in the motion picture industry working as a pipeline technical director and software developer for innovative and creative studios like DreamWorks, Digital Domain, and Pixar. Before moving to Zendesk, he worked at Pixar helping to manage and maintain their production render farm as a Senior Software Developer. Among his accomplishments there was the development of a Python API toolkit for Grafana aimed at streamlining the creation of rendering metrics dashboards.
Monitoring Data Streams with Grafana Alerts

In our last chapter, we explored the world of real-time data streaming by combining Telegraf’s input plugins to capture CPU metrics, and by simulating an IoT metrics pipeline with the addition of a Mosquitto broker and a simple Python script standing in for an IoT device.

As part of our three-chapter exploration of Grafana’s observability features, we’re going to move from simply streaming metrics to adding a key observability feature: the ability to trigger an alert when certain conditions are met. Without the ability to monitor our systems and raise an alert when we detect anomalous behavior, we risk deterioration, instability, or even significant outages.

We’ll start out by discussing aspects of monitoring and observability with an eye toward good strategies for identifying the alert conditions we want to watch for. Next, we’ll talk about Grafana’s alerting features, especially the...

Technical requirements

Tutorial code, dashboards, and other helpful files for this chapter can be found in the book’s GitHub repository at https://github.com/PacktPublishing/Learn-Grafana-10/tree/main/Chapter12.

Monitoring and observability

The key to observability is, of course, proper monitoring. Without the ability to monitor, there can be no awareness of the status of your systems and, consequently, no way to react to changes in those systems, be they adverse or otherwise.

In our examples, we will be looking at two kinds of monitoring: an orchestrated computing platform such as Docker Compose or Kubernetes, and a lightweight application such as a web server. We will be using the techniques we demonstrated in the previous chapter to track metrics generated by Telegraf; these principles are the same whether we’re talking about small servers or massive compute clouds. But first, let’s discuss some key concepts.

Monitoring processes

When we look at monitoring, whether it’s on-premises or in the cloud, we tend to see the world from two main perspectives: how the system manages its processes, or the processes themselves. Either you are monitoring how processes are...

Alerting in Grafana

Grafana alerting has evolved significantly over the last few versions into a complex, powerful, and versatile system for combining monitoring, alerting, and notification. Its power can be a bit intimidating, but bear in mind that you may not need every capability in Grafana alerting.

We’ll take things step by step, so you can see how the parts fit together. Once you understand the basics, if you run into a more complex observability scenario, you will know how best to extend your own alerting to accommodate it.

Let’s start by reviewing how Grafana alerting works. There are four main components to Grafana alerting: alert rules, labels, notification policies, and contact points. We’ll go over their roles one by one.

Alert rules

Alert rules are the trigger mechanism for Grafana alerting. You can have all the metrics you want streaming into Grafana, but if you don’t have any alert rules, how would you know when something is wrong...

Defining alert rules

Let’s start off by talking about how we want to look at triggering alerts. To build an alert, you will need to answer a series of questions in this form:

What condition must exist as measured by what metrics, and for how long?

Let’s break this concept down into its constituent parts.

What condition…

An alert ultimately boils down to a switch: each time the rule is evaluated (once per evaluation interval), the alert either needs to be triggered or it does not. How you determine whether the alert should be in a triggered (or firing) state is called the alert condition. Most of the work you will do in defining an alert condition consists of reducing metrics data to a simple Boolean yes-or-no assertion about whether an alert should be triggered.
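To make the idea of reducing metrics to a yes-or-no answer concrete, here is a minimal Python sketch. It is an illustration only, not how Grafana implements alert evaluation: the function names, the choice of the mean as the reducer, the state names, and the sample CPU values are all assumptions made for this example. Each inner list stands for the samples returned by one evaluation, and the "for how long" part of the question is modeled as a number of consecutive breaching evaluations.

```python
from statistics import mean
from typing import Callable, Sequence


def alert_condition(
    samples: Sequence[float],
    reducer: Callable[[Sequence[float]], float],
    threshold: float,
) -> bool:
    """Reduce a window of samples to one number, then compare it to a threshold."""
    if not samples:
        # Assumption for this sketch: an empty window is treated as not breaching.
        return False
    return reducer(samples) > threshold


def alert_states(
    windows: Sequence[Sequence[float]],
    threshold: float,
    pending_evaluations: int,
    reducer: Callable[[Sequence[float]], float] = mean,
) -> list[str]:
    """Evaluate the condition once per window and track how long it has held.

    The alert is "pending" while the condition has held for at most
    `pending_evaluations` consecutive evaluations, and "firing" after that.
    """
    states = []
    breaches = 0  # consecutive evaluations where the condition was true
    for window in windows:
        if alert_condition(window, reducer, threshold):
            breaches += 1
        else:
            breaches = 0
        if breaches == 0:
            states.append("normal")
        elif breaches <= pending_evaluations:
            states.append("pending")
        else:
            states.append("firing")
    return states


# Hypothetical CPU usage percentages, one inner list per evaluation.
cpu_windows = [[40, 50], [85, 95], [90, 92], [91, 99], [30, 20]]
print(alert_states(cpu_windows, threshold=80, pending_evaluations=2))
# prints ['normal', 'pending', 'pending', 'firing', 'normal']
```

Swapping the reducer for `max` or `min` changes what "the condition" means without touching the rest of the logic, which is the point of separating the reduction step from the threshold comparison.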

Space prevents us from devoting an entire chapter to exploring the possible ways to define alert conditions, but I can offer some heuristics for identifying possible alert conditions:

  • Is the condition based upon...

Alert messaging to contact points

Before we can establish the notification policies that will direct our alerts to contact points, we should first define our contact points. Grafana supports an ever-growing number of contact points, so to find out more about a specific contact point, consult the Grafana documentation.

We are going to concentrate on three common contact points that cover many typical use cases. We’ll use an email contact point to represent the typical use case where email is the destination for all alert messages. The Slack contact point is used when the alert needs to be visible to a defined group, such as the members of a Slack channel. Our final contact point is PagerDuty, a destination for alerts that need to be directed to a specific person or team for potentially immediate action.

Configuring an email contact point

One of the most common and oldest forms of contact point is simply good old email. Nearly everyone has it, and access to email servers...

Routing alerts with notification policies

Now that we have our alert rules and our contact points, we’re able to link them up using our notification policies. One of the most common notification policies is to match up an alert severity with a particular contact point. That is why we initially set a severity label when we created our alert rules.

Now that we have our severity label, we can use it in a notification policy, so let’s set up such a policy. A notification policy can be as simple or complex as you want. The point is to use the information represented in the labels to determine which contact point(s) should receive your alert.

For example, you may have a situation where you want all your low-severity (informational) incidents to go to an email address, but you want medium-severity (actionable, normal response) incidents to go to Slack or Discord, and your high-severity (actionable, immediate response) incidents to go to PagerDuty...
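The severity-based routing described above can be sketched in a few lines of Python. This is a simplified, hypothetical model rather than Grafana's actual policy engine: it uses a flat, first-match list of exact label matchers, and the contact point names (`email`, `slack`, `pagerduty`) and the default fallback are assumptions chosen to mirror the example.

```python
from typing import Dict, List, Tuple

# Each policy pairs a set of required label values with a contact point name.
Policy = Tuple[Dict[str, str], str]

# Hypothetical policies mirroring the low/medium/high severity example.
POLICIES: List[Policy] = [
    ({"severity": "low"}, "email"),
    ({"severity": "medium"}, "slack"),
    ({"severity": "high"}, "pagerduty"),
]

# Assumption: alerts that match no policy fall back to a default contact point.
DEFAULT_CONTACT_POINT = "email"


def route(labels: Dict[str, str]) -> str:
    """Return the contact point of the first policy whose matchers all match."""
    for matchers, contact_point in POLICIES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return DEFAULT_CONTACT_POINT


print(route({"alertname": "HighCPU", "severity": "high"}))   # prints pagerduty
print(route({"alertname": "DiskUsage", "severity": "low"}))  # prints email
print(route({"alertname": "Unlabeled"}))                     # prints email
```

Because the routing decision depends only on labels, the alert rules never need to know which contact points exist; changing who gets notified means editing the policies, not the rules.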

Summary

This was an extensive chapter, and we covered a lot of ground. Observability is becoming a large, important technology sector and Grafana is keeping up by making its alerting capabilities more powerful and versatile.

In this chapter, we set up monitoring for both Docker and NGINX using InfluxDB as a data source. We created alert rules to query and analyze the data from our monitoring, used expressions to reduce the data to a single value, and created expressions to evaluate that value for violating conditions that might need to trigger an alert. We integrated Grafana contact points with email, PagerDuty, and Slack to receive our alerts with messages that contain annotation data set by our alert rules evaluation behavior. We also established notification policies to route our alert messages to different contact points based on the severity derived from alert rule labels. Finally, we briefly considered how to set up mute timings for when we might want to disable certain alert...
