In this article by Giorgio Zarrelli, author of the book Mastering Bash, we take a step into the real world and create something that can turn out handy in your daily routine; during this process we will have a look at the common pitfalls in coding and at how to make our script reliable. Be it a short or a long script, we must always ask ourselves the same questions.
We will start by coding a Nagios plugin, which will give us a broad understanding of how this monitoring system works and of how to make a script interact dynamically with other programs.
Nagios is one of the most widely adopted open source IT infrastructure monitoring tools, and its most interesting feature is that it does not know how to monitor anything. It may sound like a joke, but Nagios can actually be defined as an evaluating core that takes some information as input and reacts accordingly. How is this information gathered? That is not the main concern of this tool, and this leads us to an interesting point: Nagios leaves the task of getting the monitored data to an external plugin, which connects to the service, gathers the relevant data, evaluates it against some thresholds, and informs Nagios whether the values gathered are within or beyond the boundaries that raise an alarm.
So, a plugin does a lot of things, and one could well ask: what does Nagios do, then? Imagine it as an exchange hub where information flows in and out and decisions are taken based on the configuration that has been set: the core triggers the plugin to monitor a service, the plugin returns some information, and Nagios takes a decision about what to do next: whether to raise an alarm, who to notify, and how.
The core Nagios program does everything except actually knocking at the door of a service, asking for information, and deciding whether that information points to any issue.
Planning must be done, but it can be fun.
To understand how to code a plugin, we first have to grasp how, on a broad scale, a Nagios check works. There are two different kinds of checks: active and passive.
Based on a time range, or manually triggered, an active check sees a plugin actively connecting to a service and collecting information. A typical example could be a plugin that checks the disk space: once invoked, it interfaces with (usually) the operating system, executes a df command, works on the output, extracts the value related to the disk space, evaluates it against some thresholds, and reports back a status such as OK, WARNING, CRITICAL, or UNKNOWN.
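As a minimal sketch of such an active check (the mount point and the thresholds are illustrative choices of ours, not values mandated by Nagios), the whole flow fits in a few lines of Bash:

#!/bin/bash
# Minimal active check sketch: disk usage of /, with illustrative thresholds.
WARN=80
CRIT=90

# Extract the usage percentage of the root filesystem from df.
usage=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ -z "$usage" ]; then
    echo "UNKNOWN - could not read disk usage"
    exit 3
elif [ "$usage" -ge "$CRIT" ]; then
    echo "CRITICAL - disk usage at ${usage}%"
    exit 2
elif [ "$usage" -ge "$WARN" ]; then
    echo "WARNING - disk usage at ${usage}%"
    exit 1
else
    echo "OK - disk usage at ${usage}%"
    exit 0
fi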
With a passive check, Nagios does not trigger anything but waits to be contacted, by some means, by the service that must be monitored. It may seem quite confusing, so let's take a real-life example. How would you monitor whether a disk backup has been completed successfully? One quick answer would be: knowing when the backup task starts and how long it lasts, we can define a time and invoke a script to check the task at that given hour. Nice, but when we plan something we must have a full understanding of how real life goes, and a backup is not our little pet in the living room; it is rather a beast that does what it wants. A backup can last a variable amount of time, depending on unpredictable factors.
For instance, say your typical backup task copies 1 TB of data in 2 hours, starting at 03:00, out of a 6 TB disk. So, the next backup task would start at 03:00+02:00=05:00 AM, give or take some minutes; you set up an active check for it at 05:30 and it works well for a couple of months. Then, one early morning, you receive a notification on your smartphone: the backup is CRITICAL. You wake up, connect to the backup console, and see that at 06:00 in the morning the backup task has not even been started by the console. You then have to wait until 08:00 AM, when one of your colleagues shows up at the office, to find out that the day before the disk you back up was filled with 2 extra TB of data due to an unscheduled data transfer. So, the backup task preceding the one you are monitoring lasted not a couple of hours but six, and the task you are monitoring started at 09:30 AM.
Long story short, your active check was fired too early; that is why it failed. Maybe you are tempted to move your schedule some hours ahead, but simply do not do it: these time slots are not sliding frames. If you moved this check ahead, you would then have to move all the checks for the subsequent tasks ahead as well. The week after you do it, the project manager will have someone delete the 2 TB in excess (they are of no more use to the project), and your schedules will be hours out of step again, making your monitoring useless. So, as we insisted before, planning and analyzing the context are the key factors in making a good script and, in this case, a good plugin. We are dealing with a service that does not run 24/7 like a web service or a mail service does: what is specific to the backup is that it runs periodically, but we do not know exactly when.
The best approach to this kind of monitoring is letting the service itself notify us when it has finished its task and what the outcome was. That is usually accomplished by using the ability of most backup programs to send a Simple Network Management Protocol (SNMP) trap to a destination to inform it of the outcome; in our case, the destination would be the Nagios server, configured to receive the trap and analyze it. Add to this an event horizon, so that if we do not receive that specific trap within, let's say, 24 hours, we raise an alarm anyway, and you are covered: whenever the backup task gets completed, or when it times out, we receive a notification.
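Besides SNMP traps, another common way to feed Nagios a passive result is its external command file. The following sketch assumes the default path of a source installation and hypothetical host and service names; adjust them to your setup:

#!/bin/bash
# Sketch: submit a passive check result through the Nagios external
# command file (a named pipe). Path, host and service are assumptions.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
HOST=backupserver
SERVICE="Nightly backup"
STATUS=0                       # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OUTPUT="OK - backup completed"

# Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;status;output
printf "[%u] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%u;%s\n" \
    "$(date +%s)" "$HOST" "$SERVICE" "$STATUS" "$OUTPUT" > "$CMDFILE"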
Before coding a plugin, we must look at some concepts that will be the stepping stones of our Nagios coding, one of these being the return codes of the plugin itself. As we already discussed, once the plugin collects the data about how the service is doing, it evaluates this data and determines whether the situation falls under one of the following statuses:
| Return code | Status | Description |
| --- | --- | --- |
| 0 | OK | The plugin checked the service and the results are inside the acceptable range |
| 1 | WARNING | The plugin checked the service and the results are above the warning threshold; we must keep an eye on the service |
| 2 | CRITICAL | The plugin checked the service and the results are above the critical threshold, or the service is not responding; we must react now |
| 3 | UNKNOWN | Either we passed the wrong arguments to the plugin or there is some internal error in it |
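To keep a plugin readable, it is common practice to give these numbers a name once at the top of the script and always exit through the named constants (the variable names below mirror those used by the stock Nagios plugins, but defining our own keeps this sketch self-contained):

#!/bin/bash
# Name the Nagios return codes once, so the rest of the plugin
# never has to deal with bare numbers.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

echo "OK - everything within the acceptable range"
exit "$STATE_OK"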
So, our plugin will check a service, evaluate the results and, based on the thresholds, return to Nagios one of the values listed in the table, together with a meaningful message, as we can see in the description column of the following figure:

[Figure: a Nagios service overview; notice the failing service check in red and the message next to each check]

In the image we can see that some checks are green, meaning OK, and they carry an explanatory message in the description section: what we see in this section is the output the plugin writes to stdout, and it is what we will craft as a response to Nagios.
Pay attention to the SSH check: it is red, and it is failing because it is checking the service on the default port, which is 22, while on this server the SSH daemon is listening on a different port. This leads us to a consideration: our plugin will need a command-line parser able to receive some configuration options, as well as some threshold limits, because we need to know what to check, where to check it, and what the acceptable working limits for the service are.
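A minimal sketch of such a parser, built on the Bash getopts builtin (the option letters follow the syntax used later in this article; the port option is our own addition for the SSH scenario):

#!/bin/bash
# Sketch of a command-line parser for our plugin: host, port and
# warning/critical thresholds.
while getopts "h:p:w:c:" option; do
    case "$option" in
        h) host="$OPTARG" ;;
        p) port="$OPTARG" ;;
        w) warning="$OPTARG" ;;
        c) critical="$OPTARG" ;;
        *) echo "UNKNOWN - usage: $0 -h host [-p port] -w value -c value"
           exit 3 ;;
    esac
done

# Refuse to run without the values needed to evaluate the service.
if [ -z "$host" ] || [ -z "$warning" ] || [ -z "$critical" ]; then
    echo "UNKNOWN - missing mandatory arguments"
    exit 3
fi

echo "Checking $host on port ${port:-22} with thresholds w=$warning c=$critical"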
That is all our script needs to know. Who to notify, when, how, how many times, and so forth: these are tasks carried out by the core, and a Nagios plugin is unaware of all of this. What it really must know for effective monitoring is which values identify a working service. We can pass two different kinds of value to our script: a warning threshold and a critical threshold.
So, when our plugin performs its check, it collects a numeric value that is within or outside a range, based on the thresholds we impose; then, based on that evaluation, it replies to Nagios with a return code and a message. How do we specify ranges on the command line? Essentially, in the following way:
[@]start_value:end_value
If the range starts from 0, everything up to and including the : can be omitted. The start_value must always be a number lower than end_value.
If the range is given as start_value: (nothing after the colon), it means from that number up to infinity. Negative infinity can be specified using ~.
An alert is generated when the collected value falls outside the specified range, the endpoints being part of the range.
If @ is specified, the alert is generated if the value resides inside the range.
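These rules can be condensed into a small Bash helper. The following is a minimal sketch for integer values only; in_alert is our own hypothetical function, not something provided by Nagios, and it succeeds (returns 0) when an alert must be raised:

#!/bin/bash
# Evaluate an integer value against a Nagios-style range ([@]start:end).
in_alert() {
    local range="$1" value="$2" inside=0 start=0 end outside=0

    # A leading @ inverts the logic: alert when INSIDE the range.
    if [[ "$range" == @* ]]; then
        inside=1
        range="${range#@}"
    fi

    if [[ "$range" == *:* ]]; then
        start="${range%%:*}"
        end="${range#*:}"
        [ -z "$start" ] && start=0   # ":10" behaves like "0:10"
    else
        end="$range"                 # a bare "10" means "0:10"
    fi

    # "~" is negative infinity, so the lower bound can never fail;
    # an empty end ("10:") means positive infinity.
    if [ "$start" != "~" ] && [ "$value" -lt "$start" ]; then
        outside=1
    elif [ -n "$end" ] && [ "$value" -gt "$end" ]; then
        outside=1
    fi

    if [ "$inside" -eq 1 ]; then
        [ "$outside" -eq 0 ]
    else
        [ "$outside" -eq 1 ]
    fi
}

# Example: CRITICAL if the value passed on the command line is outside 0:90.
if in_alert "90" "$1"; then
    echo "CRITICAL - value $1 is out of range"
    exit 2
fi
echo "OK - value $1 is within range"
exit 0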
Let's see some practical examples of how we would call our script, imposing some thresholds:
| Plugin call | Meaning |
| --- | --- |
| ./my_plugin -c 10 | CRITICAL if less than 0 or greater than 10 |
| ./my_plugin -w 10:20 | WARNING if less than 10 or greater than 20 |
| ./my_plugin -w ~:15 -c 16 | WARNING if greater than 15, CRITICAL if less than 0 or greater than 16 |
| ./my_plugin -c 35: | CRITICAL if the value collected is below 35 |
| ./my_plugin -w @100:200 | WARNING if the value is from 100 to 200, OK otherwise |
We have covered the basic requirements for our plugin, which in its simplest form should be called with the following syntax:
./my_plugin -h hostaddress|hostname -w value -c value
We already talked about the need to relate a check to a host, and we can do this using either a host name or a host address. It is up to us which one to use, but we will not fill in this piece of information ourselves, because it will be drawn from the service configuration as a standard macro. We have just introduced a new concept, the service configuration, which is essential in making our script work with Nagios.
In this article, we moved a step into the real world, creating something that can turn out handy in your daily routine, and during this process we also looked at the common pitfalls in coding and at how to make our scripts reliable.