In this article by Giorgio Zarrelli, author of the book Mastering Bash, we take a step into the real world and create something that can turn out handy in your daily routine; during this process we will have a look at the common pitfalls in coding and at how to make our script reliable. Be it a short or a long script, we must always ask ourselves the same questions.
We will start by coding a Nagios plugin, which will give us a broad understanding of how this monitoring system works and of how to make a script interact dynamically with other programs.
Nagios is one of the most widely adopted open source IT infrastructure monitoring tools, and its most interesting feature is that it does not know how to monitor anything. It may sound like a joke, but Nagios can actually be defined as an evaluating core that takes some information as input and reacts accordingly. How is this information gathered? That is not the main concern of this tool, and this leads us to an interesting point: Nagios leaves the task of getting the monitored data to an external plugin, which connects to the service, gathers the relevant data, evaluates it against some thresholds, and informs Nagios whether the values gathered are within or beyond the boundaries that raise an alarm.
So, a plugin does a lot of things, and one could well ask: what does Nagios do, then? Imagine it as an exchange hub where information flows in and out and decisions are taken based on the configuration that has been set: the core triggers the plugin to monitor a service, the plugin returns some information, and Nagios takes a decision about what to do next: whether to raise an alarm, who to notify, and how.
The core Nagios program does everything except actually knocking at the door of a service, asking for information, and deciding whether that information points to any issue.
Planning must be done, but it can be fun.
To understand how to code a plugin, we first have to grasp how, on a broad scale, a Nagios check works. There are two different kinds of checks: active and passive.
Based on a time range, or manually triggered, an active check sees a plugin actively connecting to a service and collecting information. A typical example could be a plugin that checks the disk space: once invoked, it interfaces with (usually) the operating system, executes a df command, works on the output, extracts the value related to the disk space, evaluates it against some thresholds, and reports back a status such as OK, WARNING, CRITICAL, or UNKNOWN.
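As a minimal sketch of such an active check (the mount point and the thresholds are illustrative choices of ours, not values mandated by Nagios), the whole flow fits in a few lines of Bash:

#!/bin/bash
# Minimal active check sketch: disk usage of /, with illustrative thresholds.
WARN=80
CRIT=90

# Extract the usage percentage of the root filesystem from df.
usage=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ -z "$usage" ]; then
    echo "UNKNOWN - could not read disk usage"
    exit 3
elif [ "$usage" -ge "$CRIT" ]; then
    echo "CRITICAL - disk usage at ${usage}%"
    exit 2
elif [ "$usage" -ge "$WARN" ]; then
    echo "WARNING - disk usage at ${usage}%"
    exit 1
else
    echo "OK - disk usage at ${usage}%"
    exit 0
fi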
With a passive check, Nagios does not trigger anything but waits to be contacted, by some means, by the service that must be monitored. It may seem quite confusing, so let's take a real-life example. How would you monitor whether a disk backup has been completed successfully? One quick answer would be: knowing when the backup task starts and how long it lasts, we can define a time and invoke a script to check the task at that given hour. Nice, but when we plan something we must have a full understanding of how real life goes, and a backup is not our little pet in the living room; it is rather a beast that does what it wants. A backup can last a variable amount of time, depending on unpredictable factors.
For instance, say your typical backup task copies 1 TB of data in 2 hours, starting at 03:00, out of a 6 TB disk. So, the next backup task would start at 03:00+02:00=05:00 AM, give or take some minutes; you set up an active check for it at 05:30 and it works well for a couple of months. Then, one early morning, you receive a notification on your smartphone: the backup is CRITICAL. You wake up, connect to the backup console, and see that at 06:00 in the morning the backup task has not even been started by the console. You then have to wait until 08:00 AM, when one of your colleagues shows up at the office, to find out that the day before the disk you back up was filled with 2 extra TB of data due to an unscheduled data transfer. So, the backup task preceding the one you are monitoring lasted not a couple of hours but six, and the task you are monitoring started at 09:30 AM.
Long story short, your active check was fired too early; that is why it failed. Maybe you are tempted to move your schedule some hours ahead, but simply do not do it: these time slots are not sliding frames. If you moved this check ahead, you would then have to move all the checks for the subsequent tasks ahead as well. The week after you do it, the project manager will have someone delete the 2 TB in excess (they are of no more use to the project), and your schedules will be hours out of step again, making your monitoring useless. So, as we insisted before, planning and analyzing the context are the key factors in making a good script and, in this case, a good plugin. We are dealing with a service that does not run 24/7 like a web service or a mail service does: what is specific to the backup is that it runs periodically, but we do not know exactly when.
The best approach to this kind of monitoring is letting the service itself notify us when it has finished its task and what the outcome was. That is usually accomplished by using the ability of most backup programs to send a Simple Network Management Protocol (SNMP) trap to a destination to inform it of the outcome; in our case, the destination would be the Nagios server, configured to receive the trap and analyze it. Add to this an event horizon, so that if we do not receive that specific trap within, let's say, 24 hours, we raise an alarm anyway, and you are covered: whenever the backup task gets completed, or when it times out, we receive a notification.
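Besides SNMP traps, another common way to feed Nagios a passive result is its external command file. The following sketch assumes the default path of a source installation and hypothetical host and service names; adjust them to your setup:

#!/bin/bash
# Sketch: submit a passive check result through the Nagios external
# command file (a named pipe). Path, host and service are assumptions.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
HOST=backupserver
SERVICE="Nightly backup"
STATUS=0                       # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OUTPUT="OK - backup completed"

# Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;status;output
printf "[%u] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%u;%s\n" \
    "$(date +%s)" "$HOST" "$SERVICE" "$STATUS" "$OUTPUT" > "$CMDFILE"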
Before coding a plugin, we must look at some concepts that will be the stepping stones of our Nagios coding, one of these being the return codes of the plugin itself. As we already discussed, once the plugin collects the data about how the service is doing, it evaluates this data and determines whether the situation falls under one of the following statuses:
| Return code | Status | Description |
| --- | --- | --- |
| 0 | OK | The plugin checked the service and the results are inside the acceptable range |
| 1 | WARNING | The plugin checked the service and the results are above the warning threshold; we must keep an eye on the service |
| 2 | CRITICAL | The plugin checked the service and the results are above the critical threshold, or the service is not responding; we must react now |
| 3 | UNKNOWN | Either we passed the wrong arguments to the plugin or there is some internal error in it |
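To keep a plugin readable, it is common practice to give these numbers a name once at the top of the script and always exit through the named constants (the variable names below mirror those used by the stock Nagios plugins, but defining our own keeps this sketch self-contained):

#!/bin/bash
# Name the Nagios return codes once, so the rest of the plugin
# never has to deal with bare numbers.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

echo "OK - everything within the acceptable range"
exit "$STATE_OK"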
So, our plugin will check a service, evaluate the results and, based on the thresholds, return to Nagios one of the values listed in the table, together with a meaningful message, as we can see in the description column of the following figure:

[Figure: a Nagios service overview; notice the failing service check in red and the message next to each check]

In the image we can see that some checks are green, meaning OK, and they carry an explanatory message in the description section: what we see in this section is the output the plugin writes to stdout, and it is what we will craft as a response to Nagios.
Pay attention to the SSH check: it is red, and it is failing because it is checking the service on the default port, which is 22, while on this server the SSH daemon is listening on a different port. This leads us to a consideration: our plugin will need a command-line parser able to receive some configuration options, as well as some threshold limits, because we need to know what to check, where to check it, and what the acceptable working limits for the service are.
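A minimal sketch of such a parser, built on the Bash getopts builtin (the option letters follow the syntax used later in this article; the port option is our own addition for the SSH scenario):

#!/bin/bash
# Sketch of a command-line parser for our plugin: host, port and
# warning/critical thresholds.
while getopts "h:p:w:c:" option; do
    case "$option" in
        h) host="$OPTARG" ;;
        p) port="$OPTARG" ;;
        w) warning="$OPTARG" ;;
        c) critical="$OPTARG" ;;
        *) echo "UNKNOWN - usage: $0 -h host [-p port] -w value -c value"
           exit 3 ;;
    esac
done

# Refuse to run without the values needed to evaluate the service.
if [ -z "$host" ] || [ -z "$warning" ] || [ -z "$critical" ]; then
    echo "UNKNOWN - missing mandatory arguments"
    exit 3
fi

echo "Checking $host on port ${port:-22} with thresholds w=$warning c=$critical"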
That is all our script needs to know. Who to notify, when, how, how many times, and so forth: these are tasks carried out by the core, and a Nagios plugin is unaware of all of this. What it really must know for effective monitoring is which values identify a working service. We can pass two different kinds of value to our script: a warning threshold and a critical threshold.
So, when our plugin performs its check, it collects a numeric value that is within or outside a range, based on the thresholds we impose; then, based on that evaluation, it replies to Nagios with a return code and a message. How do we specify ranges on the command line? Essentially, in the following way:
[@]start_value:end_value
If the range starts from 0, everything up to and including the : can be omitted. The start_value must always be a number lower than end_value.
If the range is given as start_value: (nothing after the colon), it means from that number up to infinity. Negative infinity can be specified using ~.
An alert is generated when the collected value falls outside the specified range, the endpoints being part of the range.
If @ is specified, the alert is generated if the value resides inside the range.
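These rules can be condensed into a small Bash helper. The following is a minimal sketch for integer values only; in_alert is our own hypothetical function, not something provided by Nagios, and it succeeds (returns 0) when an alert must be raised:

#!/bin/bash
# Evaluate an integer value against a Nagios-style range ([@]start:end).
in_alert() {
    local range="$1" value="$2" inside=0 start=0 end outside=0

    # A leading @ inverts the logic: alert when INSIDE the range.
    if [[ "$range" == @* ]]; then
        inside=1
        range="${range#@}"
    fi

    if [[ "$range" == *:* ]]; then
        start="${range%%:*}"
        end="${range#*:}"
        [ -z "$start" ] && start=0   # ":10" behaves like "0:10"
    else
        end="$range"                 # a bare "10" means "0:10"
    fi

    # "~" is negative infinity, so the lower bound can never fail;
    # an empty end ("10:") means positive infinity.
    if [ "$start" != "~" ] && [ "$value" -lt "$start" ]; then
        outside=1
    elif [ -n "$end" ] && [ "$value" -gt "$end" ]; then
        outside=1
    fi

    if [ "$inside" -eq 1 ]; then
        [ "$outside" -eq 0 ]
    else
        [ "$outside" -eq 1 ]
    fi
}

# Example: CRITICAL if the value passed on the command line is outside 0:90.
if in_alert "90" "$1"; then
    echo "CRITICAL - value $1 is out of range"
    exit 2
fi
echo "OK - value $1 is within range"
exit 0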
Let's see some practical examples of how we would call our script, imposing some thresholds:
| Plugin call | Meaning |
| --- | --- |
| ./my_plugin -c 10 | CRITICAL if less than 0 or greater than 10 |
| ./my_plugin -w 10:20 | WARNING if less than 10 or greater than 20 |
| ./my_plugin -w ~:15 -c 16 | WARNING if greater than 15, CRITICAL if less than 0 or greater than 16 |
| ./my_plugin -c 35: | CRITICAL if the value collected is below 35 |
| ./my_plugin -w @100:200 | WARNING if the value is from 100 to 200, OK otherwise |
We have covered the basic requirements for our plugin, which in its simplest form should be called with the following syntax:
./my_plugin -h hostaddress|hostname -w value -c value
We already talked about the need to relate a check to a host, and we can do this using either a host name or a host address. It is up to us which one to use, but we will not fill in this piece of information ourselves, because it will be drawn from the service configuration as a standard macro. We have just introduced a new concept, the service configuration, which is essential in making our script work with Nagios.
In this article, we moved a step into the real world, creating something that can turn out handy in your daily routine, and during this process we also looked at the common pitfalls in coding and at how to make our scripts reliable.