Notifications and Events in Nagios 3.0: Part 2

by Wojciech Kocjan | May 2009 | Linux Servers Networking & Telephony Open Source

This is the second part of a two-part series by Wojciech Kocjan on notifications and events in Nagios 3.0. The first part covered effective notifications and escalations.

In this article, we will cover the following sub-topics:

  • External Commands
  • Event Handlers
  • Modifying Notifications
  • Adaptive Monitoring

 

External Commands

Nagios offers a very powerful mechanism for receiving events and commands from external applications: the external commands pipe. This is a named pipe (FIFO) created on the file system that Nagios uses to receive incoming messages. The file is named rw/nagios.cmd and is located in the directory passed as the localstatedir option during compilation. If you followed the compilation and installation instructions and the given guidelines, the full path will be /var/nagios/rw/nagios.cmd.

The communication does not use any authentication or authorization—the only requirement is to have write access to the pipe file. An external command file is usually writable by the owner and the group; the usual group used is nagioscmd. If you want a user to be able to send commands to the Nagios daemon, simply add that user to this group.

A small limitation of the command pipe is that there is no way to get any results back and so it is not possible to send any query commands to Nagios. Therefore, by just using the command pipe, you have no verification that the command you have just passed to Nagios has actually been processed, or will be processed soon. It is, however, possible to read the Nagios log file and check if it indicates that the command has been parsed correctly, if necessary.
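One way to perform such a check from a script is to search the Nagios log for the command text. The sketch below assumes that logging of external commands is enabled (the log_external_commands option), so that each processed command appears in the log as an EXTERNAL COMMAND entry, and that the log file is /var/nagios/nagios.log. Both the path and the helper function names are assumptions made for illustration; adjust them to your installation.

```shell
#!/bin/sh
# Assumed log location; override NAGIOS_LOG if your installation differs.
NAGIOS_LOG="${NAGIOS_LOG:-/var/nagios/nagios.log}"

# command_was_logged TEXT
# Succeeds if the log contains an EXTERNAL COMMAND entry starting with TEXT.
command_was_logged() {
    grep -F -q "EXTERNAL COMMAND: $1" "$NAGIOS_LOG"
}

# Example: report whether a comment command for somehost was processed.
# (Wrapped in a function so that nothing runs until you call it.)
report_comment_status() {
    if command_was_logged "ADD_HOST_COMMENT;somehost"; then
        echo "command was processed"
    else
        echo "command not found in log"
    fi
}
```

Because Nagios reads the pipe asynchronously, a script using this approach would normally wait a moment between sending the command and searching the log.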

An external command pipe is used by the web interface to control how Nagios works. The web interface does not use any other means to send commands or apply changes to Nagios. This gives a good understanding of what can be done with the external command pipe interface.

From the Nagios daemon perspective, there is no clear distinction as to who can perform what operations. Therefore, if you plan to use the external command pipe to allow users to submit commands remotely, you need to make sure that the authorization is in place as well so that it is not possible for unauthorized users to send potentially dangerous commands to Nagios.

The syntax for formatting commands is easy. Each command must be placed on a single line and end with a newline character. The syntax is as follows:

[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN

TIMESTAMP is written as UNIX time, that is, the number of seconds since 1970-01-01 00:00:00 UTC. It can be generated with the date +%s system command, and most programming languages also offer a way to get the current UNIX time. COMMAND_NAME is written in upper case and must be one of the commands that Nagios can execute; the arguments depend on the actual command.

For example, to add a comment to a host stating that it has passed a security audit, one can use the following shell command:

echo "[$(date +%s)] ADD_HOST_COMMENT;somehost;1;Security Audit;This host has passed security audit on $(date +%Y-%m-%d)" \
  >/var/nagios/rw/nagios.cmd

This will send an ADD_HOST_COMMENT command (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=1) to Nagios over the external command pipe. Nagios will then add a comment to the host somehost, recording Security Audit as its author. The first argument specifies the host name to add the comment to; the second tells Nagios whether the comment should be persistent. The third argument is the author of the comment, and the last argument is the actual comment text.
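The command format lends itself to a small wrapper function. The following is a minimal sketch, not part of Nagios itself: the nagios_cmd function name and the NAGIOS_CMD_FILE variable are hypothetical, and the default path assumes the installation layout used in this article.

```shell
#!/bin/sh
# Hypothetical helper; NAGIOS_CMD_FILE and nagios_cmd are illustrative names.
NAGIOS_CMD_FILE="${NAGIOS_CMD_FILE:-/var/nagios/rw/nagios.cmd}"

# nagios_cmd COMMAND_NAME [argument ...]
# Joins the arguments with semicolons, prefixes the current UNIX time,
# and writes the result as a single line to the command file.
nagios_cmd() {
    if [ ! -w "$NAGIOS_CMD_FILE" ]; then
        echo "cannot write to $NAGIOS_CMD_FILE" >&2
        return 1
    fi
    name=$1
    shift
    line="[$(date +%s)] $name"
    for arg in "$@"; do
        line="$line;$arg"
    done
    # printf emits exactly one newline-terminated line
    printf '%s\n' "$line" >>"$NAGIOS_CMD_FILE"
}

# Usage:
#   nagios_cmd ADD_HOST_COMMENT somehost 1 "Security Audit" "Passed audit"
```

The sketch does no escaping, so an argument that itself contains a semicolon or a newline would be misread by Nagios.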

Similarly, adding a comment to a service requires the ADD_SVC_COMMENT command (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=1). Its syntax is very similar to that of ADD_HOST_COMMENT, except that it requires both the host name and the service description.

For example, to add a comment to a service stating that it has been restarted, you should use the following:

echo "[$(date +%s)] ADD_SVC_COMMENT;router;OpenVPN;1;nagiosadmin;Restarting the OpenVPN service" \
  >/var/nagios/rw/nagios.cmd

The first argument specifies the host name to add the comment to; the second is the description of the service to which Nagios should add the comment. The next argument tells Nagios if this comment should be persistent. The fourth argument describes the author of the comment, and the last argument specifies actual comment text.

You can also delete a single comment or all comments using the DEL_HOST_COMMENT (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=3), DEL_ALL_HOST_COMMENTS (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=13), DEL_SVC_COMMENT (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=4), and DEL_ALL_SVC_COMMENTS (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=14) commands.

Other commands worth mentioning are related to scheduling checks on demand. Very often, it is necessary to request that a check be carried out as soon as possible; for example, when testing a solution.

This time, let's create a script that schedules a check of a host, all services on that host, and a service on a different host, as follows:

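#!/bin/sh
# current UNIX time: used as the command timestamp and as the requested check time
NOW=$(date +%s)
# check the host itself
echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW" >/var/nagios/rw/nagios.cmd
# check all services on that host
echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW" >/var/nagios/rw/nagios.cmd
# check a single service on a different host
echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW" >/var/nagios/rw/nagios.cmd
exit 0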

The SCHEDULE_HOST_CHECK (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=127) and SCHEDULE_HOST_SVC_CHECKS (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=30) commands accept a host name and the time at which the check should be scheduled. The SCHEDULE_SVC_CHECK command (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=29) requires a service description as well as the name of the host to schedule the check on.

Normal scheduled checks, such as the ones scheduled above, might not actually take place at the time that you requested. Nagios also takes the allowed time periods into account, and verifies whether checks have been disabled for the particular object or globally for the entire Nagios instance.

There are cases when you'll need to force Nagios to do a check. In such cases, you should use the SCHEDULE_FORCED_HOST_CHECK (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=128), SCHEDULE_FORCED_HOST_SVC_CHECKS (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=130), and SCHEDULE_FORCED_SVC_CHECK (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=129) commands. They work in exactly the same way as described above, but make Nagios skip checking the time periods and whether checks are disabled for this particular object. This way, a check will always be performed, regardless of other Nagios parameters.
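As a sketch of the forced variants, the function below prints the two host-related commands for a given host, using the current time as both the timestamp and the check time. The forced_check_commands name is made up for this example; redirect its output to the command pipe to actually submit the commands.

```shell
#!/bin/sh
# forced_check_commands HOSTNAME
# Prints external commands that force an immediate check of the host
# and of all services on it.
forced_check_commands() {
    now=$(date +%s)
    printf '[%s] SCHEDULE_FORCED_HOST_CHECK;%s;%s\n' "$now" "$1" "$now"
    printf '[%s] SCHEDULE_FORCED_HOST_SVC_CHECKS;%s;%s\n' "$now" "$1" "$now"
}

# Usage (path as used elsewhere in this article):
#   forced_check_commands somehost >/var/nagios/rw/nagios.cmd
```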

Other commands worth using are related to custom variables, introduced in Nagios 3. When you define a custom variable for a host, service, or contact, you can change its value on the fly with the external command pipe.

As these variables can then be directly used by check or notification commands and event handlers, it is possible to make other applications or event handlers change these attributes directly without modifications to the configuration files.

A good example would be IT staff registering their presence via an application without any GUI. This application periodically sends the latest known IP address of the person's computer, and that information is passed to Nagios on the assumption that the person is in the office. A notification command can later use that specific IP address when sending a message to the user.

Assuming that the user name is jdoe and the custom variable name is DESKTOPIP, the message that would be sent to the Nagios external command pipe would be as follows:

[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78

This would cause a later use of $_CONTACTDESKTOPIP$ to return a value of 12.34.56.78.
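The presence application described above could be as small as the following sketch. The register_presence function name is hypothetical, and how the application learns the current IP address is left out. The function only prints the command, so its output has to be redirected to the command pipe.

```shell
#!/bin/sh
# register_presence CONTACT IPADDRESS
# Prints a command that stores the address in the contact's DESKTOPIP variable.
register_presence() {
    printf '[%s] CHANGE_CUSTOM_CONTACT_VAR;%s;DESKTOPIP;%s\n' \
        "$(date +%s)" "$1" "$2"
}

# Usage:
#   register_presence jdoe 12.34.56.78 >/var/nagios/rw/nagios.cmd
```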

Nagios offers the CHANGE_CUSTOM_CONTACT_VAR (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=141), CHANGE_CUSTOM_HOST_VAR (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=139), and CHANGE_CUSTOM_SVC_VAR (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=140) commands for modifying custom variables in contacts, hosts, and services respectively.

The commands explained above are just a very small subset of the full capabilities of the Nagios external command pipe. For a complete list of commands, see the External Command List at http://www.nagios.org/developerinfo/externalcommands/commandlist.php. External commands are usually sent from event handlers or from the Nagios web interface. You will find external commands most useful when writing event handlers for your system, or when writing an external application that interacts with Nagios.

Event Handlers

Event handlers are commands that are triggered whenever the state of a host or service changes. They offer functionality similar to notifications. The main difference is that event handlers are called for each type of change, including each soft state change. This provides the ability to react to a problem before Nagios treats it as a hard state and sends out notifications about it. Another difference is what event handlers are meant to do: instead of notifying users that there is a problem, they carry out actions automatically.

For example, if a service is defined with max_check_attempts set to 4, retry_interval set to 1, and check_interval set to 5, then the following example illustrates when event handlers would be triggered, and with what values of the $SERVICESTATE$, $SERVICESTATETYPE$, and $SERVICEATTEMPT$ macro definitions:

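[Diagram: timeline of service checks, showing the minutes at which event handlers are triggered and the macro values passed to them]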

Event handlers are triggered for each state change, for example, at minutes 10, 23, 28, and 29. When writing an event handler, it is necessary to check whether it should perform an action at that particular time or not. See the following example for more details.

A typical example might be that your web server process tends to crash once a month. Because this is rare enough, it is very difficult to debug and resolve it. Therefore, the best way to proceed is to restart the server automatically until a solution to the problem is found.

If your configuration has max_check_attempts set to 4, as in the example above, then a good place to try to restart the web server is after the third soft failure check—in the previous example, this would be minute 12.

Assuming that the restart has been successful, the diagram shown above would look like this:

[Diagram: the same timeline after a successful restart at the third soft check]

Please note that no hard critical state has occurred since the event handler resolved the problem. If a restart cannot resolve the issue, Nagios will only try it once, as the attempt is done only in the third soft check.

Event handlers are defined as commands, similar to check commands. The main difference is that the event handlers only use macro definitions to pass information to the actual event handling script. This implies that the $ARGn$ macro definitions cannot be used and arguments cannot be passed in the host or service definition by using the ! separator.

In the previous example, we would define the following command:

define command {
    command_name restart-apache2
    command_line $USER1$/events/restart_apache2 $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}

The command then needs to be added to the service. For both hosts and services, this requires adding an event_handler directive that specifies the command to be run for each event that is fired. In addition, it is good to set event_handler_enabled to 1 to make sure that event handlers are enabled for this object.

The following is an example of a service definition:

define service {
    host_name localhost
    service_description Webserver
    use apache
    event_handler restart-apache2
    event_handler_enabled 1
}

Finally, a short version of the script is as follows:

#!/bin/sh
# use variables for arguments
SERVICESTATE=$1
SERVICESTATETYPE=$2
SERVICEATTEMPT=$3
# we don't want to restart if current status is OK
if [ "$SERVICESTATE" != "OK" ] ; then
    # proceed only if we're in soft transition state
    if [ "$SERVICESTATETYPE" = "SOFT" ] ; then
        # proceed only if this is the 3rd attempt
        if [ "$SERVICEATTEMPT" = "3" ] ; then
            # restart Apache as system administrator
            sudo /etc/init.d/apache2 restart
        fi
    fi
fi
exit 0

As we're using sudo here, obviously the script needs an entry in the sudoers file to allow the nagios user to run the command without a password prompt. An example entry for the sudoers file would be as follows:

nagios ALL=NOPASSWD: /etc/init.d/apache2

This tells sudo that the nagios user can run the /etc/init.d/apache2 command without being asked for a password.

According to our script, the restart is only done after the third check fails. Assuming that the restart went correctly, the next Nagios check will report that Apache is running again. As the failure was only a soft state, Nagios has not sent out any notifications about the problem.

If the service does not restart correctly, the next check will cause Nagios to treat this failure as a hard state. At this point, notifications will be sent out to the object owners.

You can also try performing a restart in the second check. If that did not help, then during the third attempt, the script can forcefully terminate all Apache2 processes using the killall or pkill command. After this has been done, it can try to start the service again. For example:

# proceed only if this is the 2nd attempt
if [ "$SERVICEATTEMPT" = "2" ] ; then
    # restart Apache as system administrator
    sudo /etc/init.d/apache2 restart
fi
# proceed only if this is the 3rd attempt
if [ "$SERVICEATTEMPT" = "3" ] ; then
    # try to terminate apache2 processes as system administrator
    # (the sudoers file must also allow the nagios user to run pkill)
    sudo pkill apache2
    # start Apache as system administrator
    sudo /etc/init.d/apache2 start
fi

Another common scenario is to restart one service if another one has just recovered—for example, you might want to restart email servers that use a database for authentication if the database has just recovered from a failure state. The reason for doing this is that some applications may not handle disconnected database handles correctly—this can lead to the service working correctly from the Nagios perspective, but not allowing some of the users in due to internal problems.
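A handler for this scenario might look like the sketch below. It assumes the command definition passes $SERVICESTATE$ and $SERVICESTATETYPE$ as the first two arguments, and it acts only on a hard OK state, on the assumption that a recovery from a hard failure is what should trigger the restart. The on_database_recovery function and the RESTART_CMD variable (with a placeholder init script as its default) are hypothetical.

```shell
#!/bin/sh
# Placeholder restart command for the dependent service; replace with your own.
RESTART_CMD="${RESTART_CMD:-sudo /etc/init.d/mailserver restart}"

# on_database_recovery SERVICESTATE SERVICESTATETYPE
# Restarts the dependent service only when the database is back in a hard OK state.
on_database_recovery() {
    if [ "$1" = "OK" ] && [ "$2" = "HARD" ]; then
        # left unquoted on purpose so the command and its arguments are split
        $RESTART_CMD
    fi
}

# In a real handler script the last line would be:
#   on_database_recovery "$1" "$2"
```

The handler would be attached to the database service, not to the email service, since it is the database's state change that triggers the restart.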

If you have set this up for hosts or services, it is recommended that you keep flap detection enabled for them. It often happens that, due to incorrectly planned scripts and the relations between them, some services end up being stopped and started over and over again.

In such cases, Nagios will detect these problems and stop running event handlers for these services, which will cause fewer malfunctions to occur. It is also recommended that you keep notifications set up so that people also get information on when flapping starts and stops.

Modifying Notifications

An interesting new feature in Nagios 3 is the ability to change various parameters related to notifications. These parameters are modified via an external command pipe, similar to a few of the commands shown in the previous section.

A good example would be when Nagios contact persons have their workstations connected to the local network only when they are actually at work (which is usually the case if they are using notebooks), and turn their computers off when they leave work. In such a case, a ping check for a person's computer could trigger an event handler to toggle that person's attributes.

Let's assume that our user jdoe has two actual contacts, jdoe-email and jdoe-jabber, each for a different type of notification. We can set up a host corresponding to the jdoe workstation. We will also set it up to be monitored every five minutes and create an event handler for it. The handler will change jdoe-jabber's host and service notification time periods to none on a hard host down state. On a host up state change, the time periods for jdoe-jabber will be set to 24x7. This way, the user will only get Jabber notifications if he or she is at work.

Nagios offers commands to change the time periods during which a user wants to receive notifications. The commands for this purpose are CHANGE_CONTACT_HOST_NOTIFICATION_TIMEPERIOD (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=153) and CHANGE_CONTACT_SVC_NOTIFICATION_TIMEPERIOD (see http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=152). Both commands take the contact name and the time period name as their arguments.

An event handler script that modifies the user's contact time period based on state is as follows:

#!/bin/sh
NOW=$(date +%s)
# $1 is the user name, for example jdoe; the Jabber contact is then jdoe-jabber
CONTACTNAME=$1-jabber
# $2 is the host state and $3 is the state type
if [ "$2,$3" = "DOWN,HARD" ] ; then
    TP=none
else
    TP=24x7
fi
echo "[$NOW] CHANGE_CONTACT_HOST_NOTIFICATION_TIMEPERIOD;$CONTACTNAME;$TP" \
    >/var/nagios/rw/nagios.cmd
echo "[$NOW] CHANGE_CONTACT_SVC_NOTIFICATION_TIMEPERIOD;$CONTACTNAME;$TP" \
    >/var/nagios/rw/nagios.cmd
exit 0

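Because the script runs as a host event handler and compares against the DOWN state, the command should pass the user's name (jdoe in this example) as the first parameter, followed by $HOSTSTATE$ and $HOSTSTATETYPE$.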

If you need a notification about a problem to be sent again, you should use the SEND_CUSTOM_HOST_NOTIFICATION or SEND_CUSTOM_SVC_NOTIFICATION command. These commands take the host name, or the host name and service description, followed by additional options, the author name, and a comment that should be put in the notification. The options specify whether the notification should also be sent to all escalation levels (a value of 1), whether Nagios should skip time periods for specific users (a value of 2), and whether Nagios should increment the notification counters (a value of 4). Options are combined as a bitmask, so a value of 7 (1+2+4) enables all of them: the notification is sent to all people including escalations, it is forced, and the notification counters are increased. A value of 3 (1+2) means that the notification is broadcast to all escalations and the time periods are skipped.

To send a custom notification about the main router to all users including escalations, you should send the following command to Nagios:

[1206096000] SEND_CUSTOM_HOST_NOTIFICATION;router1;3;jdoe;RESPOND ASAP
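The option value can be computed instead of hard-coded. The sketch below names the three flags and combines them with a bitwise OR; the variable and function names are illustrative, not part of Nagios. It only prints the resulting command line, which would then be redirected to the command pipe.

```shell
#!/bin/sh
# Option values as described in the text.
OPT_BROADCAST=1   # include all escalation levels
OPT_FORCED=2      # skip time period checks
OPT_INCREMENT=4   # increment the notification counter

# custom_host_notification HOST OPTIONS AUTHOR COMMENT
# Prints a SEND_CUSTOM_HOST_NOTIFICATION command line.
custom_host_notification() {
    printf '[%s] SEND_CUSTOM_HOST_NOTIFICATION;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Combine flags with bitwise OR: broadcast + forced = 3
OPTIONS=$((OPT_BROADCAST | OPT_FORCED))
custom_host_notification router1 "$OPTIONS" jdoe "RESPOND ASAP"
```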

Adaptive Monitoring

Nagios 3 introduces a very powerful feature called adaptive monitoring that allows the modification of various check-related parameters on the fly. This is done by sending a command to the Nagios external command pipe.

The first thing that can be changed on the fly is the command to be executed by Nagios, along with the arguments that will be passed to it; this is the equivalent of the check_command directive in the object definition. To do that, you can use the CHANGE_HOST_CHECK_COMMAND or CHANGE_SVC_CHECK_COMMAND command. These require the host name, or the host name and service description, and the check command as arguments.

This can be used to actually change how hosts or services are checked, or to only modify parameters that are passed to the check commands—for example, a check for ping latency can be modified based on whether a primary or a backup connection is used. An example to change a check command of a service, which changes the command and its specified parameters, is as follows:

[1206096000] CHANGE_SVC_CHECK_COMMAND;linux1;PING;check_ping!500.0,50%
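To illustrate the primary or backup connection case, the sketch below picks a check command string based on which link is active and prints the matching external command. The function name, the link names, and the threshold values for the primary link are made up for this example; only the backup value mirrors the command shown above.

```shell
#!/bin/sh
# ping_check_for_link LINK
# Prints a CHANGE_SVC_CHECK_COMMAND line for the PING service on linux1.
# Threshold values are illustrative only.
ping_check_for_link() {
    case "$1" in
        primary) check='check_ping!100.0,20%' ;;
        backup)  check='check_ping!500.0,50%' ;;
        *)       echo "unknown link: $1" >&2; return 1 ;;
    esac
    printf '[%s] CHANGE_SVC_CHECK_COMMAND;linux1;PING;%s\n' "$(date +%s)" "$check"
}

# Usage:
#   ping_check_for_link backup >/var/nagios/rw/nagios.cmd
```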

A similar possibility is to change the custom variables that are used later in a check command. Consider the following command and service definitions:

define command {
    command_name check-ping
    command_line $USER1$/check_ping -H $HOSTADDRESS$ -p $_SERVICEPACKETS$ -w $_SERVICEWARNING$ -c $_SERVICECRITICAL$
}
define service {
    host_name linux2
    service_description PING
    use ping
    _PACKETS 5
    _WARNING 100.0,40%
    _CRITICAL 300.0,60%
}

This example is very similar to the one we saw earlier. The main benefit is that parameters can be set independently—for example, one event handler might modify the number of packets to send while another one can modify the warning and/or critical state limits.

The following is an example that modifies the warning level for the PING service on the linux1 host:

[1206096000] CHANGE_CUSTOM_SVC_VAR;linux1;PING;_WARNING;500.0,50%

As is the case for check commands, it is also possible to modify event handlers on the fly. This can be used to enable or disable scripts that try to resolve a problem. To do this, you need to use the CHANGE_HOST_EVENT_HANDLER and CHANGE_SVC_EVENT_HANDLER commands.

In order to set an event handler command for the Apache2 service mentioned previously, you need to send the following command:

[1206096000] CHANGE_SVC_EVENT_HANDLER;localhost;Webserver;restart-apache2

Please note that setting an empty event handler disables any previous event handlers for this host or service. The same comment also applies for modifying the check command definition. In case you are modifying commands or event handlers, please make sure that the corresponding command definitions actually exist; otherwise, Nagios might reject your modifications.

Another feature that you can use to fine-tune the execution of checks is the ability to modify the time period during which a check should be performed. This is done with the CHANGE_HOST_CHECK_TIMEPERIOD and CHANGE_SVC_CHECK_TIMEPERIOD commands. Similar to the previous commands, these accept the host, or host and service names, and the new time period to be set. See the following example:

[1206096000] CHANGE_SVC_CHECK_TIMEPERIOD;localhost;Webserver;workinghours

As is the case with command names, you need to make sure that the time period you are requesting to be set exists in the Nagios configuration. Otherwise, Nagios will ignore this command and leave the current check time period.

Nagios also allows modifying the intervals between checks, both for normal checks and for retries during soft states. This is done through the CHANGE_NORMAL_HOST_CHECK_INTERVAL, CHANGE_RETRY_HOST_CHECK_INTERVAL, CHANGE_NORMAL_SVC_CHECK_INTERVAL, and CHANGE_RETRY_SVC_CHECK_INTERVAL commands. All of these commands require passing the host, or the host and service names, as well as the interval that should be set.

A typical example of when intervals would be modified on the fly is when the priority of a host or service depends on other parameters in your network. An example might be a backup server.

Making sure that the host and all of the services on it are working properly is very important before actually performing scheduled backups. During idle time, its priority might be much lower. Another case is that the backup server should be monitored more often when the primary server fails.

An example to modify the normal interval for a host to every 15 minutes is as follows:

[1206096000] CHANGE_NORMAL_HOST_CHECK_INTERVAL;backupserver;15
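The backup server scenario could be automated with a small script run before and after the backup window. In the sketch below, the backup_interval_command function name and the mode names are hypothetical, and the one-minute interval for the pre-backup mode is an illustrative value; the 15-minute interval matches the example above.

```shell
#!/bin/sh
# backup_interval_command MODE
# MODE is "prebackup" (check often) or "idle" (check rarely).
backup_interval_command() {
    case "$1" in
        prebackup) interval=1 ;;
        idle)      interval=15 ;;
        *)         echo "usage: backup_interval_command prebackup|idle" >&2
                   return 1 ;;
    esac
    printf '[%s] CHANGE_NORMAL_HOST_CHECK_INTERVAL;backupserver;%s\n' \
        "$(date +%s)" "$interval"
}

# Usage, for example from cron shortly before and after the backup window:
#   backup_interval_command prebackup >/var/nagios/rw/nagios.cmd
#   backup_interval_command idle      >/var/nagios/rw/nagios.cmd
```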

There is also the possibility to modify how many checks need to be performed before a state is considered to be hard. The commands for this are CHANGE_MAX_HOST_CHECK_ATTEMPTS and CHANGE_MAX_SVC_CHECK_ATTEMPTS.

The following is an example command that sets the maximum number of check attempts for a host to 5:

[1206096000] CHANGE_MAX_HOST_CHECK_ATTEMPTS;linux1;5

There are many more commands that allow the fine tuning of monitoring and checks on the fly. It is recommended that you get acquainted with all of the external commands that your version of Nagios supports, as mentioned in the section introducing the external commands pipe.

Summary

Nagios offers several ways to let people know that something is wrong. Notifications can range from simple emails to a complex system that uses multiple ways to communicate problems and chooses the appropriate one dynamically. This spares people from having to wade through their email, and helps in resolving issues much more effectively.

Nagios can deliver information about problems in almost any way you can possibly imagine. Notifications can be sent as emails, instant messages, and Windows messaging texts. You can also have text messages sent over GSM networks, whichever works best for you and your colleagues. You can even set up VoIP combined with speech synthesis to let people know what the problems are.

Another feature of Nagios that allows great flexibility is the external commands pipe. This offers a simple way to send commands directly to Nagios, and it can be used from any programming language. Commands can be sent in various situations, and they range from adding a comment to an object to a complete restart of Nagios. External commands also allow enabling and disabling checks, flap detection, and many other Nagios functions.

Sending commands to Nagios also provides Nagios' event handlers with the possibility to send commands that affect how Nagios performs and how it notifies users about problems or problem recoveries. It also allows fine tuning of the monitoring of your network infrastructure.

Nagios 3 provides huge advancements in this area, which makes it much easier to create a complex IT monitoring system. This is of great benefit to medium and large networks, where the ability to dynamically adapt to a situation is a must.

We have covered a lot in this 2-part series. To summarize, we have covered:

  • Effective Notifications
  • Escalations
  • External Commands
  • Event Handlers
  • Modifying Notifications
  • Adaptive Monitoring

About the Author

Wojciech Kocjan

Wojciech Kocjan is a system administrator and programmer with 10 years of experience. His work experience includes several years of using Nagios for enterprise IT infrastructure monitoring. He also has experience with a large variety of devices and servers, including routers, Linux, Solaris, and AIX servers, and i5/OS mainframes. His programming experience includes multiple languages (such as Java, Ruby, Python, and Perl) and focuses on web applications as well as client-server solutions.
