Notifications and Events in Nagios 3.0: Part 1

by Wojciech Kocjan | May 2009 | Linux Servers Networking & Telephony Open Source

This is a two-part series by Wojciech Kocjan that covers events and notifications in Nagios 3.0 in detail. The series covers the following topics:

  • Effective Notifications
  • Escalations
  • External Commands
  • Event Handlers
  • Modifying Notifications
  • Adaptive Monitoring

 

There is a lot that Nagios can do to make your life easier. Imagine that you set up Nagios to send a text message to your mobile phone during the daytime. It can also send you a message on Jabber or MSN. Imagine that you also make Nagios stop notifying you when your workstation is not online.

Even though the examples above seem complicated, they are actually quite simple to implement. It's a matter of combining event handlers with custom variables, and a little ingenuity. A service that checks whether a user's workstation is present can have an event handler that automatically enables and disables host and/or service notifications for a contact or contact group.

It's also possible to set up your monitoring to notify managers if the issue has not been fixed within a certain period of time. Based on the importance of a host or service, these can be different managers that are notified and different time periods after which the notification is sent. Nagios can also be used to notify emergency response teams, so that if a problem is not fixed in a short period of time, they will assist in recovering from the potential after effects of this problem.

There are cases when you want Nagios to perform one or more actions if a service starts or stops malfunctioning. For instance, you might have a web server check set up to retry five times before a failure becomes a hard state for Nagios. In such a case, you can also configure Nagios to try restarting the web server after the third soft failure. If the restart fails, the service will move to a hard state after the next two failures. If the restart succeeds, a hard state will not even be recorded and only a soft failure will be logged.

Nagios can also integrate with other applications, which can send commands to Nagios directly and report the status of host or service checks. The Nagios web interface uses this command mechanism, but you can just as well use it from your own applications or from event handlers for various objects.

Effective Notifications

This section covers notifications in depth and describes the details of how Nagios can tell other people about what is happening. We will discuss a simple approach, as well as a more complex approach on how notifications can make your life easier.

Probably, most people already know that a plain email notification about a problem may not always be the right thing to do. As people's inboxes get cluttered with emails, the usual approach is to create rules to move certain messages that they don't even look at to separate folders. There's a pretty good chance that if people start getting a lot of notifications that they won't need to react to, they'll simply ask their favorite mailing program to move these messages into a 'do not look in here unless you have plenty of time' folder. Moreover, in such cases, if there is an issue they should be handling, they will most probably not even see the notification email.

This section talks about the things that can be implemented in your company to make notifications more convenient for the IT staff. Limiting the amount of irrelevant information sent to various people tends to shorten their response time, as they will have much less information to filter out.

At this point, it's worth mentioning another easy solution that most people do not use, even though it offers a very flexible setup: creating multiple contacts for a single person. For example, you can set up different contacts for when you're at work and when you're offline, and define a profile that does not disturb you too much during the night.

Another thing that many Nagios administrators overlook is the ability to use more than one notification command. In this way, Nagios can try to notify you on both instant messaging (such as Jabber, Gtalk, MSN, or Yahoo) and email. It can also send you an SMS. A disadvantage is that at some point, you might end up receiving SMSes at 2 AM about an outage of a machine that may well be down for the next 3 days and is not critical.

For example, you can set up the following contacts to handle various times of the day in a different fashion:

  • jdoe-workhours would be a contact that will only receive notifications during working hours; notifications will be carried out using both the corporate IM system and an email
  • jdoe-daytime would be a contact that will only receive notifications between 7 AM and 10 PM, excluding working hours; notifications will be sent as a text or a pager message, and an email
  • jdoe-night would be a contact that will only receive notifications between 10 PM and 7 AM; notifications will only be sent out as an email

All entries would also contain contactgroups pointing to the same groups that the single jdoe contact entry used to contain. This way, the other objects such as hosts, services, or contact groups related to this user would not be affected. All entries would also reside in the same file; for example, contacts/jdoe.cfg.
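
As a rough illustration, a contacts/jdoe.cfg file along these lines could look like the sketch below. Everything in it apart from the three contact names is an assumption made for the example: the admins contact group, the workhours time period (taken to be 09:00-17:00 on weekdays, as in the sample Nagios configuration), and the notify-*-by-im and notify-*-by-sms commands, which are hypothetical names for commands you would have to define yourself.

```
# contacts/jdoe.cfg (illustrative sketch, not a drop-in configuration)
# Assumed to be defined elsewhere: the admins contact group, the workhours
# time period, and the notify-*-by-im / notify-*-by-sms commands.

# Shared settings, inherited by the three contacts below
define contact
{
name jdoe-base
alias John Doe
contactgroups admins
host_notifications_enabled 1
service_notifications_enabled 1
host_notification_options d,u,r
service_notification_options w,u,c,r
email jdoe@example.com
register 0
}

define contact
{
use jdoe-base
contact_name jdoe-workhours
host_notification_period workhours
service_notification_period workhours
host_notification_commands notify-host-by-im,notify-host-by-email
service_notification_commands notify-service-by-im,notify-service-by-email
}

define contact
{
use jdoe-base
contact_name jdoe-daytime
host_notification_period jdoe-daytime
service_notification_period jdoe-daytime
host_notification_commands notify-host-by-sms,notify-host-by-email
service_notification_commands notify-service-by-sms,notify-service-by-email
}

define contact
{
use jdoe-base
contact_name jdoe-night
host_notification_period jdoe-night
service_notification_period jdoe-night
host_notification_commands notify-host-by-email
service_notification_commands notify-service-by-email
}

# 7 AM to 10 PM, minus the assumed 09:00-17:00 weekday working hours
define timeperiod
{
timeperiod_name jdoe-daytime
alias Daytime outside working hours
monday 07:00-09:00,17:00-22:00
tuesday 07:00-09:00,17:00-22:00
wednesday 07:00-09:00,17:00-22:00
thursday 07:00-09:00,17:00-22:00
friday 07:00-09:00,17:00-22:00
saturday 07:00-22:00
sunday 07:00-22:00
}

# 10 PM to 7 AM; a range cannot cross midnight, so each day lists two ranges
define timeperiod
{
timeperiod_name jdoe-night
alias Night
monday 00:00-07:00,22:00-24:00
tuesday 00:00-07:00,22:00-24:00
wednesday 00:00-07:00,22:00-24:00
thursday 00:00-07:00,22:00-24:00
friday 00:00-07:00,22:00-24:00
saturday 00:00-07:00,22:00-24:00
sunday 00:00-07:00,22:00-24:00
}
```

The jdoe-base template is only a convenience to avoid repeating the shared directives; the same result can be achieved by spelling out every directive in each of the three contacts.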

The main drawback of this approach is that logging on to the web interface would require using one of the users above or keeping the jdoe contact without any notifications, just to be able to log on to the interface.

The example above combined both the creation of multiple contacts and use of multiple notification commands to achieve a convenient way of getting notified about a problem. Using only multiple contacts also works fine. Another approach to the problem is to define different contacts for different ways of being notified—for example, jdoe-email, jdoe-sms, and jdoe-jabber. This way, you can define different contact methods for various time periods—instant messages during working hours, SMSes while on duty, and an email when not at work.

Another important issue is to make sure that as few people as possible are notified of the problem. Imagine there is a host without an explicit administrator assigned to it. A notification about a problem gets sent out to 20 different people. In such a case, either each of them will assume that someone else will resolve the problem, or people will run into a communication problem over discussing who will actually handle it.

Teams that cooperate tightly with each other usually solve these issues naturally—knowledgeable people start discussing a solution and a natural person to solve the issue comes out of the discussion. However, the teams that are distributed across various locations or that have poor communication skills will run into problems in such cases.

This is why it is a good idea to either nominate a coordinator who will assign tasks as they arise, or try to maintain a short list of people responsible for each machine. If you need to make sure that other people will investigate the problem when the original owner of the machine cannot do it immediately, then it is a good idea to use escalations for this purpose. These are described later in this article.

Previously, we mentioned that notifications only via email may not always be the best thing to do. For example, they don't work well for situations that require fast response times. There are various reasons behind this. Firstly, emails are slow—even though the email lands on your mail server in a few seconds, people usually only poll their emails every few minutes. Secondly, people tend to filter emails and skip those that they are not interested in.

Another good reason why emails should not always be used is that they stay on your email account until you actually fetch and read them. If you have been on a 2-week vacation and a problem has occurred, should you still be worried when you read it after you get back? Has the issue been resolved already?

If your team needs to react to problems promptly, using email as the basic notification method is definitely not the best choice. Let's consider what other possibilities exist to notify users of a problem effectively.

As already mentioned, a very good choice is to use instant messaging or SMS (Short Message Service) messages as the basic means of notification, and only use email as a last resort. Some companies might also use a client-server approach to notify users of problems, perhaps integrated with showing the Nagios status only for particular hosts and services. NagiosExchange has plenty of available solutions you can use for handling notifications effectively.

The first and the most powerful option is to use Jabber for notifications. There is an existing script for this that is available in the contributions repository on the Nagios project website. This is a small Perl script that sends messages over Jabber. You may need to install additional system packages to handle Jabber connectivity from Perl. On Ubuntu, this requires running the following command:

root@ubuntu1:~# apt-get install libnet-jabber-perl

If you are using CPAN to install Perl packages, then simply run the following command:

root@ubuntu1:~# cpan Net::Jabber

In order to use the notification plugin, you will need to customize the script—change the SERVER, PORT, USER, and PASSWORD parameters to an existing account. Our recommendation is to create a separate account to use only for Nagios notifications—you will need to set up authorization for each user that you want to send notifications to.

As you plan to monitor servers and potentially even outgoing Internet connectivity, it would not be wise to use public Jabber servers for reporting errors. Therefore, it would be a good idea to set up a private Jabber server, probably on the same host on which the Nagios monitoring system is running.

If you want a more comfortable setup, you can also use Tkabber as a Jabber client, and write a plugin that reads the object cache and the current status from the Nagios host and shows an up-to-date report for the hosts that you own. Information on reading Nagios output can be found on my Tclmentor blog.

Another possibility is to send messages over the SMB/CIFS protocol. This way, you can send messages directly to people's computers, assuming they are running the Microsoft Windows operating system. It is also possible to receive such messages on UNIX machines using the Samba package. Sending the messages requires having the smbclient command installed. On Ubuntu, this requires running the following command:

root@ubuntu1:~# apt-get install smbclient

A simple command definition example that uses smbclient directly to send messages to the specified host name is as follows:

define command
{
command_name notify_host_via_smbclient
command_line printf "Host notification: $NOTIFICATIONTYPE$\n\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$" | smbclient -M $_CONTACTSMBHOSTNAME$
}

The preceding example uses the $_CONTACTSMBHOSTNAME$ macro definition. It maps to the _SMBHOSTNAME custom variable defined for a specified contact. In order for Windows XP and 2003 to show the messages from other users correctly, you will need to enable the Messenger service. This can be done by running the following command as the system administrator, or as a user with administrator privileges:

C> net start Messenger
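
To illustrate where the _SMBHOSTNAME custom variable might come from, the sketch below attaches it to a contact. The contact name, the JDOE-PC workstation name, the 24x7 time period, and the notify-service-by-email command are assumptions made for this example; only notify_host_via_smbclient comes from the command definition shown earlier.

```
# Illustrative sketch: jdoe, JDOE-PC, and 24x7 are assumed names
define contact
{
contact_name jdoe
alias John Doe
host_notifications_enabled 1
service_notifications_enabled 1
host_notification_period 24x7
service_notification_period 24x7
host_notification_options d,u,r
service_notification_options w,u,c,r
host_notification_commands notify_host_via_smbclient
# assumed to be defined elsewhere, as in the sample Nagios configuration
service_notification_commands notify-service-by-email
email jdoe@example.com
# custom variable; Nagios exposes it as the $_CONTACTSMBHOSTNAME$ macro
_SMBHOSTNAME JDOE-PC
}
```

Because the workstation name lives on the contact rather than in the command, the same notification command can serve any number of contacts, each with its own _SMBHOSTNAME value.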

Another way to communicate problems to users is to use text messages, also known as SMS. This is a very sensitive issue because, if your system is not properly configured, it can send a message in the middle of the night about a noncritical thing that can be fixed within the next 5 working days.

There is a very useful package for sending SMS messages called SMSServerTools. It allows the configuration of email and web gateways, as well as sending text messages over dedicated GSM (Global System for Mobile Communications) terminals. The tool can queue text messages, so that it can handle a larger number of messages and send them by the appropriate means.

GSM terminals work in a manner similar to a typical mobile phone. They use a standard SIM card and have a normal GSM phone module that is used to send SMS messages. Terminals are usually connected via a serial port or USB connection. Your server can then send messages by sending commands to the terminal. GSM terminals use the same command convention as phone modems, although each model uses a different set of commands. For information on how to send SMS messages over a particular terminal, please refer to the terminal's user manual.

Current mobile phones also offer cheap Internet connectivity, and smart devices offer the possibility to write custom applications in Java, .NET, and many other languages including Python and Tcl. Therefore, you can also make a client-server application that queries the server for the status of selected hosts and services. It can even be unified with a notification command that pushes the changes down to the application immediately.

These are only a few of the possibilities that you can use to communicate problems more effectively.

Other possibilities include a ready-to-use client-server application (visit http://www.nagiosexchange.org/Notifications.35.0.html?tx_netnagext_pi1[p_view]=182) that allows the sending of notifications directly to people's desktop machines. One interesting notification command allows you to choose other commands to use based on user availability on Jabber: it sends messages over Jabber if the user is available and uses SMSes or emails otherwise (visit http://www.nagiosexchange.org/Notifications.35.0.html?&tx_netnagext_pi1[p_view]=1036).

There are also tools to send messages to ICQ users and ones that use VoIP technology to provide you with predefined wave messages or output from a speech synthesis system.

Escalations

A common obstacle to resolving problems is that a host or a service may have blurred ownership. Often, there is no single person responsible for a host or service, which makes things harder. It is also typical to have a service with subtle dependencies on other things, which by themselves are small enough not to be monitored by Nagios. In such a case, it is good to include lower management in the escalations so that they are able to focus on problems that haven't been resolved in a timely manner.

Here is a good example: a database server might fail because a small Perl script, which is run prior to the actual start and cleans things up, has entered an infinite loop. The owner of this machine gets notified. But the question is, who should be fixing it? Should it be the script owner? Or perhaps the database administrator? In IT reality, this often ends up with the ball being thrown back and forth between teams without anything being solved.

In such cases, escalations are a great way to get the problem solved. In the previous example, if the problem has not been resolved after two hours, the IT team coordinator or manager would be notified. Another hour later, he would get another email. At that point, he would schedule an urgent meeting with the developer who owns the script, and the database admin, to discuss how this could be solved.

Of course, in real-world scenarios, escalating to management alone would not solve all problems. However, often, situations need a coordinator that will take care of communicating issues between teams and trying to find a company-wide solution. Business-critical services also require much higher attention. In such cases, it is a real benefit for the company if it has an escalation ladder that can be followed for all major problems.

Nagios offers many ways to set up escalations, depending on your needs. Escalations do not need to be sent out just after a problem occurs—that would create confusion and prevent smaller problems from being solved. Usually, escalations are set up so that additional people are informed only if a problem has not been resolved after a certain amount of time.

From a configuration point of view, all escalations are defined as separate objects. There are two types of objects: hostescalation and serviceescalation. Escalations are configured so that they start and stop being active along with the normal host or service notifications. This way, if you change the notification_interval directive in the host or service definition, the times at which escalations start and stop will also change.

A sample escalation for the company's main router is as follows:

define hostescalation
{
host_name mainrouter
contact_groups it-management
first_notification 2
last_notification 0
notification_interval 60
escalation_options d,u,r
}

The following table describes all available directives for defining a host escalation. Items marked (required) must be specified for every escalation.

Option                            Description
host_name (required)              Host names that the escalation is defined for, separated by commas
hostgroup_name                    Host group names; the escalation is defined for all members of these groups; separated by commas
contacts                          List of all contacts that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation
contact_groups                    List of all contact groups that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation
first_notification (required)     Number of the first notification for which this escalation is active; see the description below
last_notification (required)      Number of the notification after which this escalation stops being active; setting this to 0 causes notifications to be sent until the host recovers from the problem; see the description below
notification_interval (required)  Number of minutes between notifications related to this escalation
escalation_period                 Time period during which the escalation is valid; if not specified, defaults to 24 hours a day, 7 days a week
escalation_options                Notification types for host states that should be sent, separated by commas; one or more of the following:
                                  d - host DOWN state
                                  u - host UNREACHABLE state
                                  r - host recovery (UP state)

Service escalations are defined in a very similar way to host escalations. You can specify one or more hosts or host groups, as well as a single service description. Service escalation will be associated with this service on all hosts mentioned in the host_name and hostgroup_name attributes.

The following is an example of a service escalation for an OpenVPN check on the company's main router:

define serviceescalation
{
host_name mainrouter
service_description OpenVPN
contact_groups it-management
first_notification 2
last_notification 0
notification_interval 60
escalation_options w,c,r
}

The following table describes all available directives for defining a service escalation. Items marked (required) must be specified for every escalation.

Option                            Description
host_name (required)              Host names that the escalation is defined for, separated by commas
hostgroup_name                    Host group names; the escalation is defined for all members of these groups; separated by commas
service_description (required)    Service for which the escalation is set up
contacts                          List of all contacts that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation
contact_groups                    List of all contact groups that should receive notifications related to this escalation, separated by commas; at least one contact or contact group needs to be specified for each escalation
first_notification (required)     Number of the first notification for which this escalation is active; see the description below
last_notification (required)      Number of the notification after which this escalation stops being active; setting this to 0 causes notifications to be sent until the service recovers from the problem; see the description below
notification_interval (required)  Number of minutes between notifications related to this escalation
escalation_period                 Time period during which the escalation is valid; if not specified, defaults to 24 hours a day, 7 days a week
escalation_options                Notification types for service states that should be sent, separated by commas; one or more of the following:
                                  r - service recovery (OK state)
                                  w - service WARNING state
                                  c - service CRITICAL state
                                  u - service UNKNOWN state

Let's consider the following configuration—a service along with two escalations:

define service
{
use generic-service
host_name mainrouter
service_description OpenVPN
check_command check_openvpn_remote
check_interval 15
max_check_attempts 3
notification_interval 30
notification_period 24x7
}
# Escalation 1
define serviceescalation
{
host_name mainrouter
service_description OpenVPN
first_notification 4
last_notification 8
contact_groups it-escalation1
escalation_period 24x7
notification_interval 15
escalation_options w,u,c
}
# Escalation 2
define serviceescalation
{
host_name mainrouter
service_description OpenVPN
first_notification 8
last_notification 0
contact_groups it-escalation2
escalation_period 24x7
notification_interval 120
}

In order to show how the escalations work, let's take an example—a failing service. A service fails for a total of 16 hours and then recovers—for the clarity of the example, we'll skip the soft and hard states and the timing required for hard state transitions.

Service notifications are set up so that the first notification is sent out 30 minutes after the failure. They are then repeated every 60 minutes, so the next notification is sent 1.5 hours after the actual failure, and so on. The service also has two escalations defined for it.

Escalation 1 is first triggered along with the fourth service notification that is sent out. The escalation stops being active after the eighth service notification about the failure. It only sends out reports about problems, not recovery. The interval for this escalation is configured to be 15 minutes.

Escalation 2 is first triggered along with the eighth service notification and never stops, because the last_notification directive is set to 0. It sends out reports about problems and recovery. The interval for this escalation is configured to be 2 hours.

[Diagram: timeline of the service notifications and of the notifications sent for Escalation 1 and Escalation 2]

The diagram above shows when both escalations are sent out. Notifications for the service itself are sent out 0.5, 1.5, 2.5, 3.5 … hours after the occurrence of the initial service failure.

Escalation 1 becomes active after 3.5 hours—which is when the fourth service notification is sent out. The last notification related to escalation 1 is sent out 7.5 hours after the initial failure—this is the time when the eighth service notification is sent out. It is sent every 30 minutes; so a total of nine notifications related to escalation 1 are sent out.

Escalation 2 becomes active after 7.5 hours – which is when the eighth service notification is sent out. The last notification related to escalation 2 is sent out when the problem is resolved, and concerns the actual problem resolution. It is sent every two hours, so a total of four notifications related to escalation 2 are sent out.

Escalations can be defined to be independent of each other—there is no reason why Escalation 2 cannot start after the sixth service notification is sent out. There are also no limits on the number of escalations that can be set up for a single host or service.
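
As a sketch of that variation, Escalation 2 could be defined to start with the sixth service notification, so that both escalations are active for notifications six through eight. The values below reuse the earlier example; only first_notification differs.

```
# Escalation 2, starting earlier so that it overlaps with Escalation 1
define serviceescalation
{
host_name mainrouter
service_description OpenVPN
first_notification 6
last_notification 0
contact_groups it-escalation2
escalation_period 24x7
notification_interval 120
}
```

While more than one escalation is valid for a notification, Nagios uses the smallest notification_interval among them, so it is worth checking how overlapping ranges affect how often people are contacted.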

The main point is that escalations should be defined reasonably, so that they don't burden management or other teams with problems that would be solved without their involvement anyway.

Escalations can also be used to contact different people for a certain set of objects, based on time periods. If an escalation has the first_notification set to 1 and the last_notification set to 0, then all notifications related to this escalation will be sent out exactly in the same way as notifications for the service itself.

For example, the regular IT staff may handle problems most of the time, while during holidays notifications about problems should also go to the CritSit team. In that case, you can simply define an escalation stating that, during the holidays time period, the CritSit group should also be notified about problems as soon as the first notification is sent out. The following is an example that is based on the OpenVPN service defined earlier:

define serviceescalation
{
host_name mainrouter
service_description OpenVPN
first_notification 1
last_notification 0
contact_groups CritSit
escalation_period holidays
notification_interval 30
escalation_options w,c,r
}

The escalation above applies to the OpenVPN service defined earlier. Please note that the notification_interval is set to the same value (30) in both the service and the escalation.
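
The holidays time period used by this escalation has to be defined as well. A minimal sketch, assuming fixed calendar dates, could look as follows; the alias and the dates are placeholders and should be replaced with your company's actual holidays.

```
# Illustrative holidays time period; the dates are placeholders
define timeperiod
{
timeperiod_name holidays
alias Company holidays
january 1 00:00-24:00
december 25 00:00-24:00
december 26 00:00-24:00
}
```

Days that are not listed contain no valid time ranges, so the escalation is only in effect on the dates named in the definition.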

Summary

Nagios offers several ways to let people know that something is wrong. Notifications can range from simple emails to a complex system that deals with multiple ways to communicate problems, as well as the ability to choose the appropriate way dynamically. This reduces the amount of email that people have to deal with, and helps in resolving issues much more effectively.

Nagios also has a very powerful mechanism for escalating problems. When set up correctly, this is a very useful tool which will aid in complex problem resolution. In the case of larger problems, it can also be used to communicate problems properly so that a continuity plan can be put in place to prevent long outages to critical services.

Escalations also have all of the benefits of normal notifications—they can also be sent out in any way you might think of, and people will have the same power to set it up conveniently for themselves.

Thus, we have covered the following in the first half of this series:

  • Effective Notifications
  • Escalations

In the next part of this series, we will cover the following:

  • External Commands
  • Event Handlers
  • Modifying Notifications
  • Adaptive Monitoring

 


About the Author :


Wojciech Kocjan

Wojciech Kocjan is a system administrator and programmer with 10 years of experience. His work experience includes several years of using Nagios for enterprise IT infrastructure monitoring. He also has experience with a large variety of devices and servers, including routers; Linux, Solaris, and AIX servers; and i5/OS mainframes. His programming experience includes multiple languages (such as Java, Ruby, Python, and Perl) and focuses on web applications as well as client-server solutions.
