Learning Nagios 3.0

Chapter 1. Introduction

Imagine you're working as an administrator of a large IT infrastructure. You have just started receiving emails that a web application has stopped working. When you try to access the same page, it just doesn't load. What are the possibilities? Is it the router? Or the firewall? Perhaps the machine hosting the page is down? Before you even start thinking rationally about what is to be done, your boss calls about the critical situation and demands an explanation. In a panic, you'll probably start plugging things in and out of the network, rebooting the machine, and so on, none of which helps.

After hours of nervously digging into the issue, you finally find the cause: the web server was working properly, but was timing out on communication with the database server. This was because the machine with the database was not getting a correct IP address, as yet another box had run out of memory and the Dynamic Host Configuration Protocol (DHCP) server on it had stopped working. Imagine how much time it would take to find all that out manually. It would be a nightmare if the database server was in another branch of the company, in a different time zone, and perhaps the people over there were still sleeping.

And what if you had Nagios up and running across your entire company? You would just need to go to the web interface and see that there are no problems with the web server and the machine it is running on. There would also be a list of what's wrong: the machine serving IP addresses to the entire company is not doing its job, and the database is down. If the setup also monitored the DHCP server, you would get a warning email that very little swap memory is available on it, or that too many processes are running. Maybe it would even have an event handler for such cases to just kill or restart noncritical processes. Nagios would also try to restart the DHCP server process over the network in case it is down.

In the worst case, Nagios would cut hours of investigation down to 10 minutes. In the best case, you would just get an email that there was a problem, followed by another one saying that the problem is already fixed. You would just disable a few services, increase the swap size for the DHCP machine, and solve the problem once and for all. And nobody would even notice there was a problem.

Introduction to Nagios

According to Wikipedia (http://en.wikipedia.org/wiki/System_Monitoring), Nagios is a tool for system monitoring. This means that it constantly checks the status of machines and various services on those machines. The main purpose of system monitoring is to detect and report on any system that is not working properly as soon as possible, so that you are aware of the problem before the user runs into it.

Nagios does not perform any host or service checks on its own. It uses plugins to perform the actual checks. This makes it a very modular and flexible solution for performing machine and service checks.

Objects monitored by Nagios are split into two categories: hosts and services. Hosts are physical machines (servers, routers, workstations, printers and so on), while services are particular functionalities. For example, a web server (an httpd process on a machine) can be defined as a service to be monitored. Each service is associated with the host it is running on. In addition, hosts and services can be grouped into host groups and service groups, respectively. We will look into the details of each of these types of objects in the next section.

Nagios has two major strengths when it comes to monitoring. First of all, instead of exposing raw monitoring values, it uses only four states to describe status: OK, WARNING, CRITICAL, and UNKNOWN. Offering only abstract states allows administrators to ignore the raw values and just decide what the warning and critical limits are. If you are monitoring a numeric value such as the amount of disk space or CPU usage, you can define thresholds for the values that are considered correct, a warning, or a failure. Having a strict limit to watch out for is much better, as you always catch a problem regardless of whether it moves from the warning limit to the critical limit in 15 minutes or in a week. This matters because system administrators tend to ignore things such as a slow decline in storage space, often until a critical process runs out of disk space.
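The threshold idea can be sketched in a few lines of Python. This is only an illustration, not how any particular plugin is implemented: the function name classify and the 80/90 percent limits are made up for the example, and the metric is assumed to be one where higher values are worse.

```python
# Hypothetical helper: map a measured value onto the four Nagios states.
OK, WARNING, CRITICAL, UNKNOWN = "OK", "WARNING", "CRITICAL", "UNKNOWN"

def classify(value, warning, critical):
    """Return a state for a metric where higher values are worse."""
    if value is None:
        return UNKNOWN          # the measurement could not be taken
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK

# Disk usage in percent, with illustrative limits of 80% and 90%.
for used in (42, 85, 97, None):
    print(used, classify(used, warning=80, critical=90))
```

Whether the value creeps up over a week or jumps in minutes, it is reported as soon as it crosses a limit.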

Another benefit is that a report states the number of services that are up and running, as well as the number in the warning and critical states. Such a report offers a good overview of your infrastructure status. Nagios also offers similar reports for host groups and service groups, showing, for example, when a critical service or database server is down. Such a report can also help prioritize what needs to be dealt with first, and which problems can be handled later.

Nagios performs all of its checks using plugins. These are external components to which Nagios passes information on what should be checked and what the warning and critical limits are. Plugins are responsible for doing the checks and analyzing the results. The output from such a check is a status (OK, WARNING, CRITICAL, or UNKNOWN) and additional text providing information on the service in detail. This text is primarily intended for system administrators to be able to read a detailed status of a service.
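As a rough sketch of what such a plugin might look like, the following Python code checks disk usage and produces a status plus a line of text. Plugins report their state through the process exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function name check_disk, its parameters, and the wording of the output line are assumptions made for this example.

```python
import shutil

# Exit codes that Nagios plugins use to report a state.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def check_disk(path, warn_pct, crit_pct):
    """Return (exit_code, status_text) for the disk holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        # The check itself failed, so the state is UNKNOWN.
        return 3, f"DISK UNKNOWN - cannot read {path}: {err}"
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= crit_pct:
        code = 2
    elif used_pct >= warn_pct:
        code = 1
    else:
        code = 0
    return code, f"DISK {STATES[code]} - {used_pct:.1f}% used on {path}"

code, text = check_disk("/", warn_pct=80, crit_pct=90)
print(text)
# A real plugin would finish with sys.exit(code) so Nagios sees the state.
```

The limits arrive as parameters, which mirrors the way Nagios passes the warning and critical limits to a plugin rather than having them hard-coded.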

Nagios not only offers a core system for monitoring, but also offers a set of standard plugins in a separate package (see http://nagiosplugins.org/ for more details). These plugins allow checks for almost all of the services your company might have. Refer to Chapter 4, Overview of Nagios Plugins, for detailed information on plugins that are developed along with Nagios. If you need to perform a specific check (for example, to connect to a web service and invoke methods), it is very easy to write your own plugins. And that's not all: they can be written in any language, and it takes less than a quarter of the time it takes to write a complete check command! Chapter 11, Extending Nagios, talks about this in more detail.

Benefits of Monitoring Resources

There are many reasons why you should make sure that all of your resources are working as expected. If you're still not convinced after reading the introduction to this chapter, here are a few key reasons why it is important to monitor your infrastructure.

The main advantage is the improvement in quality. If your IT staff can notice failures more quickly, they will also be able to respond to them much faster. Sometimes, it takes hours or days to get the first report of a failure even if many users are bumping into errors. Nagios will make sure that if something is not working, you know about it.

It is also possible to make Nagios perform recovery actions automatically. This is done using event handlers. These are commands that are run after the status of a host or service has changed. This way, when a primary router is down, Nagios can switch to a backup solution until the primary one is fixed. A typical case would be to start a dial-up connection as a fallback in case a VPN is down.
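The decision an event handler makes can be sketched as follows. This is a hypothetical example: the function choose_action and the action names are invented, and the two arguments are assumed to carry the values of the $SERVICESTATE$ and $SERVICESTATETYPE$ macros that Nagios can pass on the handler's command line.

```python
def choose_action(state, state_type):
    """Decide what a hypothetical fallback event handler should do.

    `state` and `state_type` mirror the $SERVICESTATE$ and
    $SERVICESTATETYPE$ values passed in by Nagios.
    """
    if state == "CRITICAL" and state_type == "HARD":
        return "start-backup-link"      # problem confirmed: fail over
    if state == "OK" and state_type == "HARD":
        return "stop-backup-link"       # primary is back: undo the failover
    return "none"                       # not confirmed yet: do nothing

print(choose_action("CRITICAL", "SOFT"))
print(choose_action("CRITICAL", "HARD"))
print(choose_action("OK", "HARD"))
```

Acting only on confirmed (hard) changes is a design choice for this sketch; a handler could equally well react earlier, for example by restarting a process on the first failed check.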

Another advantage is much better problem determination. Very often, what the users report as a failure is far from the root cause of the problem. For example, an email system may be down because the LDAP service is not working correctly. If you define dependencies between hosts correctly, Nagios will point out that the POP3 email server is assumed to be not working because the LDAP service, which it depends upon, has a problem. Nagios will start checking the email server again as soon as the problem with LDAP has been resolved.

Nagios is also very flexible when it comes to notifying people about what isn't functioning correctly. You can set it up to send emails to different people depending on what is not functioning properly. In most cases, your company has a large IT team or multiple teams. Usually you want some people to handle servers, and others to handle network switches, routers, and modems. You can even use Nagios' web interface to manage who is working on what issue. You can also configure how Nagios sends notifications: via email, pager, Jabber, MSN, or your own scripts.

Monitoring resources is not only useful for identifying problems; it can also save you from running into them. Nagios handles warnings and critical situations differently. This means that it's possible to recognize potentially problematic situations quickly. For example, if your disk storage on an email server is running out, it's better to be aware of this situation before it becomes a critical issue.

Monitoring can also be set up on multiple machines across various locations that can communicate all their results to a central Nagios server. This way, information on all hosts and services in your system can be accessed from a single machine. This gives you a more complete picture of your IT infrastructure, and also allows for testing of more complex things such as firewalls.

Main Features

Nagios' main strength is its flexibility — it can be configured to monitor your IT infrastructure in the way you want. It also has a mechanism to automatically react to problems, and a powerful notification system. All of this is based on a clear object definition system and on a few object types:

1. Commands are definitions of how Nagios should perform particular types of checks; they are an abstraction layer on top of the actual plugins that allow you to group similar types of operations.
2. Time periods are date and time spans within which an operation should or should not be performed; for example: Monday to Friday between 09:00 and 17:00.
3. Contacts and contact groups are people who should be notified, along with information on how and when they should be contacted. Contacts can be grouped and a single contact can be a member of more than one group.
4. Hosts are physical machines, along with information on who should be contacted, how checks should be performed, and when. Hosts can be grouped into host groups; each host may be a member of more than one host group.
5. Services are various functionalities or resources to monitor on a specific host, along with information on who should be contacted, how the checks should be performed, and when. Services can be grouped into service groups; each service may be a member of more than one service group.
6. Host and service escalations define the specific time period after which additional people should be notified of certain events — for example a critical server being down for more than 4 hours should alert IT management so that they start tracking the issue. These people are defined in addition to the normal notifications configured in the host and service objects.

An important benefit that you will gain by using Nagios is a mature dependency system. For any administrator, it is obvious that if your router is down, all machines accessed through it will fail. Some systems don't take that into account and, in such a case, you would get a list of several failing machines and services. Nagios allows you to define dependencies between hosts to reflect your actual network topology. For example, if a switch that connects you to a router is down, Nagios will not perform any checks on the router or on the machines that are dependent on the router. This is illustrated in the following example:

You can also define that one particular service depends on another service, either on the same host or on a different host. If one of the services is down, a check for a service that depends on it is not performed. For example, for your company's intranet application to function properly, both an underlying web server and a database server must be running. So, if a database service is not working properly, Nagios will not perform checks on your application. The database server might be on the same host or on a different host. In such a case, if the machine is down or not accessible, notifications for all services dependent on the database service will not be sent out either.
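The effect of host dependencies on checking can be sketched roughly as below. This is a simplified model, not Nagios' actual scheduling logic; the host names, the PARENTS table, and both function names are hypothetical.

```python
# Hypothetical topology: each host names the parent it is reached through.
PARENTS = {
    "switch1": None,
    "router1": "switch1",
    "web1": "router1",
    "db1": "router1",
}

def first_failed_ancestor(host, down):
    """Return the nearest ancestor of `host` that is down, or None."""
    parent = PARENTS.get(host)
    while parent is not None:
        if parent in down:
            return parent
        parent = PARENTS.get(parent)
    return None

def should_check(host, down):
    """A host is only worth checking if everything above it is up."""
    return first_failed_ancestor(host, down) is None

down = {"switch1"}
for host in ("switch1", "router1", "web1"):
    print(host, should_check(host, down))
```

With the switch down, only the switch itself is still checked, so the administrator sees one root cause instead of a list of unreachable machines.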

Nagios offers a consistent system of macro definitions. These are the variables that can be put into all object definitions, depending on what the context is. They can be put inside commands, and depending on host, service, and many other parameters, values are substituted accordingly. For example, a command definition might use the IP address of the host it is currently checking in all remote tests. This also makes it possible to put information such as the previous and current statuses of a service in a notification email. Nagios 3 also offers various extensions to macro definitions, which makes it an even more powerful mechanism. This is described in detail in the last section of this chapter.
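Macro substitution can be pictured with a small sketch. The function expand_macros is hypothetical and far simpler than what Nagios does internally; the address and threshold values are made up. $HOSTADDRESS$, $ARG1$, and $ARG2$ are standard Nagios macro names.

```python
import re

def expand_macros(template, values):
    """Replace $NAME$ markers with entries from `values`.

    Markers with no matching entry are left untouched.
    """
    def replace(match):
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\$([A-Z0-9_]+)\$", replace, template)

command = "check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$"
values = {"HOSTADDRESS": "10.0.0.5", "ARG1": "100.0,20%", "ARG2": "500.0,60%"}
print(expand_macros(command, values))
```

The same command definition can therefore serve every host, because the host-specific parts are filled in at check time.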

Nagios also offers mechanisms for scheduling planned downtimes. You can mark a particular host or service as planned to be unavailable for a given period. This prevents Nagios from sending notifications about problems related to these objects while the downtime lasts. Nagios can also notify people of planned downtimes automatically. This is mainly used when maintenance of the IT infrastructure is to be carried out, and the servers and/or the services they provide are unavailable for a long time. This allows the creation of an integrated process of scheduling downtimes that will also handle informing the users.
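The way scheduled downtime suppresses notifications can be sketched as follows. This is a deliberately simplified model with hypothetical function names and dates; it ignores recovery notifications and the other notification filters.

```python
from datetime import datetime

def in_downtime(now, windows):
    """True if `now` falls inside any (start, end) downtime window."""
    return any(start <= now < end for start, end in windows)

def should_notify(state, now, windows):
    """Notify about problems only outside scheduled downtime."""
    return state != "OK" and not in_downtime(now, windows)

# Illustrative maintenance window: two hours on a Saturday night.
maintenance = [(datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 3, 0, 0))]
print(should_notify("CRITICAL", datetime(2024, 3, 2, 23, 0), maintenance))
print(should_notify("CRITICAL", datetime(2024, 3, 3, 1, 0), maintenance))
```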

Soft and Hard States

Nagios works by checking if a particular host or service is working correctly and storing its status. Because the status of a service is only one of the four possible values, it is crucial that it actually reflects what the current status is. In order to avoid detecting random and temporary problems, Nagios uses soft and hard states to describe what the current status of a host or service is.

Imagine that an administrator is restarting a web server and this operation makes connections to the web pages unavailable for five seconds. Such restarts are usually done at night to lower the number of users affected, so this is an acceptable period of time. However, a problem might arise when Nagios tries to connect to the server during the restart and notices that it is down. If it relied only on a single result, Nagios would trigger an alert that a web server is down. The server would actually be up and running again in a few seconds, but it could take a couple of minutes for Nagios to find that out.

To handle situations when a service is down for a very short time, or the test has temporarily failed, soft states were introduced. When the status of a check is unknown, or it is different from the previous one, Nagios will retest the host or service several times to make sure that the change is persistent. The number of checks is specified in the host or service configuration. Until those checks are complete, Nagios treats the new result as a soft state. After additional tests have verified that the new state is permanent, it is considered a hard state.

Each host and service definition specifies the number of retries to be performed before it can be assumed that a change is permanent. This allows more flexibility over how many failures should be treated as an actual problem instead of a temporary one. Setting the number of checks to one will cause all changes to be treated as hard instantly. The following is an illustration of soft and hard state changes, assuming that the number of checks to be performed is set to three:

This feature allows ignoring short outages of a service. It is also very useful for performing checks that can periodically fail even if everything is working correctly. Monitoring devices over SNMP is also an example where a single check might fail, but the check will eventually succeed during the second or third check.
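The soft and hard state logic can be approximated with a short simulation. This is a simplified model rather than Nagios' exact algorithm: it simply counts consecutive non-OK results, and the function name track_states and its max_attempts parameter are chosen for the example.

```python
def track_states(results, max_attempts=3):
    """Return (result, state_type, attempt) tuples for a series of checks."""
    history = []
    failures = 0                      # consecutive non-OK results so far
    for result in results:
        if result == "OK":
            # A recovery before the problem became hard is a soft recovery.
            state_type = "SOFT" if 0 < failures < max_attempts else "HARD"
            failures = 0
            attempt = 1
        else:
            failures = min(failures + 1, max_attempts)
            attempt = failures
            state_type = "HARD" if failures >= max_attempts else "SOFT"
        history.append((result, state_type, attempt))
    return history

checks = ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK", "CRITICAL", "OK"]
for entry in track_states(checks):
    print(entry)
```

In this run, the first outage only becomes a hard CRITICAL on its third consecutive failure, while the single failure near the end stays soft and is followed by a soft recovery. Calling track_states with max_attempts=1 makes every change hard immediately.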

What's New in Nagios 3.0?

Note

This section is primarily intended for people who are already familiar with Nagios functionality and want to know what has been added in the new version. If you are not experienced with Nagios, not all issues mentioned in this section may be clear to you.

The new Nagios version comes with a bunch of new functionality and fixes. However, this section covers only the most important ones. It is recommended that you view the complete Changelog file that comes with all distributions of Nagios, or the Nagios documentation.

Macro handling is one area where there have been numerous changes. The most important improvement you might notice is that 40 new macros have been added. A notable incompatible change is that the $NOTIFICATIONNUMBER$ macro has been removed in favor of the $HOSTNOTIFICATIONNUMBER$ and $SERVICENOTIFICATIONNUMBER$ macros. Moreover, macros are now exported as environment variables so that your scripts can access them easily. Because this can cause performance issues, there is also an option to disable the setting of environment variables for all Nagios macros.
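A script run by Nagios could read those environment variables along the following lines. The sketch assumes that macros appear in the environment under their macro name prefixed with NAGIOS_ (for example NAGIOS_HOSTNAME), and that the script may also be run by hand with nothing set; the helper names read_macro and describe are hypothetical.

```python
import os

def read_macro(name, environ=os.environ, default=""):
    """Look up a macro assumed to be exported as NAGIOS_<NAME>."""
    return environ.get("NAGIOS_" + name, default)

def describe(environ=os.environ):
    """Build a one-line summary from the exported macros."""
    host = read_macro("HOSTNAME", environ, "unknown host")
    state = read_macro("SERVICESTATE", environ, "UNKNOWN")
    return f"{host} is {state}"

# Simulated environment, as a notification script might see it.
print(describe({"NAGIOS_HOSTNAME": "web1", "NAGIOS_SERVICESTATE": "CRITICAL"}))
```

Falling back to defaults keeps the script usable when the environment variables have been disabled for performance reasons.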

There have been significant changes to how Nagios stores information. The main reason behind this is that Nagios now allows plugins to return multiple lines with information or performance data. This allows more detailed information about hosts and services to be stored. The format of Nagios information and retention files has changed to adapt to this functionality. In the previous format for storing status and many other files, each line was used to represent a single object. This information is now stored in a format similar to Nagios configuration. This requires changes for all applications that read service statuses directly from the Nagios files.

The previous versions of Nagios stored scheduled downtimes as well as host and service comments in separate files. Version 3.x introduced a retention file that stores various information related to hosts and services. This file now contains a list of scheduled downtimes and comments related to each item. It also allows the storage of more information on Nagios restarts, which would be useful when performing frequent restarts.

The embedded Perl interpreter has also changed a lot under the hood. You can now decide whether or not to use embedded Perl in the configuration, rather than only at compilation time, as was the case with 1.x and 2.x. In addition, individual plugins may force embedded Perl to be enabled or disabled for themselves. This is mainly useful if a few of your Perl plugins cause problems with embedded Perl and you don't want to lose its benefits for the other plugins that use it.

Now, it is also possible to change the frequency of monitoring hosts or services on the fly. This can be done by sending proper commands to the Nagios daemon. This functionality can be used to modify how often an object should be checked, as well as the time periods during which checks should be performed and notifications should be sent out.

A large improvement has been made in terms of host and service definitions. The major difference is that an object can now inherit from more than one object. This means that your host definition can inherit some attributes from one template and the remaining attributes from another. Services also inherit all contact and notification settings from the host they are running on, unless otherwise specified.
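Inheritance from several templates can be sketched as a merge of attribute dictionaries. The sketch assumes that when two templates define the same attribute, the template listed first wins, and that an object's own attributes override everything inherited; the object names and values are invented for the example.

```python
def resolve(name, objects):
    """Merge an object's own attributes with those of its templates."""
    obj = objects[name]
    merged = {}
    # Walk templates right to left so earlier ones overwrite later ones.
    for template in reversed(obj.get("use", [])):
        merged.update(resolve(template, objects))
    # The object's own settings take precedence over inherited ones.
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged

objects = {
    "generic-host": {"check_interval": 5, "max_check_attempts": 3},
    "linux-server": {"check_interval": 10, "notification_period": "24x7"},
    "web1": {"use": ["linux-server", "generic-host"], "address": "10.0.0.5"},
}
print(resolve("web1", objects))
```

Here web1 takes its check interval from linux-server, because that template is listed first, and picks up max_check_attempts from generic-host.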

Starting with Nagios 3, it is possible to specify group members by specifying other groups. All objects that are members of the specified groups will also be members of this group. For example, when defining a host group, it is possible to specify other host groups. All hosts that are members of such a group will also be members of the group being defined.
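Expanding such nested groups amounts to a recursive walk, sketched below. The group and host names are made up, and the dictionary keys members and hostgroup_members are modelled on the corresponding host group directives; the guard against groups that refer to each other is an addition for the sketch.

```python
def expand_group(name, groups, seen=None):
    """Return every host in a group, including hosts of nested groups."""
    seen = set() if seen is None else seen
    if name in seen:                  # guard against circular references
        return set()
    seen.add(name)
    group = groups[name]
    hosts = set(group.get("members", []))
    for sub in group.get("hostgroup_members", []):
        hosts |= expand_group(sub, groups, seen)
    return hosts

groups = {
    "web-servers": {"members": ["web1", "web2"]},
    "db-servers": {"members": ["db1"]},
    "all-servers": {"members": ["mail1"],
                    "hostgroup_members": ["web-servers", "db-servers"]},
}
print(sorted(expand_group("all-servers", groups)))
```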

Host checks have also been improved in various ways. Starting with Nagios 3, host checks are performed in parallel, which improves Nagios performance considerably. Host check retry handling has also been improved, and now uses the same logic as the service checks.

The dependency system has also been improved in Nagios 3. It is now much easier to configure complex dependencies. Dependencies between services on the same host can be defined by omitting the dependent host name. It is also possible to set up periods for which dependencies are valid. If nothing is specified, a dependency is valid all the time.

Nagios 3 also introduces support for pre-caching object information. This means that instead of reading all parameters from the configuration file and creating a dependency set for all types of objects, Nagios is able to save its internal representation of the data, so that the next time it starts, it reads the cache instead of re-analyzing the configuration.

Defining time periods is also more powerful now. It is possible to specify date exceptions. For example, combining a definition of Monday to Friday from 9 AM to 5 PM with national holidays will offer a more precise definition of working hours. It is also possible to use a skip date in the time period; for example, 'every 3 days'.
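The combination of a weekly schedule, date exceptions, and skip dates can be sketched as follows. This is only a model of the idea, not of the Nagios time period syntax; the function names, the holiday date, and the start date are all hypothetical.

```python
from datetime import date, datetime, time

WORK_HOURS = (time(9, 0), time(17, 0))
HOLIDAYS = {date(2024, 12, 25)}       # illustrative exception dates

def in_working_hours(moment, holidays=HOLIDAYS):
    """Monday to Friday, 09:00 to 17:00, except on listed holidays."""
    if moment.date() in holidays:
        return False                  # a date exception overrides the weekday rule
    if moment.weekday() >= 5:         # 5 and 6 are Saturday and Sunday
        return False
    start, end = WORK_HOURS
    return start <= moment.time() < end

def every_n_days(day, start, step=3):
    """True on `start` and on every `step`-th day after it."""
    return day >= start and (day - start).days % step == 0

print(in_working_hours(datetime(2024, 12, 24, 10, 0)))
print(in_working_hours(datetime(2024, 12, 25, 10, 0)))
print(every_n_days(date(2024, 1, 7), start=date(2024, 1, 1)))
```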

For a complete list of changes, please visit the Nagios 3.0 documentation website: http://nagios.sourceforge.net/docs/3_0/whatsnew.html.