Troubleshooting Nagios 3.0

In this article by Wojciech Kocjan, we will learn about troubleshooting Nagios 3.0, including the web interface, passive checks, SSH-based checks, and NRPE. The article lists common errors along with their solutions and a detailed explanation of each.

Troubleshooting Web Interface

There might be cases where accessing the Nagios URL shows an error instead of the welcome screen. This can happen for various reasons: for example, the web server has not started, the Nagios-related Apache configuration is incorrect, or the permissions on the Nagios directories are incorrect.

The first thing that we should check is whether Apache is working properly. We can manually run the check_http plugin from Nagios. If the web server is up and running, we should see something similar to what is shown here:

 # /opt/nagios/plugins/check_http -H 127.0.0.1
HTTP OK HTTP/1.1 200 OK - 296 bytes in 0.006 seconds

If Apache is not currently running, the plugin will report an error similar to the following:

# /opt/nagios/plugins/check_http -H 127.0.0.1
HTTP CRITICAL - Unable to open TCP socket

If it was stopped, start it by running /etc/init.d/apache2 start.

The next step is to check whether the http://127.0.0.1/nagios/ URL is working properly. We can use the same plugin for this. The -u argument specifies the exact URL path to request, and -a supplies the username and password to authenticate with, in the form <username>:<password>.

# /opt/nagios/plugins/check_http -H 127.0.0.1 \
    -u /nagios/ -a nagiosadmin:<yourpassword>
HTTP OK HTTP/1.1 200 OK - 979 bytes in 0.019 seconds

We can also check the actual CGI scripts by passing a URL to one of the scripts:

# /opt/nagios/plugins/check_http -H 127.0.0.1 \
    -u /nagios/cgi-bin/tac.cgi -a nagiosadmin:<yourpassword>
HTTP OK HTTP/1.1 200 OK - 979 bytes in 0.019 seconds

If any of these checks returns an HTTP code other than 200, the problem lies with that URL, and the code itself indicates what to look for.
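When these checks are scripted, it can help to pull the numeric code out of the plugin output. The following is a minimal sketch; the function name http_status_from_plugin is hypothetical, and it assumes the output contains an "HTTP/<version> <code>" token as in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print the three-digit HTTP status code found in one
# line of check_http output, or nothing if the line contains no status.
http_status_from_plugin() {
    # Match "HTTP/<version> <three digits>" and keep only the digits.
    printf '%s\n' "$1" | sed -n 's|.*HTTP/[0-9.]* \([0-9][0-9][0-9]\).*|\1|p'
}

# Example with the output line shown earlier in this section.
http_status_from_plugin "HTTP OK HTTP/1.1 200 OK - 979 bytes in 0.019 seconds"
```

An empty result means that the plugin never received an HTTP response, as in the Unable to open TCP socket case.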

If the code is 500, it means that Apache is not configured correctly. In such cases, the Apache error log contains useful information about any potential problems. On most systems, including Ubuntu Linux, the filename of the log is /var/log/apache2/error.log. An example entry in the error log could be:

[error] [client 127.0.0.1] need AuthName: /nagios/cgi-bin/tac.cgi

In this particular case, the problem is the missing AuthName directive for CGI scripts.

Internal errors can usually be resolved by making sure that the Nagios-related Apache configuration is correct.

If this does not help, it is worth checking other parts of the configuration, especially the ones related to virtual hosts and CGI configuration. Commenting out parts of the configuration can help in determining which parts of the configuration are causing problems.

Another possibility is that either the check for /nagios/ or the check for the /nagios/cgi-bin/tac.cgi URL returned code 404. This code means that the page was not found. In this case, please make sure that Apache is configured according to the previous steps.

If it is, then it's a good idea to enable more verbose logging to a custom file. The following Apache 2 directives can be added either to /etc/apache2/conf.d/nagios or to any other file in the Apache configuration:

LogFormat "%h %l %u \"%r\" %>s %b %{Host}i %f" debuglog
CustomLog /var/log/apache2/access-debug.log debuglog

The first entry defines a custom logging format that also logs exact paths to files. The second one enables logging with this format to a dedicated file. An example entry in such a log would be:

127.0.0.1 - - "GET /nagios/ HTTP/1.1" 404 481 127.0.0.1 /var/www/nagios

This log entry tells us that http://127.0.0.1/nagios/ was incorrectly mapped to the /var/www/nagios directory. In this case, the Alias directive describing the /nagios/ prefix is missing. Making sure that the actual configuration matches the one provided in the previous section will resolve this issue.
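To scan the whole debug log for such entries, a small awk filter can be used. This is only a sketch: list_not_found is a hypothetical name, and it assumes the debuglog format shown above with request paths and file names that contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: list requests that ended in 404 together with the
# file name Apache mapped them to. Reads the given files, or standard input.
list_not_found() {
    # With whitespace splitting: $5 is the requested path, $(NF-3) is the
    # status code and $NF is the file name logged by %f.
    awk '$(NF-3) == 404 { print $5 " -> " $NF }' "$@"
}

# Usage (path taken from the CustomLog directive above):
#   list_not_found /var/log/apache2/access-debug.log
```

For the sample entry above, the filter prints the request path /nagios/ followed by the directory /var/www/nagios, which makes the wrong mapping easy to spot.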

Another error that you can get is 403, which indicates that Apache was unable to access either CGI scripts in /opt/nagios/sbin, or Nagios static pages in /opt/nagios/share. In this case, you need to make sure that these directories are readable by the user Apache is running as.

The error might also be related to the directories above /opt/nagios or /opt. One of these might also be inaccessible to the user Apache is running as, which will also cause the same error to occur.

If you run into any other problems, it is best to start by making sure that the Nagios-related configuration matches the examples from the previous section. It is also a good idea to reduce the number of enabled features and virtual hosts in your Apache configuration.
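The status codes discussed in this section can be summarized as a small lookup. The sketch below is illustrative only; explain_http_status is a hypothetical name, and the hints simply restate the causes described above.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to the cause discussed above.
explain_http_status() {
    case "$1" in
        200) echo "OK: the page was served correctly" ;;
        403) echo "Forbidden: check that /opt/nagios/sbin, /opt/nagios/share and their parent directories are accessible to the Apache user" ;;
        404) echo "Not found: check the Alias directive for the /nagios/ prefix" ;;
        500) echo "Internal error: check the Apache error log, for example for a missing AuthName directive" ;;
        *)   echo "Unexpected status $1: compare the Apache configuration with the examples" ;;
    esac
}

explain_http_status 404
```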

Troubleshooting Passive Checks

It's not always possible to set up passive checks correctly the first time. In such cases, it is best to debug the issue one step at a time. Sometimes the problem is a configuration issue, while in other cases it is something as simple as a mistyped host or service name.

One thing worth checking is whether the Web UI shows changes after you have sent the passive check result. If it doesn't, then something along the way is not working correctly.

The first thing you should start with is enabling the logging of external commands and passive checks. To do this, make sure that the following values are enabled in the main Nagios configuration file:

log_external_commands=1
log_passive_checks=1

In order for the changes to take effect, a restart of the Nagios process is needed. After this has been done, Nagios will log all commands passed via the command pipe and log all of the passive check results it receives.

A common first problem is that an application or script cannot write data to the Nagios command pipe. In order to test this, simply change to the user your scripts are running as, and try the following command:

user@ubuntuserver:~$ echo "TEST" >/var/nagios/rw/nagios.cmd

If the command above runs without errors, your permissions are set up correctly. If an error shows up, you should add the user to the nagioscmd group.

The next thing to do is to manually send a passive check result to the Nagios command pipe and check in the Nagios log file whether it was received and parsed correctly. To test this, run the following command:

echo "[$(date +%s)] PROCESS_HOST_CHECK_RESULT;host1;2;test" \
    >/var/nagios/rw/nagios.cmd

The name, host1, needs to be replaced with an actual host name from your configuration. A few seconds after running this command, the Nagios log file should reflect the command that we have just sent. You should see the following lines in your log:

EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;host1;2;test
[1220257561] PASSIVE HOST CHECK: host1;2;test

If both of these lines are in your log file, then we can conclude that Nagios has received and parsed the command correctly.

If only the first line is present, then it means that either the global option to receive passive host check results is disabled, or it is disabled for this particular object. The first thing you should do is to make sure that your main Nagios configuration file contains the following line:

accept_passive_host_checks=1

Next, you should check your configuration to see whether the host definition has passive checks enabled as well. If not, simply add the following directive to the object definition:

passive_checks_enabled 1
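The main configuration options mentioned so far can also be verified with a short script. This is a sketch under assumptions: the function names are hypothetical, the path to the main configuration file is passed in by the caller, and options are expected in the key=value form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key=value option from a file.
# $1 = configuration file, $2 = option name
nagios_option() {
    awk -F= -v key="$2" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }      # trim the option name
        k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v }
    ' "$1"
}

# Report each logging or passive check option that is not set to 1.
# $1 = main Nagios configuration file
check_passive_options() {
    for opt in log_external_commands log_passive_checks accept_passive_host_checks; do
        [ "$(nagios_option "$1" "$opt")" = "1" ] || echo "$opt is not enabled in $1"
    done
}
```

Calling check_passive_options with the path to your main configuration file prints nothing when all three options are enabled.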

If you have misspelled the name of the host object, then the following will be logged:

Warning: Passive check result was received for host 'host01',
but the host could not be found!

In this case, make sure that your host name is correct.
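Typing the external command by hand is error-prone, so the manual test can be wrapped in a helper. The following is a minimal sketch: both function names and the NAGIOS_CMD_PIPE variable are hypothetical, and the default pipe path is the one used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: format a passive host check result.
# $1 = host name, $2 = status code, $3 = output text,
# $4 = optional timestamp (defaults to the current time)
format_host_result() {
    printf '[%s] PROCESS_HOST_CHECK_RESULT;%s;%s;%s\n' \
        "${4:-$(date +%s)}" "$1" "$2" "$3"
}

# Write the formatted command to the Nagios command pipe.
submit_host_result() {
    format_host_result "$1" "$2" "$3" \
        > "${NAGIOS_CMD_PIPE:-/var/nagios/rw/nagios.cmd}"
}

# Print the command line without sending it anywhere.
format_host_result host1 2 test
```

Keeping the formatting separate from the write makes it possible to inspect the exact line before it is sent to Nagios.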

Similar checks can also be done for services. You can run the following command to check if a passive service check is being handled correctly by Nagios:

echo "[$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;host1;APT;0;test" \
    >/var/nagios/rw/nagios.cmd

Again, host1 should be replaced by the actual host name, and APT needs to be an existing service for that host. After a few seconds, the following entries in the Nagios log file indicate that the result has been successfully parsed:

EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;host1;APT;0;test
PASSIVE SERVICE CHECK: host1;APT;0;test

If the second line is not in the log file, either the option to accept service passive checks is disabled on a global basis, or this particular service has the option to accept passive check results disabled. You should start by making sure that your main Nagios configuration file contains the following line:

accept_passive_service_checks=1

You should also make sure that the service definition has passive checks enabled as well, and if not, add the following directive to the object definition:

passive_checks_enabled 1

If you have misspelled the name of the host or service, then the following will be logged:

Warning: Passive check result was received for service 'APT' on host
'host1', but the service could not be found!
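The log checks described in this section can be combined into one diagnostic step. The sketch below uses a hypothetical function name and assumes that each log entry occupies a single line in the log file and has the wording shown above; it scans the whole file, so older entries for the same object can influence the result.

```shell
#!/bin/sh
# Hypothetical helper: report how far a passive check result got.
# $1 = Nagios log file, $2 = host name, $3 = optional service description
diagnose_passive() (
    # The body runs in a subshell so the variables below stay local.
    if [ -n "${3:-}" ]; then
        kind=SERVICE; target="$2;$3"
    else
        kind=HOST; target="$2"
    fi
    if grep -F -q "PASSIVE $kind CHECK: $target;" "$1"; then
        echo "result accepted"
    elif grep -F "Passive check result was received for" "$1" | grep -F -q "'$2'"; then
        echo "host or service name not found"
    elif grep -F -q "EXTERNAL COMMAND: PROCESS_${kind}_CHECK_RESULT;$target;" "$1"; then
        echo "command received, but passive checks are disabled globally or for this object"
    else
        echo "command not received: check the command pipe and its permissions"
    fi
)
```

For example, diagnose_passive followed by the log file path, host1 and APT checks the service result sent in the earlier command.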
Learning Nagios 3.0
  • A comprehensive configuration guide to monitor and maintain your network and systems
  • Secure and monitor your network system with open-source Nagios version 3
  • Set up, configure, and manage the latest version of Nagios
  • In-depth coverage for both beginners and advanced users

http://www.packtpub.com/guide-for-learning-nagios-3/book

 

Troubleshooting SSH-Based Checks

If you have followed the steps from the previous sections carefully, then most probably, everything should be working smoothly. However, in some cases, your setup might not be working properly, and you will need to find the root cause of the problem.

The first thing that you should start with is to use the check_ssh plugin to make sure that SSH is accepting connections on the host we are checking. For example, we can run the following command:

root@ubuntu1:~# /opt/nagios/plugins/check_ssh -H 192.168.2.51
SSH OK - OpenSSH_4.7p1 Debian-8ubuntu1.2 (protocol 2.0)

Here, 192.168.2.51 is the name or IP address of the remote machine we want to monitor. If no SSH server is set up on the remote host, the plugin will return a Connection refused status, and if the host could not be reached at all, the result will state No route to host.

In these cases, you need to make sure that the SSH server is running, and that no routers or firewalls filter out connections to SSH, which uses TCP port 22.

Assuming the SSH server is accepting connections, the next thing to check is whether SSH key-based authentication works correctly. To do this, switch to the user the Nagios process is running as. Next, try to connect to the remote machine. The following are sample commands to perform this check:

root@ubuntu1:~# su - nagios
$ ssh -v 192.168.2.51

This way, you check the connectivity as the same user that Nagios runs its checks as. You can also analyze the debugging messages that the SSH client prints because of the -v option. If the SSH client prompts you for a password, then your keys are not set up properly. It is a common mistake to set up keys on the root account instead of on the Nagios account. If this is the case, create a new set of keys as the correct user and verify that these keys work.
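For a non-interactive version of this test, the SSH client can be told to fail instead of prompting for a password. The sketch below relies on the OpenSSH BatchMode and ConnectTimeout options; the function name check_key_auth and the SSH_BIN variable are hypothetical, and the function should be run as the user the Nagios process is running as.

```shell
#!/bin/sh
# Hypothetical helper: test whether key-based login works without a prompt.
# $1 = remote host; SSH_BIN may name a different ssh client to use.
check_key_auth() {
    # BatchMode=yes disables password prompts, so a missing or wrong key
    # makes ssh exit with an error instead of waiting for input.
    if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 "$1" true; then
        echo "key-based login to $1 works"
    else
        echo "key-based login to $1 failed: check the keys of the current user"
    fi
}

# Usage, as the nagios user:
#   check_key_auth 192.168.2.51
```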

Assuming this step worked fine, the next thing to be done is to check whether invoking an actual check command produces correct results. For example:

root@ubuntu1:~# su - nagios
$ ssh 192.168.2.51 /opt/nagios/plugins/check_procs
PROCS OK: 51 processes

The last check is to make sure that the check_by_ssh plugin also returns correct information. An example of how to do this is as follows:

root@ubuntu1:~# su - nagios
$ /opt/nagios/plugins/check_by_ssh -H 192.168.2.1 \
    /opt/nagios/plugins/check_procs
PROCS OK: 52 processes

If the last step also worked correctly, it means that all check commands are working correctly. If you still have issues with running the checks, the next thing to investigate is whether Nagios has been properly configured, that is, whether all commands, hosts, and services are set up correctly.

Troubleshooting NRPE

Our NRPE configuration should now be complete and working as expected.

However, in some cases, for example, if there is a firewall issue or an invalid configuration, the NRPE-based checks may not work correctly. There are some steps that you can take to determine the root cause of the problem.

The first thing that should be checked is whether Nagios server can connect to the NRPE process on the remote machine. Assuming that we want to use NRPE on 192.168.2.1, we can check if NRPE accepts connections by using check_tcp from the Nagios plugins. By default, NRPE uses port 5666, which we'll also use in the following example, which shows how to check this:

$ /opt/nagios/plugins/check_tcp -H 192.168.2.1 -p 5666
TCP OK - 0.009 second response time on port 5666|time=0.008794s;;;0.000000;10.000000

If NRPE is not set up on the remote host, the plugin will return Connection refused. If the connection could not be established, the result will be No route to host. In these cases, you need to make sure that the NRPE server is working and that traffic to the TCP port NRPE is listening on is not blocked by firewalls.

The next step is to try to run an invalid command and check the output from the plugin. The following is an example that assumes that the dummycommand is not defined in the NRPE configuration on the remote machine:

$ /opt/nagios/plugins/check_nrpe -H 192.168.2.1 -c dummycommand
NRPE: Command 'dummycommand' not defined

If you received a CHECK_NRPE: Error - Could not complete SSL handshake error or something similar, it means that NRPE is not configured to accept connections from your machine, either via the allowed_hosts option in the NRPE configuration or in the inetd configuration.

In order to check this, log on to the remote machine and search the system logs for nrpe. For example, on most systems, this would be:

# grep nrpe /var/log/syslog /var/log/messages
(...)
ubuntu1 nrpe[3023]: Host 192.168.2.13 is not allowed to talk to us!

This indicates that your Nagios server is not added to the list of allowed hosts in the NRPE configuration. Add it in the allowed_hosts option and restart the NRPE process.
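Whether an address is already listed can be checked directly against the NRPE configuration file. This is a sketch with a hypothetical function name; it assumes allowed_hosts is a comma-separated list on one line, takes the path to the configuration file from the caller, and performs exact string matching only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an address is listed in allowed_hosts.
# $1 = NRPE configuration file, $2 = IP address of the Nagios server
nrpe_host_allowed() {
    awk -F= -v ip="$2" '
        $1 ~ /^[ \t]*allowed_hosts[ \t]*$/ {
            n = split($2, hosts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]/, "", hosts[i])     # ignore spaces around entries
                if (hosts[i] == ip) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage, with the path to your NRPE configuration file as the first argument:
#   nrpe_host_allowed <nrpe-config-file> 192.168.2.13 || echo "not allowed"
```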

Another error message that could be returned by the check_nrpe command is CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. This message usually means that you have passed arguments or invalid characters in the command name and the NRPE server refused the request because of these. Looking at the remote server's logs will usually provide more detailed information:

# grep nrpe /var/log/syslog /var/log/messages
(...)
ubuntu1 nrpe[3023]: Error: Request contained command arguments!
ubuntu1 nrpe[3023]: Client request was invalid, bailing out...

In this situation, you need to either enable command arguments in the NRPE configuration (the dont_blame_nrpe option) or change the Nagios configuration so that it does not pass arguments over NRPE.

Another possibility is that the check returns CHECK_NRPE: Socket timeout after 10 seconds or a similar message. In this case, the check command did not complete within the configured time. You may need to increase the command_timeout in the NRPE configuration.
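The check_nrpe messages covered in this section can be collected into a quick reference. The following sketch uses a hypothetical function name, and the hints only restate the causes described above.

```shell
#!/bin/sh
# Hypothetical helper: map a check_nrpe message to the likely cause.
explain_nrpe_error() {
    case "$1" in
        *"not defined"*)
            echo "NRPE is reachable; the command is missing from the remote NRPE configuration" ;;
        *"Could not complete SSL handshake"*)
            echo "connection rejected: check allowed_hosts or the inetd configuration on the remote host" ;;
        *"Received 0 bytes from daemon"*)
            echo "request refused: check the remote logs, for example for rejected command arguments" ;;
        *"Socket timeout"*)
            echo "check did not finish in time: consider raising command_timeout" ;;
        *)
            echo "unrecognized message: search the remote system logs for nrpe" ;;
    esac
}

explain_nrpe_error "NRPE: Command 'dummycommand' not defined"
```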

Summary

In this article, we have learned about troubleshooting Nagios 3.0, including the web interface, passive checks, SSH-based checks, and NRPE. The article covered common errors along with their solutions and a detailed explanation of each.

About the Author

Wojciech Kocjan is a system administrator and a programmer with 10 years of experience. His expertise includes managing Linux, Sun, and IBM servers. He also has several years of experience in a variety of open source projects.

 
