Many monitoring tools are available on the market, but the important things to keep in mind are as follows:
Nagios is a powerful monitoring system that provides you with instant awareness about your organization's mission-critical IT infrastructure.
By using Nagios, you can do the following:
Plan the release cycle and the rollouts, before things are outdated
Detect problems early, before they cause an outage
Have automation and a better response across the organization
Find hindrances in the infrastructure, which could impact the SLAs
The Nagios architecture was designed with flexibility and scalability in mind. It consists of a central server, referred to as the Monitoring Server, and clients, the Nagios agents, which run on each node that needs to be monitored.
The checks can be performed for service, port, memory, disk, and so on, by using either active checks or passive checks. The active checks are initiated by the Nagios server and the passive checks are initiated by the client. Its flexibility allows us to have programmable APIs and customizable plugins for monitoring.
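As a rough illustration of the passive path, a client-side script can format its own result and hand it to the Nagios server as an external command. The sketch below only builds that line. The PROCESS_SERVICE_CHECK_RESULT command and its semicolon-separated fields come from the Nagios external command interface; the function name, host name, and service description are hypothetical.

```shell
#!/bin/sh
# Build a passive service check result in the Nagios external command format:
#   [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
# format_passive_result is a hypothetical helper name, not part of Nagios.
format_passive_result() {
    ts="$1"; host="$2"; service="$3"; code="$4"; output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$ts" "$host" "$service" "$code" "$output"
}

# Example: report an OK result (return code 0) for a hypothetical node.
format_passive_result "$(date +%s)" "datanode01" "Disk Usage" 0 "DISK OK - 42% used"
```

On the server, a line of this form is written to the external command file, which is /usr/local/nagios/var/rw/nagios.cmd assuming a default source installation.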
Prerequisites for installing and configuring Nagios
Nagios is an enterprise-class monitoring solution that can manage a large number of nodes. It scales easily, and it lets you write custom plugins for your applications. Nagios is flexible and powerful, and it supports many configurations and components.
Tip
Nagios is such a vast and extensive product that this chapter is in no way a reference manual for it. This chapter is written with the primary aim of setting up monitoring, as quickly as possible, and familiarizing the readers with it.
Always set up a separate host as the monitoring node/server and do not install other critical services on it. The number of hosts that are monitored can be a few thousand, with each host having from 15 to 20 checks that can be either active or passive.
Before starting with the installation of Nagios, make sure that Apache HTTP Server version 2.0 is running and that gcc and gd have been installed. Make sure that you are logged in as root or as a user with sudo privileges. Nagios runs on many platforms, such as RHEL, Fedora, Windows, CentOS; however, in this book we will use the CentOS 6.5 platform.
Let's look at the installation of Nagios, and how we can set it up. The following steps are for RHEL, CentOS, Fedora, and Ubuntu:
Download Nagios and the Nagios plugin from the Nagios repository, which can be found at http://www.nagios.org/download/.
The latest stable version of Nagios at the time of writing this chapter was nagios-4.0.8.tar.gz.
Create a Nagios user to manage the Nagios interface. You have to execute the commands either as root or with sudo privileges.
You can download it either from http://sourceforge.net/ or from any other commercial site, but a few sites might ask for registration.
Create a new nagcmd group so that external commands can be submitted through the web interface.
If you prefer, you can download the file directly into the user's home directory.
Add both the Nagios user and the Apache user to the nagcmd group.
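The account and group steps above can be sketched as follows. This is a dry-run sketch: commands are printed unless APPLY=1 is set, and the run helper and APPLY variable are conveniences introduced here, not part of Nagios. The user name nagios is an assumption, and the Apache account is assumed to be apache, as on CentOS.

```shell
#!/bin/sh
# Dry-run by default: each command is printed rather than executed.
# Set APPLY=1 and run as root (or through sudo) to execute the commands.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run useradd -m nagios              # account that will own the Nagios installation
run groupadd nagcmd                # group allowed to submit external commands
run usermod -a -G nagcmd nagios    # add the nagios user to nagcmd
run usermod -a -G nagcmd apache    # "apache" is the CentOS account name (assumption elsewhere)
```

On distributions where Apache runs under a different account, such as www-data, that name would replace apache in the last command.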
Let's start with the configuration.
Navigate to the directory where the package was downloaded. The downloaded package could be either in the Downloads folder or in the present working directory.
Tip
On Red Hat, the ./configure command might hang while displaying its messages. If that happens, add --enable-redhat-pthread-workaround to the ./configure command as a workaround.
Web interface configuration
After installing Nagios, we need to do a minimal level of configuration. Explore the /usr/local/nagios/etc directory for a few samples.
Update /usr/local/nagios/etc/objects/contacts.cfg with the e-mail address on which you want to receive the alerts.
Secondly, we need to configure the web interface through which we will monitor and manage the services. Install the Nagios web configuration file in the Apache configuration directory using the following command:
The preceding command will work only from the directory into which Nagios was extracted. Make sure that you have extracted Nagios from the TAR file and are in that directory.
Create a nagadm account for logging into the Nagios web interface using the following command:
Reload Apache so that it picks up the changes, using the following command:
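The three web interface steps above can be sketched as the following commands. This is a dry-run sketch that prints commands unless APPLY=1 is set; the run helper is introduced here for safety. It assumes a default source installation under /usr/local/nagios, the htpasswd.users file name used by the sample Apache configuration, and the CentOS httpd service name.

```shell
#!/bin/sh
# Dry-run by default: each command is printed rather than executed.
# Set APPLY=1 and run as root from the extracted Nagios source directory.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run make install-webconf                                       # install the Apache config for Nagios
run htpasswd -c /usr/local/nagios/etc/htpasswd.users nagadm    # create the web login (prompts for a password)
run service httpd reload                                       # make Apache re-read its configuration
```

The sample cgi.cfg grants permissions to a user named nagiosadmin, so a differently named account such as nagadm will likely also need to be listed in the authorized_for_* directives of that file.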
Open http://localhost/nagios/ in any browser on your machine.
If you see a message such as "Return code of 127 is out of bounds - plugin may be missing" in the right panel, your configuration is correct so far. The message indicates that the Nagios plugins are missing; the next step shows how to install them.
Nagios provides many useful plug-ins, such as check_disk, check_load, and many more, to get us started with monitoring all the basics. We can also write our own custom checks and use them alongside these plug-ins. Download the latest stable version of the plugins and then extract them. The following command lines help you in extracting and installing Nagios plugins:
After the installation of the core and the plug-in packages, we will be ready to start nagios.
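Returning to the custom checks mentioned earlier: a plugin is simply a program that prints one line of status text and exits with a code that Nagios interprets as 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The following is a minimal sketch of that convention; check_threshold and its thresholds are hypothetical and not a stock Nagios plugin.

```shell
#!/bin/sh
# Map a usage percentage to a Nagios status line and return code.
# Plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# check_threshold is a hypothetical helper, not a stock Nagios plugin.
check_threshold() {
    value="$1"; warn="$2"; crit="$3"
    case "$value" in
        ''|*[!0-9]*) echo "USAGE UNKNOWN - invalid value '$value'"; return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "USAGE CRITICAL - ${value}% used"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "USAGE WARNING - ${value}% used"; return 1
    fi
    echo "USAGE OK - ${value}% used"; return 0
}

# Example: percentage used on the root filesystem, read from POSIX df output.
used="$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')"
rc=0
check_threshold "$used" 80 90 || rc=$?
echo "return code: $rc"
```

A real plugin would end with exit "$rc" so that Nagios receives the code. A return code of 127, as in the message shown in the web interface, is what the shell reports when the plugin program cannot be found at all.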
Before starting the Nagios service, make sure that there are no configuration errors by using the following command:
Start the nagios service by using the following command:
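The verification and start steps above can be sketched as follows, again as a dry run that prints commands unless APPLY=1 is set. The paths assume a default source installation under /usr/local/nagios, and the service command assumes that the init script was installed (for example with make install-init) on a SysV-style system such as CentOS 6.

```shell
#!/bin/sh
# Dry-run by default: each command is printed rather than executed.
# Set APPLY=1 and run as root to execute the commands.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# -v makes Nagios parse the configuration and report errors without starting.
run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
run service nagios start
```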
There are many configuration files in Nagios, but the major ones are located under the /usr/local/nagios/etc directory:
The other configuration files under the /usr/local/nagios/etc/objects directory are described as follows:
The nagios.cfg file under /usr/local/nagios/etc/ is the main configuration file. Its directives define, among other things, which other configuration files are read, for example, cfg_dir=<directory_name>.
Nagios will recursively process all the configuration files in the subdirectories of the directory that you specify with this directive as follows:
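To preview which files a cfg_dir directive would pull in, the recursive lookup can be approximated with find. This is a sketch: list_cfg_files is a hypothetical helper, the directory layout is invented for the demonstration, and it relies on the fact that Nagios reads only files with a .cfg extension from such a directory.

```shell
#!/bin/sh
# List the files a cfg_dir directive would pick up: every *.cfg file in the
# directory and, recursively, in its subdirectories.
# list_cfg_files is a hypothetical preview helper, not a Nagios tool.
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Demonstration on a scratch tree with a hypothetical layout.
demo="$(mktemp -d)"
mkdir -p "$demo/servers/hadoop"
touch "$demo/servers/web.cfg" \
      "$demo/servers/hadoop/datanodes.cfg" \
      "$demo/servers/notes.txt"

echo "cfg_dir=$demo/servers"
list_cfg_files "$demo/servers"    # notes.txt is skipped because it lacks the .cfg extension
```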
Setting up monitoring for clients
The Nagios server can do an active or a passive check. If the Nagios server proactively initiates a check, then it is an active check. Otherwise, it is a passive check.
The following are the steps for setting up monitoring for clients:
Download the NRPE addon from http://www.nagios.org and then install check_nrpe.
Create a host and a service definition for the host to be monitored by creating a new configuration file, /usr/local/nagios/etc/objects/clusterhosts.cfg, for that particular group of nodes.
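A minimal sketch of such a file follows. The script writes it to a scratch directory for review rather than into the live configuration. The host name datanode01 and the address 192.0.2.10 are placeholders; linux-server and generic-service are assumed to be the templates shipped in the sample templates.cfg; the check_nrpe command definition follows the form commonly used with NRPE.

```shell
#!/bin/sh
# Write a sample clusterhosts.cfg to a scratch directory for review.
# datanode01 and 192.0.2.10 are placeholders; linux-server and generic-service
# are assumed to come from the sample templates.cfg.
out_dir="${OUT_DIR:-$(mktemp -d)}"
out="$out_dir/clusterhosts.cfg"
cat > "$out" <<'EOF'
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

define host{
        use             linux-server
        host_name       datanode01
        alias           Hadoop DataNode 01
        address         192.0.2.10
        }

define service{
        use                     generic-service
        host_name               datanode01
        service_description     Root Partition
        check_command           check_nrpe!check_disk
        }
EOF
echo "wrote $out"
```

After copying the reviewed file into /usr/local/nagios/etc/objects/, it still has to be referenced from the main configuration file with a cfg_file= line before Nagios reads it.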
Tip
Configuring a disk check
Communication among NRPE components:
On each of the client hosts, perform the following steps:
Install the Nagios Plugins and the NRPE addon, as explained earlier.
Create an account under which nagios will run; any username can be used.
Install nagios-plugin with the LD flags:
Change the ownership of the directories where nagios was installed to the nagios user:
Install NRPE and run it as a daemon:
Start the service after creating the /etc/xinetd.d/nrpe file with the IP of the server:
Modify the /usr/local/nagios/etc/nrpe.cfg configuration file:
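The kind of change typically made in nrpe.cfg can be sketched as follows. The script writes a fragment to a scratch directory instead of editing the live file. The address 192.0.2.1 is a placeholder for the Nagios server, and the thresholds are illustrative assumptions.

```shell
#!/bin/sh
# Write an nrpe.cfg fragment to a scratch directory for review.
# 192.0.2.1 stands in for the Nagios server; the thresholds are examples only.
out_dir="${OUT_DIR:-$(mktemp -d)}"
out="$out_dir/nrpe.cfg.fragment"
cat > "$out" <<'EOF'
# Hosts allowed to query this NRPE daemon.
allowed_hosts=127.0.0.1,192.0.2.1
# Run when the server requests "check_disk": warn below 20% free, critical below 10% free on /.
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
EOF
echo "wrote $out"
```

When NRPE is started through xinetd, access is controlled by the only_from setting in /etc/xinetd.d/nrpe and the allowed_hosts line is not consulted, so the server address should appear there as well.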
After getting a good insight into Nagios, we are ready to understand its deployment in the Hadoop clusters.
The second tool that we will look into is Ganglia, a tool for aggregating statistics and plotting them clearly. Nagios provides the events and alerts, whereas Ganglia aggregates metrics and presents them in a meaningful way. What if you want the total CPU and memory for a cluster of 2,000 nodes, or the total free disk space on 1,000 nodes? Plotting CPU and memory for one node is easy, but aggregating them across a group of nodes requires a tool built for the purpose.
Ganglia is an open source, distributed monitoring platform for collecting metrics across the cluster. It can do aggregation on CPU, memory, disk I/O, and many more components across a group of nodes. There are alternate tools, such as Cacti and Munin, but Ganglia scales very well for large enterprises.
Some of the key features of Ganglia are as follows:
We will now discuss some components of Ganglia.
Ganglia Monitoring Daemon (gmond): It runs on the nodes that need to be monitored, and it captures the state change and sends updates to a central daemon by using XDR.
Ganglia Meta Daemon (gmetad): It collects data from gmond and the other gmetad daemons. The data is indexed and stored on the disk in a round robin fashion. There is also a Ganglia front-end for a meaningful display of the information collected.
Let's begin by setting up Ganglia and look at the important parameters that need attention. Ganglia can be downloaded from http://ganglia.sourceforge.net/. Perform the following steps to install Ganglia:
Install gmond on the nodes that need to be monitored:
Restart the Ganglia service:
Install gmetad on the master node. It can be downloaded from http://ganglia.sourceforge.net/:
Update the gmetad.conf file, which tells gmetad where to collect the data from, along with the name of the data source:
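A data source entry in gmetad.conf names a cluster, optionally gives a polling interval in seconds, and lists one or more gmond hosts to poll. The sketch below builds such a line. The helper name, cluster name, and host names are hypothetical, and 8649 is assumed to be the gmond port in use (it is the usual default).

```shell
#!/bin/sh
# Build a gmetad.conf data_source line:
#   data_source "<cluster name>" <polling interval> <host:port> [<host:port> ...]
# data_source_line is a hypothetical helper; the names below are placeholders.
data_source_line() {
    name="$1"; interval="$2"; shift 2
    printf 'data_source "%s" %s' "$name" "$interval"
    for node in "$@"; do
        printf ' %s' "$node"
    done
    printf '\n'
}

# Poll the gmond on a hypothetical master node every 15 seconds.
data_source_line "hadoop-cluster" 15 master.example.com:8649
```

Listing more than one host for the same cluster gives gmetad a fallback if the first gmond is unreachable.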
Update the gmond.conf file on all the nodes so that they point to the master node and use the same cluster name.
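The relevant gmond.conf sections might look like the following sketch, which assumes a unicast setup in which every node sends its metrics to the master. The cluster name and host name are placeholders and must match the values used in gmetad.conf. The script writes the fragment to a scratch directory for review.

```shell
#!/bin/sh
# Write a gmond.conf fragment to a scratch directory for review.
# "hadoop-cluster" and master.example.com are placeholders; unicast is assumed.
out_dir="${OUT_DIR:-$(mktemp -d)}"
out="$out_dir/gmond.conf.fragment"
cat > "$out" <<'EOF'
cluster {
  name = "hadoop-cluster"
}

udp_send_channel {
  host = master.example.com
  port = 8649
}
EOF
echo "wrote $out"
```

In such a setup, the gmond on the master node also needs a udp_recv_channel and a tcp_accept_channel on the same port so that it can receive the metrics and serve them to gmetad.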