You're reading from OpenStack Essentials - Second Edition

Product type: Book
Published in: Aug 2016
Publisher: Packt
ISBN-13: 9781786462664
Edition: 2nd Edition
Author (1)
Dan Radez

Dan Radez joined the OpenStack community in 2012 in an operator role. His experience is focused on installing, maintaining, and integrating OpenStack clusters. He has been given the opportunity to internationally present OpenStack content to a range of audiences of varying expertise. In January 2015, Dan joined the OPNFV community and has been working to integrate RDO Manager with SDN controllers and the networking features necessary for NFV. Dan's experience includes web application programming, systems release engineering, and virtualization product development. Most of these roles have had an open source community focus to them. In his spare time, Dan enjoys spending time with his wife and three boys, training for and racing triathlons, and tinkering with electronics projects.

Chapter 12. Monitoring

As an OpenStack cluster is scaled out, the number of moving parts that can get jammed increases. As you have seen, each server added to the cluster runs more than one service. These services interact and communicate with one another across the cluster, using different communication methods and a unique endpoint for each service. The result is a web of interdependence that can be very difficult to debug when something goes wrong. Monitoring all the moving parts can save a large amount of time and hassle when you are trying to figure out why things have stopped working.

In this chapter, we will look at setting up monitoring for the cluster to help you have a detailed view of the general health of a running OpenStack cluster.

Monitoring defined


There are two classifications of monitoring: performance monitoring and availability monitoring. Performance monitoring shows the performance of what is being monitored over time. Availability monitoring shows the status of what is being monitored at a point in time. Often, the same things are monitored, but the purposes of the two types of monitoring are different. As an example, if a server's CPU utilization is being monitored, availability monitoring checks the CPU utilization and, if it breaches a certain threshold, alerts an operator that the utilization is high or has remained high over the most recent checks. Performance monitoring keeps track of the CPU utilization over the longer term and most likely creates a graph to show the trend of CPU utilization on a server across days, weeks, or longer.
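To make the distinction concrete, here is a minimal Python sketch of the two views applied to the same CPU readings. The function names, the thresholds, and the sample values are illustrative assumptions and are not part of Nagios or of the book's configuration.

```python
from statistics import mean

# Illustrative thresholds (assumptions, not taken from the book).
WARNING_PCT = 80.0
CRITICAL_PCT = 95.0


def availability_status(cpu_pct, warning=WARNING_PCT, critical=CRITICAL_PCT):
    """Availability view: classify a single, current reading."""
    if cpu_pct >= critical:
        return "CRITICAL"
    if cpu_pct >= warning:
        return "WARNING"
    return "OK"


def performance_trend(samples, window=3):
    """Performance view: moving average over a history of readings."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(samples[i - window + 1 : i + 1])
        for i in range(window - 1, len(samples))
    ]


if __name__ == "__main__":
    history = [20.0, 35.0, 50.0, 90.0, 97.0]  # made-up CPU percentages
    print(availability_status(history[-1]))   # status right now
    print(performance_trend(history))         # smoothed trend over time
```

The availability function only looks at the latest value and answers "is there a problem now?", while the trend function needs the whole history and answers "where is this heading?".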

In this chapter, we will focus on availability monitoring to be able to determine the current health of an OpenStack cluster based on the current status...

Installing Nagios


Nagios is an open source monitoring tool that is well known and widely accepted by system administrators. There are other open source options available as well, such as Zabbix or Sensu. We won't be able to get into those here; just know that they are available and can help with your monitoring needs.

There are plans in the works for monitoring installation to be added to a Triple-O deployment. Keep watch on the community for progress that is made on this. For now, we will install Nagios and look at what configurations can be dropped in to monitor an OpenStack installation. Start by installing Nagios, setting it to start on boot and starting the service:

undercloud# sudo yum install nagios nagios-plugins-all nagios-plugins-nrpe -y
undercloud# sudo systemctl enable nagios
undercloud# sudo systemctl start nagios

When Nagios is installed, it adds configuration to Apache to serve a web page where you can see the status. Open http://192.0.2.1/nagios/ in a web browser. The default username...

Monitoring methods


As you begin to design availability monitoring for your cloud, there are at least three schools of thought on the kinds of checks that should be executed. These should be mixed and matched as you deem appropriate to establish the coverage you need to monitor the services in your OpenStack cluster. You may also come across other methods of designing health checks that can be mixed with what is discussed in this chapter.

The first type of check is the service status check. This type of check runs a simple Linux service status check on each of the services. If the service status script reports that the service is running, the health check is successful. The problem with relying on these checks is that many OpenStack services can automatically heal from a loss of communication with each other. A service check can therefore succeed on an OpenStack service that is up and running but is actively attempting to reconnect to the database or to the message bus. OpenStack...
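As a rough illustration of a service status check, the following Python sketch maps the state printed by `systemctl is-active` to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the choice of which states count as a warning are assumptions made for this example; this is not the script the book uses.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(state):
    """Map the text printed by `systemctl is-active` to a Nagios result."""
    state = state.strip()
    if state == "active":
        return OK, "OK: service is active"
    # Treating transitional states as a warning is an assumption of this sketch.
    if state in ("activating", "reloading"):
        return WARNING, f"WARNING: service is {state}"
    if state in ("inactive", "failed", "deactivating"):
        return CRITICAL, f"CRITICAL: service is {state}"
    return UNKNOWN, f"UNKNOWN: unexpected state {state!r}"


def check_service(unit):
    """Ask systemd for the state of one unit and classify it."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, f"UNKNOWN: could not run systemctl: {exc}"
    return classify(result.stdout)
```

A wrapper script would print the message and exit with the returned code. Note that "active" only says the process is running; as described above, it does not prove that the service can reach the database or the message bus.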

Non-OpenStack service checks


We are not going to cover generic non-OpenStack service checks in depth here. There is plenty of information on the Internet that can guide you on generic service checks. We will put these and the OpenStack service checks into /etc/nagios/conf.d/nagios_service.cfg. For OpenStack, it is important to add at least a host load check and a disk usage check for each host. OpenStack can consume an excessive amount of disk space and processor load, and the whole cluster can become cranky very quickly if either is pushed beyond a host's capacity. There are many other generic checks that can, and maybe should, be added to your OpenStack hosts, though you will have to research them and choose the checks that you deem advantageous. Here are examples of the configurations for checking the load and disk space on /var:

define service { 
check_command check_nrpe!load5 
host_name control 
normal_check_interval 5 
service_description 5 minute load average 
use...
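For orientation, the following Python sketch shows the kind of measurement a load or disk check performs on the host before it reports back to Nagios. The threshold values and function names are assumptions chosen for illustration; the actual NRPE commands behind the configuration above may be implemented differently.

```python
import os
import shutil

# Nagios plugin exit codes used by this sketch.
OK, WARNING, CRITICAL = 0, 1, 2


def threshold_status(value, warning, critical):
    """Compare one measurement against warning and critical thresholds."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def disk_used_percent(path="/var"):
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def load5():
    """Return the 5-minute load average (second value of os.getloadavg)."""
    return os.getloadavg()[1]


def check_host(path="/var"):
    """Evaluate both measurements with hypothetical thresholds."""
    return {
        "disk": threshold_status(disk_used_percent(path), warning=80.0, critical=90.0),
        "load5": threshold_status(load5(), warning=4.0, critical=8.0),
    }
```

Sensible load thresholds depend on the number of CPU cores in the host, so the values shown here should be treated only as placeholders.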

Monitoring control services


The control tier of an OpenStack cloud has the most moving parts that will need to be monitored. There are a few services that need at least a basic service connection validation. They include, but are not limited to, MySQL, RabbitMQ, and MongoDB. More monitoring can certainly be added beyond simple connection checks to monitor connections, queue sizes, and other statistics of the services. For now, we'll just add a connection check to make sure that these services are running:

define service { 
check_command check_mysql!nagios!nagios_password
host_name control 
service_description MySQL Health check 
use generic-service 
}
define service { 
check_command check_nrpe!check_rabbitmq_aliveness 
host_name control 
service_description RabbitMQ service check 
use generic-service 
}
define service { 
check_command check_nrpe!check_mongod_connect
host_name control 
service_description MongoDB service check 
use generic-service 
}
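The idea of a basic connection check can be sketched in Python as a plain TCP connection attempt. This is only an approximation under stated assumptions: the ports are the usual defaults for MySQL, RabbitMQ (AMQP), and MongoDB, the function names are made up for this example, and a real plugin such as check_mysql goes further by logging in with credentials.

```python
import socket

# Usual default ports; adjust them if your deployment differs.
DEFAULT_PORTS = {"mysql": 3306, "rabbitmq": 5672, "mongodb": 27017}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_control_services(host, ports=None):
    """Try each service port on `host` and report which ones accept connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: can_connect(host, port) for name, port in ports.items()}
```

Calling `check_control_services("control")` would return a dictionary with one boolean per service, assuming the name `control` resolves to the control node. A successful connection shows that something is listening on the port, not that the service behind it is healthy.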

You can get the scripts for Rabbit and...

Monitoring network services


Next, let's take a look at monitoring networking services. Networking services usually stay running, and when something goes wrong, the problem tends to be inside the running service. We will put a service status check on each of them and add additional checks to make sure things are working across the board. Start by giving each of the network services a service status check, the same check that the control services got:

neutron-dhcp-agent
neutron-l3-agent
neutron-lbaas-agent
neutron-metadata-agent
neutron-metering-agent
neutron-openvswitch-agent
neutron-ovs-cleanup
openvswitch
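Writing one service definition per entry by hand is repetitive, so a small generator can help. The Python sketch below renders a Nagios service block for each service in the list above, following the layout of the definitions shown earlier in this chapter. The host name `network` and the NRPE command names of the form `check_service_<name>` are hypothetical; substitute whatever host and status check commands your installation actually defines.

```python
NETWORK_SERVICES = [
    "neutron-dhcp-agent",
    "neutron-l3-agent",
    "neutron-lbaas-agent",
    "neutron-metadata-agent",
    "neutron-metering-agent",
    "neutron-openvswitch-agent",
    "neutron-ovs-cleanup",
    "openvswitch",
]


def service_definition(service, host="network"):
    """Render one Nagios service block for a service status check."""
    # "check_service_<name>" is a hypothetical NRPE command name.
    return "\n".join([
        "define service {",
        f"check_command check_nrpe!check_service_{service}",
        f"host_name {host}",
        f"service_description {service} service status",
        "use generic-service",
        "}",
    ])


def render(services, host="network"):
    """Render the definitions for every service, one block after another."""
    return "\n".join(service_definition(name, host) for name in services)


if __name__ == "__main__":
    print(render(NETWORK_SERVICES))
```

The printed output could be appended to /etc/nagios/conf.d/nagios_service.cfg once the command names match your NRPE configuration.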

Now, let's look at what can be monitored to make sure that when these services say that they are running, the network service is actually running. The configuration we have used in this book uses VXLAN tunnels to build overlay networks for OpenStack tenants. What this means is that each compute node is connected to the network node and to each other with VXLAN tunnels that encapsulate...

Monitoring compute services


The final set of services to monitor is the one on the compute node. Here, you can make sure a couple of services are running and add the ping from the section you just finished. Start with the generic service status check for these services:

neutron-openvswitch-agent
openvswitch
neutron-ovs-cleanup
openstack-ceilometer-compute
openstack-nova-compute

Then add a service configuration to Nagios that will run the ping command to check your tunnel connectivity:

define service {
check_command check_nrpe!check_ovs_tunnel
host_name compute
service_description OVS tunnel connectivity
use generic-service
}

As you can see, this is just an NRPE check command that will execute a ping from the compute node to the network node.
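A ping-based check of this kind can be sketched in Python as follows. This is an illustration rather than the script behind check_ovs_tunnel: the function names are invented, the target address is whatever tunnel endpoint you choose to test, and the `-c` and `-W` flags assume the Linux iputils version of ping.

```python
import subprocess

# Nagios plugin exit codes used by this sketch.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def build_ping_command(target, count=3, timeout_s=2):
    """Build a Linux ping command: `count` probes, `timeout_s` wait per reply."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), target]


def check_tunnel(target, run=subprocess.run):
    """Ping `target` and translate the ping exit status into a Nagios result.

    `run` is injectable so the logic can be exercised without a network.
    """
    try:
        result = run(
            build_ping_command(target),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, f"UNKNOWN: could not run ping: {exc}"
    if result.returncode == 0:
        return OK, f"OK: {target} answered ping"
    if result.returncode == 1:
        return CRITICAL, f"CRITICAL: no reply from {target}"
    return UNKNOWN, f"UNKNOWN: ping exited with status {result.returncode}"
```

A wrapper would print the message and exit with the returned code, so that NRPE can hand the result back to Nagios.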

Summary


As a final word of caution, remember that successful health checks across a cluster do not equate to a positive end user experience. Make sure to be in communication with end users about their experience, and use the cluster for your own purposes to ensure you are familiar with the experience the end user is receiving.

In this chapter, we have gone through a list of items that should be checked to monitor the health of an OpenStack cluster; this list is not exhaustive though. The best practice is to keep an eye out for possible points of failure and add checks that make sure that something that could potentially degrade services is monitored for its health.

The last topic for us to cover is troubleshooting. When these health checks start to alert, how should you go about diagnosing the problem and resolving the issue? In the final chapter, we will take a look at how to troubleshoot each of OpenStack's components.
