You're reading from OpenStack Essentials. - Second Edition

Product type: Book

Published in: Aug 2016

Publisher: Packt

ISBN-13: 9781786462664

Edition: 2nd Edition

Tools: OpenStack

Concepts: Cloud Computing

Author (1)

Dan Radez

Chapter 12. Monitoring

As an OpenStack cluster is scaled out, the number of moving parts that can get jammed increases. As you have seen, each server added to the cluster runs more than one service. These services interact and communicate with one another across the cluster, using different communication methods and a unique endpoint for each service. The result is a web of interdependence that can be very difficult to debug when something goes wrong. Monitoring all the moving parts can save a large amount of time and hassle when you are trying to figure out why things have stopped working.

In this chapter, we will look at setting up monitoring for the cluster to help you have a detailed view of the general health of a running OpenStack cluster.

Monitoring defined

There are two classifications of monitoring: performance monitoring and availability monitoring. Performance monitoring shows the performance of what is being monitored over time. Availability monitoring shows the status of what is being monitored at a point in time. Often, the same things are monitored, but the purposes of the two types of monitoring are different. Take a server's CPU utilization as an example. Availability monitoring checks the CPU utilization, and if it breaches a certain threshold, alerts an operator that the utilization is high or has remained high over the most recent checks. Performance monitoring keeps track of the CPU utilization over the longer term and most likely produces a graph that shows the trend of CPU utilization on a server across days, weeks, or longer.
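To make the distinction concrete, here is a minimal Python sketch that is not taken from the book. The function names, the 80 and 95 percent thresholds, and the sample readings are all illustrative assumptions; the status values follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
# Illustrative sketch only: names, thresholds, and sample data are assumptions.
from statistics import mean

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def availability_status(cpu_percent, warn=80.0, crit=95.0):
    """Availability view: classify a single reading taken right now."""
    if cpu_percent >= crit:
        return CRITICAL
    if cpu_percent >= warn:
        return WARNING
    return OK


def performance_trend(samples, window=3):
    """Performance view: smooth a history of readings into a moving average."""
    return [mean(samples[i:i + window]) for i in range(len(samples) - window + 1)]


history = [30.0, 60.0, 90.0, 96.0]           # hypothetical CPU readings in percent
current = availability_status(history[-1])   # answers "is it healthy right now?"
trend = performance_trend(history)           # answers "where is it heading?"
```

Both functions read the same measurements. The first reduces the latest reading to a status an operator can be alerted on, while the second keeps the history so that it can be graphed.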

In this chapter, we will focus on availability monitoring to be able to determine the current health of an OpenStack cluster based on the current status...

Installing Nagios

Nagios is an open source monitoring tool that is well known and widely accepted by system administrators. There are other open source options available as well, such as Zabbix and Sensu. We won't be able to get into those here; just know that they are available and can help with your monitoring needs.

There are plans in the works for monitoring installation to be added to a Triple-O deployment. Watch the community for progress on this. For now, we will install Nagios and look at what configurations can be dropped in to monitor an OpenStack installation. Start by installing Nagios, setting it to start on boot, and starting the service:

undercloud# sudo yum install nagios nagios-plugins-all nagios-plugins-nrpe -y
undercloud# sudo systemctl enable nagios
undercloud# sudo systemctl start nagios

When Nagios is installed, it adds configuration to Apache to serve a web page where you can see the status. Open http://192.0.2.1/nagios/ in a web browser. The default username...

Monitoring methods

As you begin to design availability monitoring for your cloud, there are at least three schools of thought on the kinds of checks that should be executed. These should be mixed and matched as you deem appropriate to establish the coverage you need to monitor the services in your OpenStack cluster. You may also come across other methods of designing health checks that can be mixed with what is discussed in this chapter.

The first type of check is the service status check. This type of check runs a simple Linux service status check on each of the services. If the service status script reports that the service is running, the health check succeeds. The problem with relying on these checks is that many OpenStack services can automatically heal from a loss of communication with each other. A service check can therefore pass on an OpenStack service that is up and running but is still actively attempting to reconnect to the database or to the message bus. OpenStack...
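A service status check of this kind can be sketched in a few lines of Python. This is not the book's check script: the function names are hypothetical, and the sketch assumes a systemd host where `systemctl is-active --quiet` exits with 0 only when the unit is active.

```python
# Illustrative sketch only: not the book's check script.
import subprocess

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def run_is_active(service):
    """Return the exit code of `systemctl is-active --quiet <service>`."""
    return subprocess.call(["systemctl", "is-active", "--quiet", service])


def check_service(service, runner=run_is_active):
    """Map the systemd state of a service to a Nagios status and message.

    `runner` is injectable so that the logic can be exercised without systemd.
    """
    if runner(service) == 0:
        return OK, f"OK: {service} is active"
    return CRITICAL, f"CRITICAL: {service} is not active"
```

Note that the sketch has exactly the weakness described above: it only asks whether the process is running, so it returns OK even while the service is still trying to reconnect to the database or the message bus.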

Non-OpenStack service checks

We are not going to cover generic non-OpenStack service checks in depth here. Plenty of information on generic service checks is available on the Internet. We will put these and the OpenStack service checks into /etc/nagios/conf.d/nagios_service.cfg. For OpenStack, it is important to add at least a host load check and a disk usage check for each host. OpenStack can consume an excessive amount of disk space and processor load, and the whole cluster can become unstable very quickly if either is pushed beyond the capacity of one of the hosts. There are many other generic checks that can, and perhaps should, be added to your OpenStack hosts, though you will have to research them and choose the checks that you deem advantageous. Here are examples of the configurations for checking the load and disk space on /var:

define service { 
check_command check_nrpe!load5 
host_name control 
normal_check_interval 5 
service_description 5 minute load average 
use...
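On the monitored host, the commands behind checks like these only need to compare a measurement with a threshold and report a status. The following Python sketch shows that logic under stated assumptions: the function names and the thresholds are hypothetical, and it is not the plugin that the configuration above invokes.

```python
# Illustrative sketch only: thresholds and function names are assumptions.
import os
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # conventional Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Compare a measurement against warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_load5(warn=4.0, crit=8.0):
    """Check the 5 minute load average (second field of os.getloadavg())."""
    load5 = os.getloadavg()[1]
    status = classify(load5, warn, crit)
    return status, f"{LABELS[status]}: 5 minute load average is {load5:.2f}"


def check_disk(path="/var", warn=80.0, crit=90.0):
    """Check how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free in bytes
    if usage.total == 0:
        return UNKNOWN, f"UNKNOWN: could not size the filesystem for {path}"
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    return status, f"{LABELS[status]}: {path} is {used_percent:.1f}% full"
```

Sensible thresholds depend on the host: a load of 4 means something different on a 4-core machine than on a 64-core one, so treat the defaults here as placeholders.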

Monitoring control services

The control tier of an OpenStack cloud has the most moving parts that will need to be monitored. There are a few services that need at least a basic service connection validation. They include, but are not limited to, MySQL, RabbitMQ, and MongoDB. More monitoring can certainly be added beyond simple connection checks to monitor connections, queue sizes, and other statistics of the services. For now, we'll just add a connection check to make sure that these services are running:

define service { 
check_command check_mysql!nagios!nagios_password
host_name control 
service_description MySQL Health check 
use generic-service 
}
define service { 
check_command check_nrpe!check_rabbitmq_aliveness 
host_name control 
service_description RabbitMQ service check 
use generic-service 
}
define service { 
check_command check_nrpe!check_mongod_connect
host_name control 
service_description MongoDB service check 
use generic-service 
}

You can get the scripts for Rabbit and...
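For illustration, the most basic form of connection validation is a TCP connect to the service port. The Python sketch below is an assumption-laden stand-in, not one of the scripts referred to in this section: `check_tcp` and `DEFAULT_PORTS` are hypothetical names, and the ports are the stock defaults for MySQL, RabbitMQ (AMQP), and MongoDB, which your deployment may have changed.

```python
# Illustrative sketch only: not one of the scripts referenced in this section.
import socket

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes

# Stock default ports; a deployment may use different ones.
DEFAULT_PORTS = {"mysql": 3306, "rabbitmq": 5672, "mongodb": 27017}


def check_tcp(host, port, timeout=3.0):
    """Report OK if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK: {host}:{port} accepted a connection"
    except OSError as exc:  # covers refused connections, timeouts, DNS failures
        return CRITICAL, f"CRITICAL: {host}:{port} unreachable ({exc})"
```

A connect test like this is weaker than a check such as check_mysql, which also logs in with credentials. It proves only that something is listening on the port.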

Monitoring network services

Next, let's take a look at monitoring networking services. Networking services usually stay running, and when something goes wrong, it tends to go wrong inside the running service. We will go ahead and put a service status check on each of them and then add additional checks to make sure things are working across the board. Start by giving each of the network services a service status check – the same kind of check that the control services got:

neutron-dhcp-agent
neutron-l3-agent
neutron-lbaas-agent
neutron-metadata-agent
neutron-metering-agent
neutron-openvswitch-agent
neutron-ovs-cleanup
openvswitch
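Writing one service definition per entry by hand is repetitive, so the definitions can be generated. The Python sketch below renders a block for each service in the list above, in the same layout as the definitions shown earlier in the chapter. The host name `network` and the NRPE command prefix `check_service_` are assumptions made for illustration; substitute the names that your own NRPE configuration defines.

```python
# Illustrative sketch only: host name and NRPE command names are assumptions.
NETWORK_SERVICES = [
    "neutron-dhcp-agent",
    "neutron-l3-agent",
    "neutron-lbaas-agent",
    "neutron-metadata-agent",
    "neutron-metering-agent",
    "neutron-openvswitch-agent",
    "neutron-ovs-cleanup",
    "openvswitch",
]


def render_service_check(service, host="network", command_prefix="check_service_"):
    """Render one Nagios service definition for a service status check."""
    return "\n".join([
        "define service {",
        f"check_command check_nrpe!{command_prefix}{service}",
        f"host_name {host}",
        f"service_description {service} service check",
        "use generic-service",
        "}",
    ])


def render_all(services=NETWORK_SERVICES, host="network"):
    """Render definitions for every service, ready to write to a .cfg file."""
    return "\n".join(render_service_check(name, host) for name in services)
```

The output of `render_all()` could be appended to /etc/nagios/conf.d/nagios_service.cfg once the matching NRPE commands exist on the monitored host.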

Now, let's look at what can be monitored to make sure that when these services say that they are running, the network service is actually running. The configuration we have used in this book uses VXLAN tunnels to build overlay networks for OpenStack tenants. What this means is that each compute node is connected to the network node and to each other with VXLAN tunnels that encapsulate...

Monitoring compute services

The final services to monitor are those on the compute node. Here, you can make sure a couple of services are running and add the ping check from the section you just finished. Start with the generic service status check for these services:

neutron-openvswitch-agent
openvswitch
neutron-ovs-cleanup
openstack-ceilometer-compute
openstack-nova-compute

Then add a service configuration to Nagios that will run the ping command to check your tunnel connectivity:

define service {
check_command check_nrpe!check_ovs_tunnel
host_name compute
service_description OVS tunnel connectivity
use generic-service
}

As you can see, this is just an NRPE check command that will execute a ping from the compute node to the network node.
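The command that NRPE runs on the compute node can be very small. The following Python sketch shows one way such a ping wrapper might look. It is not the book's check_ovs_tunnel script: the function names are hypothetical, the target address must be supplied by you, and the `-c` and `-W` options assume the Linux iputils version of ping.

```python
# Illustrative sketch only: not the book's check_ovs_tunnel script.
import subprocess

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def run_ping(target, count=3, wait_seconds=2):
    """Run ping quietly and return its exit code (0 means replies arrived)."""
    return subprocess.call(
        ["ping", "-c", str(count), "-W", str(wait_seconds), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def check_tunnel(target, runner=run_ping):
    """Map the ping result to a Nagios status and message.

    `runner` is injectable so that the logic can be exercised without a network.
    """
    if runner(target) == 0:
        return OK, f"OK: {target} answered ping"
    return CRITICAL, f"CRITICAL: {target} did not answer ping"
```

For the result to say anything about the tunnels, the target has to be an address that is reachable only through the overlay network rather than the management network.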

Summary

As a final word of caution, remember that successful health checks across a cluster do not equate to a positive end user experience. Make sure to be in communication with end users about their experience, and use the cluster for your own purposes to ensure you are familiar with the experience the end user is receiving.

In this chapter, we have gone through a list of items that should be checked to monitor the health of an OpenStack cluster. This list is not exhaustive, however. The best practice is to keep an eye out for possible points of failure and to add a health check for anything that could potentially degrade services.

The last topic for us to cover is troubleshooting. When these health checks start to alert, how should you go about diagnosing the problem and resolving the issue? In the last chapter, we will take a look at how to troubleshoot each of OpenStack's components.


Author (1)

Dan Radez

Dan Radez joined the OpenStack community in 2012 in an operator role. His experience is focused on installing, maintaining, and integrating OpenStack clusters. He has been given the opportunity to internationally present OpenStack content to a range of audiences of varying expertise. In January 2015, Dan joined the OPNFV community and has been working to integrate RDO Manager with SDN controllers and the networking features necessary for NFV. Dan's experience includes web application programming, systems release engineering, and virtualization product development. Most of these roles have had an open source community focus to them. In his spare time, Dan enjoys spending time with his wife and three boys, training for and racing triathlons, and tinkering with electronics projects.