CentOS High Availability

Chapter 1. Getting Started with High Availability

We live in a fast-paced world and, with all the technology surrounding us, we take it for granted most of the time. When we set the alarm clock in the evening before falling asleep, we never give much thought to whether the alarm will actually work in the morning; and when we turn the ignition key to drive off to work, we never stop to think whether the car will fail to start. On a normal day, there is hardly any chance of anything like this happening to us, so we can calmly go to sleep in the evening. The same applies to visiting our favorite website first thing in the morning. We are, in fact, more likely to expect that the car will not start or the bus will be late than that we will not be able to log in to our Facebook or Gmail account. No wonder: these sites are always online and ready to serve information whenever we request it.

Have you ever asked yourself, "How can this be?" We all know we cannot trust technology implicitly. Sooner or later, it can and it will fail. With such complex systems and technologies surrounding us, we are often not aware of how many systems must run flawlessly so that we can read our e-mails and check our Facebook walls. How did we become so sure that these complex systems will always provide what we require?

The answer to the question is high availability. Highly available systems are what made us all blind, in the belief that services are always there and on and failure is not an option. As the title of this book suggests, the objective is to familiarize you with how to achieve high availability, focusing on an actual, practical example of a three-node cluster configuration on CentOS Linux versions 6 and 7. A three-node cluster is chosen because the number of cluster nodes plays a key role in the cluster configuration process, which will be explained in more detail later in the book. You will become familiar with two different software solutions available for achieving high availability on CentOS Linux.

In the first chapter, you will learn about high availability in general. It will start by laying the foundation and explaining what high availability is, also describing what system design approaches must be followed to make an IT system highly available. We will explain the meaning of computer clusters, why we need them, and the possible computer cluster configurations.

The emphasis of this book is on the following topics:

A practical, hands-on user guide
Cluster software installation and configuration
Cluster resource configuration and management
Cluster node management
Cluster failover testing

What is high availability?

The general meaning of the word "availability" is a characteristic of a resource, whether a person or an object, that can be accessed or used. Resource availability can be measured, and therefore, a ratio of the time a resource is accessible or usable to the time the resource is inaccessible or unusable can be calculated. Adding the adjective "high" to the word "availability" suggests that the resource should be accessible and usable most of the time during a given time interval. The term "high availability" is commonly used in information technology, where it describes IT systems with a high level of availability.

High availability in IT refers to a system that is continuously operational and available for the delivery of services it provides for end users. The key point when talking about IT systems is the availability to deliver services to end users, since a system can be up-and-running from the IT administrator's perspective but can fail to provide services for end users, which makes it useless. There are a number of factors that can lead to service downtime, mainly because there are so many different layers that must work together to provide service availability.

An IT system usually consists of many different components. All of these components must be continuously available for a desirable length of time. Needless to say, it is very important for these highly available systems to be properly designed, well thought through, and thoroughly tested, with the goal of eliminating potential failures. In other words, high availability is a system design approach and service implementation that aims to provide the highest achievable level of performance and availability by eliminating all system-wide single points of failure.

Not every system can be marked highly available. It is common practice in IT to measure and calculate the availability of a system. Monitoring tools such as Nagios, Zenoss, or Zabbix can be used to provide reports on system availability and also alerts in the case of system unavailability. The measurements taken must reflect the actual availability of the system to the end user. By measuring and calculating availability, we can separate systems that are classified as highly available from systems that are not. System availability is commonly expressed as a percentage of system uptime in a given year.
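To make the yearly percentage more concrete, the following minimal Python sketch converts an availability percentage into the downtime it permits per year, and computes a percentage from measured uptime and downtime. It assumes a 365-day year, and the function names are illustrative only; they do not come from any monitoring tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def allowed_downtime_hours(availability_percent):
    """Return the hours of downtime per year permitted by an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def measured_availability_percent(uptime_hours, downtime_hours):
    """Return availability as the share of total time the service was usable."""
    total_hours = uptime_hours + downtime_hours
    if total_hours <= 0:
        raise ValueError("total measured time must be positive")
    return 100 * uptime_hours / total_hours


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(level)
        print(f"{level}% availability allows {hours:.4f} hours of downtime per year")
```

Under these assumptions, 99.9 percent availability allows roughly 8.76 hours of downtime per year, while 99.999 percent leaves a little over five minutes.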

System design

IT systems that offer high availability of services must follow a specific system design approach by which they can provide the most available continuous operation. The fundamental rule of a high-availability system design approach is to avoid single points of failure. A single point of failure is a component of a system that could lead to system and service downtime if it fails. The design should avoid single points of failure, which makes the system more robust and automatically increases system and service availability.

A complex IT system providing application services can have a large number of single points of failure at different levels, but how can we eliminate all of them? The solution is redundancy. Redundancy means duplication of the system's critical components. Duplication of devices allows continuous system operation even if one of the duplicated devices fails. There are two types of redundancy: passive and active. Passive redundancy means using two or more devices while only one of them provides its service at a certain point in time. The other devices wait to take over in the case of an unrecoverable failure of the operating device. Active redundancy means using two or more devices, all providing their service at all times. Even if one of the devices fails, the other devices continue to provide the service.
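The effect of duplication can be estimated with a short calculation. The sketch below is a simplification rather than a sizing method: it assumes that the duplicated devices fail independently of each other, that the service survives as long as at least one device works, and that takeover is instantaneous. The function name is illustrative.

```python
def combined_availability(single_availability, copies):
    """Estimate the availability of a group of redundant devices.

    single_availability: availability of one device as a fraction (0.0 to 1.0)
    copies: number of duplicated devices
    Assumption: failures are independent, so the group is down only when
    every device is down at the same time.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("at least one device is required")
    probability_all_down = (1 - single_availability) ** copies
    return 1 - probability_all_down
```

With these assumptions, two devices that are each available 99 percent of the time give a combined availability of about 99.99 percent. Real devices often share failure causes, such as the same power feed, so the actual figure can be lower.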

Let me try to explain single points of failure and redundancy with a practical example. Let's say you are hosting a simple website on your personal home computer. The computer is located at your home, hidden in your storage closet. It is happily providing a website service for end users. It is always on and users can access the website any time of the day. If you and I were ignorant, we could say that you are running a perfect solution with a perfect system design, especially since you are saving a fair amount of money, not paying for expensive hosting solutions at your local hosting service. But stop to think for a second and try to count the single points of failure in the system's design. Running a website on a personal computer that is not a dedicated server machine has a number of single points of failure to begin with. Personal computers are not designed to run continuously, mostly due to the fact that the hardware components of a personal computer are not duplicated and the redundancy requirements for high availability are not met.

If the hard drive on your personal computer fails, the system will crash and the website will experience serious downtime. The same goes for the computer's power supply. Unexpected failure of any of these components will bring the website down for anything ranging from an hour to days. The period of the downtime depends on the availability of the replacement component and the backup solution implemented. Another major issue with the system design in the provided example is the Internet Service Provider and the connection to the World Wide Web. Your personal computer is relying only on a single source to provide its Internet service and connection. If the Internet Service Provider, for some reason, suddenly experiences huge network problems and your Internet service goes down, the website will also experience serious downtime and you will be losing visitors—and possibly money—with every minute the website is unreachable. Again, the same goes for the electricity supply. You need to provide redundant components in every possible aspect, not just hardware. Redundancy must be provided at all layers, including the networking layer, power supply layer, and also the yet unmentioned application layer.

Nowadays, the majority of modern server systems eliminate hardware single points of failure by duplicating hardware components, but this solution still falls short of eliminating single points of failure in applications, which is one of the main reasons for using a computer cluster implementation. Application-layer redundancy is achieved with computer clusters. A computer cluster is a group of computers running cluster software that enables continuous two-way communication, also called a heartbeat, between cluster members. A heartbeat provides cluster members with information on the status of every cluster member at any given time. Practically, this means that any member of the cluster knows the exact number of members in the cluster it has joined and also knows which cluster members are active or online, in maintenance mode, or offline.
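The heartbeat idea described above can be sketched in a few lines of Python. This is an in-memory illustration only: real cluster software exchanges heartbeats over the network, and the class name, the three-second timeout, and the node names used here are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each cluster member (illustrative only)."""

    def __init__(self, timeout_seconds=3.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed threshold for "offline"
        self._clock = clock                     # injectable so it can be tested
        self._last_seen = {}                    # member name -> time of last heartbeat

    def record_heartbeat(self, member):
        """Note that a heartbeat from `member` has just arrived."""
        self._last_seen[member] = self._clock()

    def online_members(self):
        """Return members whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(
            member for member, seen in self._last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def offline_members(self):
        """Return members whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            member for member, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Each member would call record_heartbeat whenever a message from a peer arrives, and consult online_members and offline_members to decide whether a peer needs to be replaced.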

Computer clusters

A computer cluster is a group of computers joined to work together in order to provide high availability of some services. The services are usually built with a number of applications operating in the so-called application layer. As shown in the example from the previous section, single points of failure are spread across all layers of a system design, and the application layer is one of the critical layers. It is usual for an application to encounter an error, or bug, and stop responding or crash. Such a situation will lead to service downtime and probably financial loss as well, so it is necessary to provide redundancy on the application layer also. This is the reason we need to implement a computer cluster solution into our high-availability system design.

A computer cluster consists of two or more computers. The computers are connected to the local area network. The maximum number of computers in a computer cluster is limited by the cluster software solution implemented but, in general, common cluster solutions support at least 16 cluster members. It is good practice for the cluster members to have the same hardware and specifications, which means that the cluster computers consist of components from the same manufacturer and likely have the same resource specifications.

There are two common types of computer clusters:

Load balancing computer clusters
High-availability computer clusters

Load balancing computer clusters are used to provide better and higher performance of services, and are typically used in science for complex scientific measurements and calculations. Interestingly, the same clusters are also used for websites and web servers facing extremely high load, which helps improve the overall response of the website with load distribution to different cluster nodes.

High-availability computer clusters strive to minimize the downtime of the service provided and not so much to improve the overall performance of the service. The focus of this book is on high-availability clusters. There are many different cluster configurations, mainly depending on the number of cluster members and also on the level of availability you want to achieve.

Some of the different cluster configurations are as follows:

Active/Active: The Active/Active cluster configuration can be used with two or more cluster members. The service provided by the cluster is simultaneously active on all cluster nodes at any given time. The traffic can be passed to any of the existing cluster nodes if a suitable load balancing solution has been implemented. If no load balancing solution has been implemented, the Active/Active configuration can be used to reduce the time it takes to fail over applications and services from the failed cluster node.
Active/Passive: The Active/Passive cluster configuration can be used with two or more cluster members. At a given time, the service is provided only by the current master cluster node. If the master node fails, automatic reconfiguration of the cluster is triggered and the traffic is switched to one of the operational cluster nodes.
N + 1: The N + 1 cluster configuration can be used with two or more cluster members. If only two cluster members are available, the configuration degenerates to the Active/Passive configuration. The N + 1 configuration implies the presence of N cluster members in an Active/Active configuration with one cluster member in backup or hot standby. The standby cluster member is ready to take over the responsibilities of any failed cluster node at any given time.
N + M: The N + M cluster configuration can only be used with more than two cluster members. This configuration is an extension of the N + 1 cluster configuration, where N cluster members are in the Active/Active state and M cluster members are in backup or hot standby mode. This is often used in situations where active cluster members manage many services and two or more backup cluster members are required to fulfill the cluster failover requirements.
N-to-1: The N-to-1 cluster configuration is similar to the N + 1 configuration and can be used with two or more cluster members. If there are only two cluster nodes, this configuration degenerates to Active/Passive. In the N-to-1 configuration, the backup or hot standby cluster member becomes temporarily active while the failed cluster node recovers. When the failed cluster node is recovered, services are failed back to the original cluster node.
N-to-N: The N-to-N cluster configuration is similar to the N + M configuration and can be used with more than two cluster nodes. This configuration is an extension of the N-to-1 configuration and is used in situations where extra redundancy is required on all active nodes.
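The standby-based configurations in the list above share one mechanism: when an active node fails, its services move to a node that was waiting in hot standby. The following Python sketch illustrates that step under simplifying assumptions. The data structures, function name, and node names are hypothetical, and real cluster software also handles fencing, quorum, and service startup, none of which is modeled here.

```python
def fail_over(assignments, standby_nodes, failed_node):
    """Move the services of a failed node to the first free standby node.

    assignments: dict mapping active node name -> list of service names
    standby_nodes: list of idle standby node names (one is consumed)
    Returns the name of the node that took over.
    """
    if failed_node not in assignments:
        raise KeyError(f"{failed_node} is not an active node")
    if not standby_nodes:
        raise RuntimeError("no standby node left to take over")
    replacement = standby_nodes.pop(0)
    # The standby node becomes active and inherits every service of the failed node.
    assignments[replacement] = assignments.pop(failed_node)
    return replacement


if __name__ == "__main__":
    # An N + 1 layout with N = 2 active nodes and one hot standby.
    active = {"node1": ["web"], "node2": ["database"]}
    standby = ["node3"]
    print(fail_over(active, standby, "node1"), active)
```

With a single standby node, a second failure cannot be absorbed, which is the situation the N + M configuration addresses by keeping more than one standby node available.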

The objective of a high-availability computer cluster is to provide uninterrupted and continuous availability of the service provided by the applications running in the cluster. There are many applications available that provide all sorts of services, but not every application can be managed and configured to work in a cluster. You need to make sure that the applications you choose to run in your computer cluster are easy to start, stop, and monitor. In this way, the cluster software can make instant application status checks, starts, and stops.
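What "easy to start, stop, and monitor" means in practice can be sketched as a small wrapper with exactly those three operations. The class below is a hypothetical illustration built on Python's standard subprocess module; it is not the interface of any particular cluster software, and the command it runs in the demonstration is a placeholder.

```python
import subprocess
import sys


class ManagedService:
    """Hypothetical wrapper exposing the three operations a cluster relies on."""

    def __init__(self, command):
        self.command = command  # argument list used to launch the application
        self._process = None

    def start(self):
        """Launch the application unless it is already running."""
        if not self.monitor():
            self._process = subprocess.Popen(self.command)

    def monitor(self):
        """Return True if the application process is currently running."""
        return self._process is not None and self._process.poll() is None

    def stop(self, timeout_seconds=5):
        """Ask the application to terminate, and kill it if it does not exit in time."""
        if self.monitor():
            self._process.terminate()
            try:
                self._process.wait(timeout=timeout_seconds)
            except subprocess.TimeoutExpired:
                self._process.kill()
                self._process.wait()
        self._process = None


if __name__ == "__main__":
    # Placeholder workload: a Python process that simply sleeps.
    service = ManagedService([sys.executable, "-c", "import time; time.sleep(30)"])
    service.start()
    print("running:", service.monitor())
    service.stop()
    print("running:", service.monitor())
```

An application that cannot be driven through simple, quick operations like these is hard for cluster software to supervise, which is why the choice of application matters.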

A computer cluster cannot exist without a shared storage solution. It is quite clear that a cluster is useless if the cluster members have access only to their local storage and data. The cluster members must have access to some kind of shared storage solution that ensures that all cluster members are able to access the same data. Cluster member access to the same storage provides consistency throughout the computer cluster. There are many shared storage solutions available, of which the following are the most commonly used in computer clusters today:

Storage Area Network (SAN): This is a type of block-level data storage. It provides high-speed data transfers and Redundant Array of Inexpensive Disks (RAID) disk redundancy. The most commonly used SAN solutions are big storage racks of disk arrays with hundreds of disks and terabytes of disk space. These storage racks are connected to computers via optical fibers to achieve the highest possible performance. Computers can see the storage as locally attached storage devices and are free to create the desired file system on these devices.
Network-attached Storage (NAS): This is a type of file-level data storage. With a NAS solution, the file system is predefined and forced on the client computers. NAS also provides RAID disk redundancy and the capability to expand to hundreds of disks and terabytes of disk space. As the name suggests, NAS storage is connected to computers through the network and provides access to data using file-sharing protocols such as Network File System (NFS), Server Message Block (SMB), and Apple Filing Protocol (AFP). Computers see NAS storage as network drives that they can map locally.
Distributed Replicated Block Device (DRBD): This is probably the least expensive and most interesting solution. DRBD provides network-based mirrored disk redundancy. It is installed and configured on computers as a service, and those computers can use their existing local storage to mirror and replicate data through the network. DRBD is a software solution that runs on the Linux platform. It was submitted for inclusion in the Linux kernel in July 2007 and was merged into the mainline kernel in December 2009. It is probably the simplest and cheapest to implement into any system design as long as it is Linux-based.

High-availability solutions

Nowadays, there are many different high-availability solutions you can choose from. Some of them are implemented on completely different layers and offer different levels of high availability. Even though some of these solutions are fairly easy to implement and provide a significantly high level of availability, they cannot compete with complete high-availability solutions and fall short compared to them.

Virtualization is the first solution to mention. Virtualization has taken large steps forward in the last decade, and it is now in very wide use. It became popular very quickly, mostly because it significantly lowers the cost of server maintenance, extends the server life cycle, and facilitates management of the servers. Almost every virtualization solution also provides an integrated high-availability option that allows virtualization hosts to be joined into virtualization clusters.

The solution instantly provides high availability of virtual machines running on top of it. Therefore, the virtualization high-availability solution is provided on the lower virtualization layer. However, this kind of high availability might still fall short in certain key points. It only manages to eliminate the hardware single points of failure. It does not implement high availability of the applications and services that the virtual machines are running.

Databases are also one of the components of IT infrastructure with built-in high-availability solutions. Database servers are known to provide mission-critical services, mostly due to the huge amount of important data they usually store. This is why common database server software products offer high-availability features bundled with the database server software. This means that it is possible to implement database high availability at the database application layer. Database server high availability can provide the top application layer with high availability, but it does not cover all the angles required for a complete high-availability solution.

A complete high-availability solution is considered to be a solution that is implemented at the operating system level and includes the application layer, of course backed up by hardware, power supply, and network connection redundancy. This kind of high-availability solution is resistant to hardware and application failures and allows you to achieve the highest possible availability of the service.