Learning OpenStack High Availability

By Rishabh Sharma

About this book

OpenStack is one of the most popular open source cloud computing platforms, and it is mainly used for deploying Infrastructure as a Service (IaaS) solutions. Enabling high availability in OpenStack is a required skill for cloud administrators and cloud engineers in today’s world.

This book helps you to achieve high availability and resiliency in OpenStack. This means clustering, fencing, load balancing, distributed networking, leveraging shared storage, automatic failover, and replication. We start with a basic understanding of what a highly available design is meant to achieve in OpenStack and then cover various ways to achieve high availability in OpenStack through simple step-by-step procedures.

Through hands-on examples, you will develop a solid knowledge of horizontally-scalable, fault-resistant, and highly-available OpenStack clusters and will be able to apply the techniques from this book in your day-to-day projects. This book also sheds light on the principles of application design for high availability, and monitoring for high availability, with examples.

Publication date:
December 2015


Chapter 1. An Introduction to High Availability Concepts

Over the past couple of years, cloud computing has made a significant impact in transforming IT from a niche skill to a key element of enterprise production environments. From an Infrastructure as a Service (IaaS) point of view, cloud computing is much more advanced than mere virtualization; various industries and online businesses have started moving test, staging, and production workloads to IaaS and replacing traditional dedicated resources with on-demand resource models.

OpenStack is one of the most popular and commonly used open source cloud computing platforms, and it is mainly used to deploy Infrastructure as a Service solutions. Enabling high availability in OpenStack is a required skill for cloud administrators and cloud engineers. This chapter will introduce you to high availability concepts and to ways of measuring and achieving high availability through architectural design in OpenStack.

In this chapter, we will cover the following topics:

  • What does high availability mean?

  • How to measure high availability

  • Architecture design for high availability

  • High availability in OpenStack


What does High Availability (HA) mean?

In the IT world, high availability basically means that a system operates continuously (100 percent operational) without any downtime, despite failures in hardware, software, or applications.


How to measure high availability

In order to measure a system's high availability, we usually check the total duration of the uptime of the system. For example, if system availability is 99 percent, it means that the system is operational for 8672.4 hours throughout the year because the total number of hours in a year is 8760.

Availability is calculated from the terms defined below, and it can be increased by using high availability techniques that increase MTTF and decrease MTTR; we are going to discuss these techniques in this book with respect to OpenStack. Let's understand these terms in more detail:

  • Mean time between failures (MTBF): MTBF is the estimated time between two consecutive failures of a repairable process or component.

  • Mean time to failure (MTTF): MTTF is the estimated time for which a system operates before it fails, where repairing is not possible.

  • Mean time to repair or replace (MTTR): MTTR is the average estimated time to repair a failed component or the total replacement time of a failed component.

The formula to calculate availability is this:

Availability = MTTF / (MTTF + MTTR)
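
As a minimal sketch of the formula above, the following Python snippet computes availability from MTTF and MTTR. The function name and the sample figures (a failure every 1,000 hours on average, repaired in 10 hours or in 1 hour) are hypothetical and chosen only for illustration:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: the same MTTF with a slow and a fast repair time.
baseline = availability(mttf_hours=1000, mttr_hours=10)
faster_repair = availability(mttf_hours=1000, mttr_hours=1)

print(f"MTTR of 10 h: {baseline:.4%}")
print(f"MTTR of 1 h:  {faster_repair:.4%}")
```

With MTTF held constant, shortening the repair time raises availability, which is why the techniques in this book aim to increase MTTF and decrease MTTR.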

The following table shows the maximum downtime per day, per month, and per year at each level of availability that a Service Level Agreement (SLA) for high availability between a consumer and a provider may specify:

The level of availability    Downtime per day    Downtime per month    Downtime per year

One 9 (90%)                  144.00 minutes      72 hours              36.5 days
Two 9s (99%)                 14.40 minutes       7 hours               3.65 days
Three 9s (99.9%)             86.40 seconds       43 minutes            8.77 hours
Four 9s (99.99%)             8.64 seconds        4 minutes             52.60 minutes
Five 9s (99.999%)            0.86 seconds        26 seconds            5.26 minutes
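
The downtime figures for each availability level can be derived with simple arithmetic. The sketch below is one way to do it; the function name is hypothetical, and it assumes a 30-day month and a 365-day year, so the yearly values may differ slightly from figures computed with a different year length:

```python
HOURS_PER_DAY = 24
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: 365-day year (8760 hours)


def allowed_downtime_seconds(availability_percent: float, period_hours: float) -> float:
    """Return the downtime budget, in seconds, for the given period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 3600


for level in (90, 99, 99.9, 99.99, 99.999):
    per_day = allowed_downtime_seconds(level, HOURS_PER_DAY)
    per_month = allowed_downtime_seconds(level, HOURS_PER_MONTH)
    per_year = allowed_downtime_seconds(level, HOURS_PER_YEAR)
    print(f"{level}%: {per_day:.2f} s/day, "
          f"{per_month / 60:.2f} min/month, {per_year / 3600:.2f} h/year")
```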

The following are some commonly used terms that are used to understand and measure high availability:

  • Single point of failure (SPOF): This is a part or a particular component of a system whose failure brings the entire system to a halt. Looking for possible single points of failure identifies the critical components of a complex system that would cause a total system failure in case of any breakdown.

  • Recovery time objective (RTO): This is a very important metric for measuring business continuity. RTO determines how long a business process can tolerate a system being unavailable. For high-traffic e-commerce websites and mission-critical businesses, RTO should be zero or near zero.

  • Recovery point objective (RPO): RPO is a very critical metric for any business. It is the maximum amount of data loss that can be tolerated when a system is not available, and it is measured in terms of time.

  • Service level agreement (SLA): SLA is a mutual agreement between consumers and a service provider that defines detailed service offerings, delivery time, the quality of service (QoS), and the scope or constraints of the offered services.
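
To make RTO and RPO concrete, the following sketch checks a single outage against both objectives. The `Outage` class, the function name, and all timestamps are hypothetical; it assumes that downtime is measured from failure to restoration and that data loss is measured as the time between the last recoverable copy of the data and the failure:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    last_good_backup: datetime  # most recent recoverable copy of the data
    failed_at: datetime         # moment the system became unavailable
    restored_at: datetime       # moment the system was serving again


def meets_objectives(outage: Outage, rto: timedelta, rpo: timedelta) -> tuple[bool, bool]:
    """Return (RTO met, RPO met) for one outage."""
    downtime = outage.restored_at - outage.failed_at
    data_loss_window = outage.failed_at - outage.last_good_backup
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical outage: 20 minutes of downtime, 15 minutes of lost data.
outage = Outage(
    last_good_backup=datetime(2015, 12, 1, 11, 45),
    failed_at=datetime(2015, 12, 1, 12, 0),
    restored_at=datetime(2015, 12, 1, 12, 20),
)
rto_met, rpo_met = meets_objectives(outage, rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

In this example the service came back within the 30-minute RTO, but 15 minutes of data were lost against a 5-minute RPO, so more frequent replication or backups would be needed.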


Common content in the contract

Metrics related to performance assurance depend on the following components:

  • RTO and RPO

  • Uptime and downtime ratio

  • System throughput

  • Response time


How to achieve high availability

High availability in a production environment is achieved by removing single points of failure, enhancing the replication, redundancy, and failover capability of the system, and quickly detecting failures as they occur. High availability can be achieved at many different levels, including the infrastructure level, data center level, geographic redundancy level, and even application level. At the infrastructure level, high availability basically includes the following things:

  • Multiple web servers

  • Multiple database servers

  • Multiple load balancers

  • Leveraging different storage systems

Here, multiple web servers means multiple web nodes, multiple load balancers means active/passive load balancers, multiple database servers means replicated DB servers, and leveraging different storage systems means making the best use of all available storage solutions, with a backup plan that provides redundancy at each layer of the configuration.

For advanced HA configurations, we also need to plan for automatic failover and geo-replication for disaster recovery, and we need to design our application for high availability. In short, there are various advanced techniques to achieve high availability, which we will discuss in detail in the following chapters, but the prime objective of each technique is to satisfy the following five principles of high availability, as described in this figure:
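
The effect of redundancy at each layer can be estimated with basic probability. The sketch below rests on two simplifying assumptions that are not stated in the text: nodes fail independently of each other, and any single surviving node in a layer can carry the full load. The function names and the 99 percent per-node figure are hypothetical:

```python
import math


def redundant_availability(node_availability: float, nodes: int) -> float:
    """Availability of a layer that is up while at least one node is up."""
    if not 0 <= node_availability <= 1 or nodes < 1:
        raise ValueError("availability must be in [0, 1] and nodes must be >= 1")
    return 1 - (1 - node_availability) ** nodes


def serial_availability(layer_availabilities: list[float]) -> float:
    """Availability of a stack in which every layer must be up."""
    return math.prod(layer_availabilities)


# Hypothetical stack: load balancers, web servers, and database servers,
# each node 99 percent available, two nodes per layer.
pair = redundant_availability(0.99, nodes=2)
stack = serial_availability([pair, pair, pair])

print(f"one node:         {0.99:.4%}")
print(f"redundant pair:   {pair:.4%}")
print(f"three-layer stack: {stack:.4%}")
```

Under these assumptions, duplicating a node removes it as a single point of failure, while every additional layer that must be up slightly lowers the availability of the whole stack.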

High availability principles


Architecture design for high availability

The following are some common design patterns used to design a high availability architecture. Let's discuss this in detail:

  • Design for failure: This must be the first and foremost concern of a cloud architect. Whenever any company or organization decides to move to a cloud infrastructure solution, it has to plan for failure. If the failure conditions are planned for properly, then failures will be of little or no consequence to the HA services or the resources, and the system will remain available.

  • Decouple your components: All the components in a cloud infrastructure should be decoupled and isolated from each other. For example, whenever we have set up load-balanced applications, web servers have traditionally been tightly coupled with the application servers, but this is not the best practice. Isolation and decoupling are necessary between components. The following figure shows each web server tightly coupled with an application server:

    High coupling

    Web servers and application servers can be loosely coupled and isolated by putting a load balancer between them, as shown in the following figure. Any web or application server can be easily scaled up or scaled down without any dependency.

    Decoupled applications

  • Build security in every layer: While designing a cloud infrastructure, building security in each layer is recommended. As security is the shared responsibility of the consumer and the cloud provider, the following are some steps that a consumer must take to ensure security:

    • Enforce the principle of least privilege when developing applications

    • Encrypt data in transit

    • Try to use multiway and multifactor authentication

  • Design parallel: While designing a cloud infrastructure, a parallel architecture that is fast and efficient should be implemented. Two possible approaches to the same job could be as follows:

    • One server works on a job sequentially for 4 h

    • Four servers work on the job in parallel for 1 h

    There is no difference cost-wise, because both approaches consume four server-hours, but the second approach is likely to be four times faster; therefore, a parallel design is highly recommended.

  • Automate and test everything: When a production environment is set up to handle cloud service failure, you should test your automated processes against hardware, software, and application failures. This kind of automation is typically provided by the cloud infrastructure service provider, which allows you to implement automated failover processes across instances, availability zones, regions, and other clouds. You should automate the data backups as well because in case of any outage or disaster, your data must be immediately ready.

    Any disaster recovery plan can only be successful if it is tested properly to make sure it works. It can be tested by disabling various cloud servers and associated services and directing high loads to your production servers so that you can test the actual capacity of your existing infrastructure. Cloud infrastructure solutions nowadays provide easily configurable disaster recovery and backup services. Even with these popular services, however, organizations cannot effectively enable high availability in the cloud until they implement a proper architecture and use the right management tools.
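
The cost argument behind the parallel design pattern above can be checked with a short calculation. This is a sketch under two assumptions that are not in the text: the job divides evenly across servers with no coordination overhead, and servers are billed per server-hour. The function name is hypothetical:

```python
def job_plan(total_work_server_hours: float, servers: int) -> tuple[float, float]:
    """Return (wall-clock hours, billed server-hours) for an evenly divisible job."""
    if servers < 1:
        raise ValueError("at least one server is required")
    wall_clock_hours = total_work_server_hours / servers
    billed_server_hours = wall_clock_hours * servers
    return wall_clock_hours, billed_server_hours


# A job that needs four server-hours of work in total.
sequential = job_plan(total_work_server_hours=4, servers=1)
parallel = job_plan(total_work_server_hours=4, servers=4)

print(f"1 server:  {sequential[0]} h elapsed, {sequential[1]} server-hours billed")
print(f"4 servers: {parallel[0]} h elapsed, {parallel[1]} server-hours billed")
```

Real workloads rarely divide perfectly, so the speedup in practice is usually somewhat less than the number of servers.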


High availability in OpenStack

Many organizations go with the OpenStack cloud nowadays to provide an Infrastructure as a Service environment, since OpenStack has excellent hybrid cloud features and can leverage the scale-out capability of public clouds through APIs. Therefore, OpenStack is quite capable of handling the workload of mission-critical applications.

OpenStack provides an easy-to-control dashboard to manage networking, storage, and computing resources, and it lets users provision computing resources. The following figure illustrates the architecture of OpenStack:

An OpenStack component

Organizations that run their production applications over OpenStack generally try to achieve five 9s (99.999 percent) of availability. Hence, the failure of any running controller node or service should not create any kind of disruption to any application running on the resources provided or managed by OpenStack.

In short, achieving high availability in OpenStack means removing all single points of failure, implementing redundancy and replication of components, and providing failover capability and automation. Also, resources should be scaled out or scaled in according to the real-time workload. We can achieve high availability in OpenStack in the following ways:

  • Enabling multimaster database replication and building an efficient and reliable messaging cluster that forms the basis of any highly available OpenStack deployment

  • Setting up open source load balancing software such as HAProxy and keepalived to load balance HTTP REST APIs, MySQL, and AMQP clusters

  • Enabling classical clustering methods, such as Pacemaker with two or more nodes, to control active/passive OpenStack services

  • Enabling OpenStack services in stateless mode to offer a scalable cloud framework with scaling and resiliency of all the basic OpenStack services: compute, image, storage, object storage, and dashboard

  • Leveraging third-party networking drivers that offer high availability options

  • Enabling distributed networking with Neutron Distributed Virtual Routers and configuring multiple L3 agents in an active/passive configuration

  • Leveraging software-defined storage such as GlusterFS and Ceph, traditional enterprise storage such as NFS and iSCSI for quick recovery after a failure, and backup services

  • Enabling automatic failover and geo-replication, with swift and effective recovery techniques for situations such as network partitioning and split brain
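
The active/passive idea that runs through several of these points can be illustrated with a toy selection function. This is only a conceptual sketch, not how Pacemaker, keepalived, or any OpenStack component is implemented; real cluster managers also handle quorum, fencing, and split-brain protection. The node names and the health table are hypothetical:

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical controller nodes, listed in priority order.
controllers = ["controller1", "controller2", "controller3"]
health = {"controller1": True, "controller2": True, "controller3": True}

before_failure = pick_active(controllers, lambda node: health[node])

# Simulate the failure of the active node; the next healthy node takes over.
health["controller1"] = False
after_failure = pick_active(controllers, lambda node: health[node])

print(f"active before failure: {before_failure}")
print(f"active after failure:  {after_failure}")
```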



Summary

In this chapter, we learned the fundamentals of high availability and what a highly available design is meant to achieve in a production environment.

We have also learned about SPOF, RTO, RPO, MTBF, MTTF, MTTR, the concept of SLA, and the general architectural design of a highly available system, along with an overview of OpenStack high availability requirements and the various ways of achieving high availability in OpenStack.

In the next chapter, we will take a deep dive into database replication and how to build an efficient and reliable messaging cluster to enable high availability in OpenStack.

About the Author

  • Rishabh Sharma

    Rishabh Sharma is currently working as a chief technology officer (CTO) at JOB Forward, Singapore. Prior to working for JOB Forward, he worked for Wipro Technologies, Bangalore, as a solution delivery analyst. He was involved in research projects on cloud computing, proofs of concept (PoC), infrastructure automation, big data solutions, and various large customer projects related to cloud infrastructure and application migration.

    In a short span of time, he has worked on various technologies and tools such as Java/J2EE, SAP(ABAP), AWS, OpenStack, DevOps, big data, and Hadoop. He has also authored many research papers in international journals and IEEE journals on a variety of issues related to cloud computing.

    He has authored five technical books until now. He recently published two books with international publications:
