You're reading from Mastering Proxmox - Third Edition

Product type: Book
Published in: Nov 2017
Publisher: Packt
ISBN-13: 9781788397605
Edition: 3rd Edition
Author (1)
Wasim Ahmed

Wasim Ahmed, born in Bangladesh and now a citizen of Canada, is a veteran of the IT world. He first came into close contact with computers in 1992 and never looked back. Wasim has a deep understanding of networks, virtualization, big data storage, and network security. By profession, Wasim is the CEO of a global IT support and cloud service provider based in Calgary, Alberta. He serves many companies and organizations through his company on a daily basis. Wasim's strength comes from his experience, which comes from learning and serving continually. Wasim strives to find the most effective solution at the most competitive price. He has built over 20 enterprise production virtual infrastructures using Proxmox and the Ceph storage system. Wasim and his team are notorious for not simply accepting a technology based on its description alone, but putting it through rigorous testing to check its validity. Any new technology that his company provides goes through months of continuous testing before it is accepted. Proxmox made the cut superbly.

Chapter 16. Rescuing Proxmox

Whether we want to accept it or not, a network environment is always at risk of something going wrong. Even if we take hardware and software out of the equation, there is always the human factor. Sometimes all it takes is one small mistake that snowballs rapidly into something major. A well-thought-out disaster plan goes a long way toward containing such a situation, and sometimes quick, on-the-ball thinking can save the day.

As we approach the end of the book, this concluding chapter looks at some situations where things went wrong and what to do when the same happens in the virtual environment you are part of. Like Chapter 15, Proxmox Troubleshooting, these are not all-inclusive scenarios. You may, or will, come across other situations that are not covered in this chapter. As a good administrator, you can expand on this through your own documentation, but we hope we were able to put together some critical situations that you may face...

Recovering from OS drive failure


OS drive failure is one of the critical failures in which a node becomes fully inaccessible. Since Proxmox stores all cluster-related configuration files on the Proxmox Cluster file system (pmxcfs), which is replicated to every node in the cluster, no cluster data is lost even when the OS drive of one node fails completely. Refer to Chapter 3, Proxmox under the Hood, to recap details on pmxcfs. There are two main types of OS drive failure:

  • Physical drive failure
  • OS data corruption

Physical drive failure

This failure occurs when the physical drive itself becomes completely unusable or defective. In this scenario, the only option is to replace the damaged drive with a new one and perform a clean installation of Proxmox VE on it. One way to prevent downtime due to physical drive failure is to use two physical drives for the OS in mirror mode. During Proxmox installation, we can select the Advanced option to create a ZFS mirror on two physical drives. This way, when one drive becomes physically damaged, it does not cause any downtime since there is...

Recovering from a quorum failure


There are various reasons why a Proxmox cluster can lose quorum. For the cluster to operate correctly, a quorum must exist among the nodes. A quorum is established when the majority of the nodes are online. If half or more of the nodes go offline for whatever reason, the remaining nodes no longer form a majority and quorum is lost, resulting in a cluster error. A Proxmox quorum relies on multicast, so if multicast gets disabled in the switch, the cluster can also lose quorum. A manual misconfiguration in the cluster file can also cause loss of quorum. When quorum is lost, the following error messages will appear in log files under /var/log/corosync:

......................
corosync[9999]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[9999]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
    'configuration error: nodelist or quorum.expected_votes must be configured!'
......................

The previous error may be because the hostname of the node could...
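
The majority rule described at the start of this section can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the default of one vote per node; the function names votes_needed and has_quorum are hypothetical and are not part of any Proxmox or Corosync tool.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if total_votes < 1:
        raise ValueError("total_votes must be at least 1")
    return total_votes // 2 + 1


def has_quorum(total_votes: int, online_votes: int) -> bool:
    """True when the online votes form a strict majority of all votes."""
    return online_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    # Assumption: every node carries exactly one vote.
    for total in (3, 4, 5):
        print(f"{total} nodes: quorum needs {votes_needed(total)} online")
    # A 4-node cluster with exactly half of its nodes offline.
    print("4 nodes, 2 online, quorate:", has_quorum(4, 2))
```

The sketch makes one practical point visible: a cluster with an even number of nodes loses quorum as soon as exactly half of them are offline, which is one reason odd node counts are commonly preferred.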

Recovering from a node failure


A Proxmox node can physically fail due to the failure of a hardware component such as the motherboard, CPU, memory, or power supply, while the OS drive remains intact. In such a scenario, we can simply move the OS drive to a different node and power it up. The new node does not need to be identical to the faulty one at all. Since the network interface may be different, we only need to ensure that the network configuration is set for the proper interface. Also, if the Proxmox OS has a paid subscription, the key will need to be reissued. Contact the seller from whom the subscription was purchased, or Proxmox directly, to get the subscription key reissued.

The subscription key is bound to the hardware, so the key must be reissued to bind it to the new hardware. It is important to note that the CPU count matters when moving the OS drive from one node to another with a paid subscription. A Proxmox subscription key purchased...
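
For the network configuration step mentioned earlier in this section, the usual change is to replace the old interface name with the one the new hardware exposes. The following is a minimal sketch of that substitution on the text of a Debian-style interfaces file. The interface names eth0 and enp3s0, the addresses, and the helper rename_interface are all assumptions chosen for illustration; check the real names on the new node before editing anything.

```python
import re

# Hypothetical sample in the style of /etc/network/interfaces.
SAMPLE_CONFIG = """\
auto lo
iface lo inet loopback

iface eth0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0
"""


def rename_interface(config_text: str, old_name: str, new_name: str) -> str:
    """Replace whole-word occurrences of old_name with new_name.

    The lookarounds stop 'eth0' from matching inside 'eth01', while a
    VLAN suffix such as 'eth0.10' is still renamed along with its parent.
    """
    pattern = re.compile(rf"(?<![\w-]){re.escape(old_name)}(?![\w-])")
    return pattern.sub(lambda _match: new_name, config_text)


if __name__ == "__main__":
    # Print the updated text for review instead of writing it to disk.
    print(rename_interface(SAMPLE_CONFIG, "eth0", "enp3s0"))
```

The sketch only returns the new text. Reviewing the result and keeping a copy of the original file before replacing it is the safer workflow, since a wrong interface name can leave the node unreachable over the network.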

Recovering from a network failure


A network failure can span multiple layers, interrupting communication between the Proxmox node and the user, or between the storage node and the Proxmox nodes. The failure can be caused by a faulty physical network interface or by a network cable accidentally pulled from a node. It can also be caused by heavy network traffic, for example from a backup task running on the same network path. In most production environments, server nodes contain more than one network interface for redundancy, to keep the loss of network connectivity to a minimum. The three most common scenarios for network connectivity interruptions are explained in the following sections.

Loss of connectivity between Proxmox nodes

In this scenario, network connectivity is only interrupted between Proxmox nodes in a cluster. When over half of the Proxmox nodes in a cluster cannot communicate with each other, a quorum cannot be established. If...
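
To see which side of a network split, if any, can still establish a quorum, the groups of nodes that can reach each other can be compared against the cluster size. The following is a minimal sketch, assuming one vote per node; the node names and the helper quorate_partition are hypothetical.

```python
from typing import List, Optional, Set


def quorate_partition(partitions: List[Set[str]]) -> Optional[Set[str]]:
    """Return the group of nodes holding a strict majority, or None.

    Each set holds the nodes that can still communicate with each other.
    Assumption: every node carries exactly one vote.
    """
    total_nodes = sum(len(group) for group in partitions)
    for group in partitions:
        # Integer comparison avoids floating-point division.
        if 2 * len(group) > total_nodes:
            return group
    return None


if __name__ == "__main__":
    # Five nodes split three against two: the larger side stays quorate.
    print(quorate_partition([{"pve1", "pve2", "pve3"}, {"pve4", "pve5"}]))
    # Four nodes split evenly: neither side has a majority.
    print(quorate_partition([{"pve1", "pve2"}, {"pve3", "pve4"}]))
```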

Recovering from Ceph failure


Ceph is a very resilient, highly available storage system. Once a Ceph cluster is configured, it can, for the most part, run maintenance free. In most cases, major issues stem from a lack of knowledge of how Ceph works, causing cluster-side interference. In this section, we will highlight some of the most common issues in a Ceph cluster and how to combat them.

Best practices for a healthy Ceph cluster

The following are a few best practices to keep a Ceph cluster running healthy:

  • If possible, keep all settings to default for a healthy cluster.
  • Use Ceph pools only to implement different OSD type policies, such as one pool for SSDs and another for HDDs, and not for multitenancy.
  • Do not make frequent Ceph configuration changes. Each change adds extra workload on the cluster OSDs, reducing the life of HDDs. After each change, let the cluster rebalance data before making new changes.
  • Always keep in mind the core count of Ceph nodes when adjusting Ceph threads. Do not let the number of...
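
The advice above to let the cluster rebalance before making another change can be enforced with a simple wait loop. The following is a minimal sketch that polls the `ceph health` command until it reports HEALTH_OK. It assumes the ceph CLI is available on the node running the script. The function names, the polling interval, and the timeout are hypothetical choices, and HEALTH_OK is used here only as a rough proxy for "rebalancing has finished", since unrelated warnings also keep the cluster out of that state.

```python
import subprocess
import time
from typing import Callable


def ceph_health() -> str:
    """Return the first word printed by `ceph health`, e.g. HEALTH_OK."""
    result = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    )
    words = result.stdout.split()
    return words[0] if words else ""


def wait_until_healthy(
    get_health: Callable[[], str] = ceph_health,
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
) -> bool:
    """Poll get_health until it returns HEALTH_OK or the timeout expires.

    Returns True if HEALTH_OK was seen, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_health() == "HEALTH_OK":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    if wait_until_healthy():
        print("Cluster is healthy; safe to apply the next change.")
    else:
        print("Timed out waiting for HEALTH_OK; investigate before continuing.")
```

Passing the health check in as a parameter keeps the loop testable without a live cluster: any callable that returns a status string can stand in for ceph_health.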

Summary


In this chapter, we saw some of the most common scenarios in which things can go wrong, along with some steps to recover from them. By no means are these the only issues that can bring a cluster down. This list should be expanded through proper documentation as new issues surface and solutions are found.

No amount of reading or study can equal hands-on experience with Proxmox. You may already be a professional in the virtualization field, or you may be just starting out on a networking career and looking for a way to stand out from the crowd; either way, we hope this book will push you in the right direction. Besides the official site and forum, you can also reach out to the author directly to ask questions or to have a discussion, through the author-maintained forum at http://www.masteringproxmox.com/.

 
