Week 6
Cluster and High Availability
OPS3 - Virtualization and Cloud Infrastructure
Welcome to Week 6!
What You'll Learn This Week
1. Creating a Cluster
1.1 Initialization
- When you initialize a cluster, Proxmox performs several critical actions.
- It generates a cryptographic key (/etc/corosync/authkey) to secure communication and creates the cluster configuration file (/etc/pve/corosync.conf).
- This file, stored in the shared /etc/pve configuration filesystem, is essentially the "source of truth" for the entire cluster.
1.2 Joining a Node
- Adding a second server is a "join" operation, not a creation operation.
- You instruct the new node to connect to the existing ring.
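- The new node authenticates with the root password of an existing cluster node (the GUI supplies the peer address and certificate fingerprint through its "Join Information" string), downloads the cluster keys and configuration files, and restarts its local services to synchronize with the cluster.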
- Crucial Requirement: For a cluster to function correctly, every node must have a unique Hostname and a persistent Static IP address.
- If an IP address changes after the cluster is formed, Corosync communication will break, causing the node to lose quorum and effectively disconnecting it from the datacenter.
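The uniqueness requirement above can be checked before any node is joined. The following is a minimal sketch, not a Proxmox tool: the function name find_identity_conflicts and the sample hostnames and addresses are hypothetical, and it only inspects a planned node list rather than a live cluster.

```python
import ipaddress
from collections import Counter


def find_identity_conflicts(nodes):
    """Return a list of problems found in a planned [(hostname, ip), ...] list."""
    problems = []
    # Hostnames are compared case-insensitively.
    names = Counter(name.lower() for name, _ in nodes)
    addrs = Counter()
    for name, ip in nodes:
        try:
            addrs[ipaddress.ip_address(ip)] += 1
        except ValueError:
            problems.append(f"{name}: '{ip}' is not a valid IP address")
    problems += [f"duplicate hostname: {n}" for n, c in names.items() if c > 1]
    problems += [f"duplicate IP address: {str(a)}" for a, c in addrs.items() if c > 1]
    return problems


# Hypothetical plan: pve3 accidentally reuses the address of pve2.
planned = [("pve1", "10.0.0.11"), ("pve2", "10.0.0.12"), ("pve3", "10.0.0.12")]
print(find_identity_conflicts(planned))
```

An empty list means the plan has no duplicate hostnames or addresses; it says nothing about whether the addresses are actually configured statically on the hosts.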
Section 1 Checkpoint
Summary:
- pvecm is the primary command-line tool for managing the cluster lifecycle, wrapping the underlying Corosync engine.
- Cluster Requirements are strict: nodes must have unique hostnames, static network configurations, and a reliable low-latency network connection.
- Joining involves a node authenticating to an existing cluster member to download shared keys and configuration state.
Reflection:
- Why does Proxmox use SSH keys for cluster communication alongside Corosync keys?
- What is the impact on the cluster configuration file /etc/pve/corosync.conf if you change a node's IP address without updating it?
Resources:
- Proxmox VE Cluster Manager
2. Quorum: The Rule of Majority Algorithm
2.1 The Split Brain Condition
"Split Brain" is a catastrophic failure state in a clustered environment where network communication is severed between nodes, yet the nodes themselves remain operational.
Consider a two-node cluster (Node A and Node B) where the heartbeat connection fails:
- The Divergence: Node A cannot see Node B and assumes Node B has failed. Simultaneously, Node B assumes Node A has failed.
- The Conflict: Both nodes promote themselves to "Master" status and attempt to take ownership of the same resources (e.g., VM ID 100).
- The Consequence: Both nodes mount the same shared storage volume and attempt to write data concurrently.
- The Result: Since they are unaware of each other's write operations, they overwrite each other's filesystem journals, leading to irreversible data corruption within milliseconds.
Figure 3: Split Brain Scenario - A network cut leads to dual active masters and, without quorum logic, to data corruption
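The two-node scenario can be reduced to a toy model. This sketch is purely illustrative (the function naive_takeover is hypothetical and is not how Proxmox behaves); it shows what happens when each node decides on its own, with no majority rule, that a silent peer must be dead.

```python
def naive_takeover(can_see_peer):
    """Without a majority rule, a node that loses sight of its peer takes over."""
    return not can_see_peer


# The heartbeat link is cut: neither node can see the other, yet both still run.
owners_of_vm_100 = [
    node for node in ("Node A", "Node B") if naive_takeover(can_see_peer=False)
]
print(owners_of_vm_100)  # ['Node A', 'Node B']
```

Two owners for the same VM is exactly the Split Brain condition: each node believes it is the only survivor.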
2.2 Quorum Logic
To prevent Split Brain, the cluster stack (Corosync, managed through pvecm) enforces a strictly democratic requirement: operations can only proceed if a strict majority of votes is present. The formula for this is floor(Total Votes / 2) + 1, where each node contributes one vote by default.
- In a 2-Node Cluster, there are 2 total votes.
- The majority needed is (2/2) + 1 = 2.
- This implies that if a single node fails, the survivor has only 1 vote.
- Since 1 is less than 2, Quorum is lost.
- The surviving node essentially "locks down," forcing the cluster configuration filesystem (/etc/pve) into read-only mode to prevent any possibility of corruption.
- In a 3-Node Cluster, there are 3 total votes.
- The majority needed is floor(3/2) + 1 = 1 + 1 = 2 (the division is rounded down before 1 is added).
- If one node fails, the remaining two nodes have 2 votes.
- Since 2 equals 2, Quorum is maintained, and the cluster remains fully operational.
- This highlights the architectural best practice of always designing clusters with an ODD number of nodes (3, 5, 7) to allow for reliable tie-breaking.
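The arithmetic above can be tabulated for a few cluster sizes. This is a minimal sketch that assumes one vote per node and uses integer division so that the result is a whole number of votes; the function names are illustrative.

```python
def votes_needed(total_votes):
    """Strict majority: floor(total / 2) + 1."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes):
    """How many single-vote nodes can fail while the rest stay quorate."""
    return total_votes - votes_needed(total_votes)


for n in range(2, 8):
    print(f"{n} nodes: need {votes_needed(n)} votes, "
          f"survives {tolerated_failures(n)} failure(s)")
```

The table shows why odd sizes are preferred: a 4-node cluster survives only one failure, the same as a 3-node cluster, so the extra node adds cost without adding fault tolerance.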
Section 2 Checkpoint
Summary:
- Quorum enforces the "Rule of Majority" using the formula floor(Total/2)+1 to ensure only one part of a partitioned cluster remains active.
- Split Brain occurs when disconnected nodes both attempt to become Master, leading to guaranteed data corruption.
- Safety Mechanism: If Quorum is lost, the cluster automatically locks down to Read-Only mode to preserve data integrity.
Reflection:
- Why is a 2-node cluster considered "dangerous" without an external vote (QDevice)?
- Does a "Majority" mean 51% (more than half) or exactly half?
3. High Availability (HA) Manager
3.1 Architecture Components
The HA system is composed of two primary agents that work in tandem to maintain service availability.
- The Cluster Resource Manager (pve-ha-crm) acts as the "Cluster Manager" or the "Boss." It runs as a single active instance on the current master node.
- Its job is to maintain the state of the cluster and make high-level decisions about where services should live.
- If the node running the active CRM fails, the cluster automatically elects a new master to take over this role.
- The Local Resource Manager (pve-ha-lrm) acts as the "Worker." An instance runs on every single node in the cluster.
- It receives orders from the CRM to start or stop services and reports the status of local resources back to the master.
- It is responsible for the actual execution of service management commands on the local hypervisor.
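The division of labour between the two agents can be sketched as a toy model. Everything here is an assumption for illustration: the class names, the node names, and especially the placement policy (fewest running services first) are hypothetical and do not reproduce the real pve-ha-crm scheduler.

```python
class LocalResourceManager:
    """Worker: one per node, executes orders and reports local state."""

    def __init__(self, node):
        self.node = node
        self.running = set()

    def start(self, service):
        self.running.add(service)

    def status(self):
        return sorted(self.running)


class ClusterResourceManager:
    """Boss: single active instance, decides where services should live."""

    def __init__(self, lrms):
        self.lrms = {lrm.node: lrm for lrm in lrms}

    def place(self, service):
        # Hypothetical policy: pick the node with the fewest running services,
        # breaking ties by node name.
        target = min(self.lrms.values(),
                     key=lambda lrm: (len(lrm.running), lrm.node))
        target.start(service)  # the CRM only gives the order; the LRM executes it
        return target.node


crm = ClusterResourceManager(
    [LocalResourceManager(n) for n in ("pve1", "pve2", "pve3")]
)
placements = [crm.place(s) for s in ("vm:100", "vm:101", "vm:102", "vm:103")]
print(placements)  # ['pve1', 'pve2', 'pve3', 'pve1']
```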
3.2 Fencing Mechanism
The HA mechanism relies on absolute certainty. Before the cluster can steal VMs from a non-responsive node, it must be 100% sure that the node is truly dead.
- If Node A stops responding to heartbeats, Node B cannot know if Node A has crashed or if just the network cable was unplugged.
- If Node B starts Node A's VMs while Node A is still running them, both nodes would attempt to write to the same virtual disks simultaneously, guaranteeing severe data corruption.
- To solve this, we use Fencing, often referred to by the acronym STONITH (Shoot The Other Node In The Head).
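- Upon detecting a failure, a classic STONITH setup issues a command to a physical hardware device (like an IPMI controller or a Smart PDU) to cut power to the faulty node. Proxmox VE by default relies on a watchdog timer instead, which resets a node once it can no longer reach the quorum.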
- This guarantees the node is dead.
- Only after this confirmation does the cluster restart the VMs on healthy nodes.
Figure 5: The Fencing Process - How the cluster physically isolates a failed node before recovering its workloads
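The ordering rule (fence first, recover second) can be expressed as a guard. This is a minimal sketch under stated assumptions: recover_services, NotFencedError, and the round-robin placement are hypothetical names and policies chosen for illustration, not part of the Proxmox HA stack.

```python
class NotFencedError(RuntimeError):
    """Raised when recovery is attempted before the node is confirmed dead."""


def recover_services(failed_node, services, fenced_nodes, healthy_nodes):
    """Return a {service: new_node} plan, but only for a fenced node."""
    if failed_node not in fenced_nodes:
        # A silent node may still be running its VMs; restarting them
        # elsewhere now would risk two writers on the same disks.
        raise NotFencedError(f"{failed_node} is silent but not confirmed dead")
    # Hypothetical policy: spread services round-robin over healthy nodes.
    return {
        svc: healthy_nodes[i % len(healthy_nodes)]
        for i, svc in enumerate(services)
    }


try:
    recover_services("pve1", ["vm:100"], fenced_nodes=set(), healthy_nodes=["pve2"])
except NotFencedError as err:
    print("refused:", err)

plan = recover_services(
    "pve1", ["vm:100", "vm:101", "vm:102"],
    fenced_nodes={"pve1"}, healthy_nodes=["pve2", "pve3"],
)
print(plan)
```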
Section 3 Checkpoint
Summary:
- High Availability (HA) automates the recovery of services by restarting VMs on healthy nodes after a hardware failure.
- CRM and LRM act as the "Manager" and "Worker" services, respectively, to orchestrate the monitoring and recovery process.
- Fencing (STONITH) is the essential safety mechanism that physically powers off a non-responsive node to prevent Split Brain before recovery begins.
Reflection:
- Why is Fencing (STONITH) safer than just assuming a silent node is down?
- Can you have High Availability without Shared Storage? (Consider the implications of ZFS Replication).
4. Troubleshooting the Cluster
4.1 Check Quorum
- The first step in any cluster diagnosis is to verify the voting state.
- Run pvecm status to see the cluster's health from the perspective of the local node.
- Key fields to observe are Total votes (the votes currently present, compared against Expected votes) and Quorate.
- If Quorate is No, the cluster has lost its majority and will block any changes to the configuration database (pmxcfs) to prevent split-brain, effectively locking the cluster into a read-only mode.
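Checking these fields can be scripted. The sketch below parses an abridged, illustrative sample of pvecm status output; the exact layout is an assumption and may differ between Proxmox versions, and the helper names parse_status and is_quorate are hypothetical.

```python
# Abridged, illustrative sample; real output contains more fields.
SAMPLE = """\
Quorum information
------------------
Nodes:            3
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Total votes:      3
Quorum:           2
Flags:            Quorate
"""


def parse_status(text):
    """Collect 'Key: value' lines into a dict, skipping headings and rulers."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_quorate(fields):
    return fields.get("Quorate", "").lower() == "yes"


fields = parse_status(SAMPLE)
print(is_quorate(fields), fields["Total votes"], fields["Quorum"])  # True 3 2
```

On a real node the text would come from running the command itself, for example by capturing its standard output with subprocess.run.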
4.2 Check Corosync
- If nodes are not syncing but the network appears up, the issue often lies with Corosync latency.
- Use systemctl status corosync to check the service health.
- The logs (for example, journalctl -u corosync) will reveal if the "token retransmit time" is being exceeded.
- Corosync requires extremely low latency (typically < 2ms) to function correctly.
- High-latency links, such as Wi-Fi or 1Gbps uplinks saturated during backups, often cause Corosync to drop packets and falsely declare nodes dead.
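A quick way to reason about this is to compare measured round-trip times against the latency figure quoted above. This is a minimal sketch: the 2 ms budget comes from this section, while the function assess_link and both sample series are hypothetical values made up for illustration.

```python
from statistics import mean

LATENCY_BUDGET_MS = 2.0  # figure quoted in this section


def assess_link(rtt_ms):
    """Summarize round-trip samples (in milliseconds) against the budget."""
    worst = max(rtt_ms)
    return {
        "mean_ms": round(mean(rtt_ms), 2),
        "worst_ms": worst,
        "within_budget": worst < LATENCY_BUDGET_MS,
    }


# Hypothetical samples: an idle link versus the same link during a backup.
quiet = [0.21, 0.25, 0.19, 0.30]
during_backup = [0.24, 3.80, 7.10, 0.90]

print(assess_link(quiet))
print(assess_link(during_backup))
```

The sketch judges the link by its worst sample rather than its average, because a single slow token round is enough for a node to be suspected dead.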
4.3 Force Quorum (Emergency Only)
In a catastrophic scenario where you have a 2-node cluster and one node permanently fails, the survivor will lose quorum (1 vote < 2 required). To recover management capability on the survivor, you can artificially lower the expected vote count.
Warning: The command pvecm expected 1 tells the survivor, "Pretend we only expected 1 vote." This allows it to become quorate alone. You must only do this if you are absolutely certain the other node is dead. If the other node comes back online while this is active, you will cause a Split Brain scenario.
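The effect of lowering the expected vote count can be shown with the quorum arithmetic from Section 2. This is a minimal sketch assuming one vote per node; is_quorate is an illustrative helper, and the command list is only constructed here, never executed.

```python
def is_quorate(votes_present, expected_votes):
    """A partition is quorate when it holds a strict majority of expected votes."""
    return votes_present >= expected_votes // 2 + 1


# Two-node cluster, one node permanently lost: 1 vote present, 2 expected.
print(is_quorate(votes_present=1, expected_votes=2))  # False

# After lowering the expectation to 1, the survivor is quorate on its own.
print(is_quorate(votes_present=1, expected_votes=1))  # True

# The corresponding CLI call, kept as data rather than run from this sketch.
FORCE_QUORUM_CMD = ["pvecm", "expected", "1"]
```

The same arithmetic explains the danger: if the "dead" node is in fact alive and also expects 1 vote, both sides evaluate as quorate at the same time.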
Section 4 Checkpoint
Summary:
- pvecm status is the primary diagnostic tool for assessing voting health and determining if the cluster is Quorate.
- Corosync Latency is the most common cause of instability; high latency triggers false failure detection.
- Forcing Quorum (expected 1) is a destructive emergency measure to recover a surviving node in a broken cluster.
Reflection:
- Why is latency (Ping time) so critical for Corosync compared to bandwidth?
- What does "Quorate: No" actually mean for your ability to start, stop, or migrate VMs?
Resources:
- Clusterlabs Troubleshooting
5. Live Migration CLI
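As a minimal sketch of the CLI this section refers to, the helper below assembles a qm migrate invocation with the two flags covered in this section's checkpoint. The helper name build_migrate_command, the VM ID 100, and the target node pve2 are hypothetical; the command is only built as a list, not executed.

```python
def build_migrate_command(vmid, target_node, online=True, with_local_disks=False):
    """Build the argument list for migrating a VM to another cluster node."""
    cmd = ["qm", "migrate", str(vmid), target_node]
    if online:
        cmd.append("--online")            # keep the VM running during transfer
    if with_local_disks:
        cmd.append("--with-local-disks")  # also copy disks on local storage
    return cmd


print(" ".join(build_migrate_command(100, "pve2", with_local_disks=True)))
# qm migrate 100 pve2 --online --with-local-disks
```

Building the command as a list keeps each argument separate, which avoids shell quoting problems if it is later passed to subprocess.run on a cluster node.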
Section 5 Checkpoint
Summary:
- Live Migration moves active RAM state between nodes, allowing hardware maintenance without service interruption.
- --online ensures the VM remains responsive during the transfer; without it, only a stopped VM can be moved (offline migration).
- --with-local-disks enables migrations even without shared storage by copying the disk image alongside the RAM, though this takes significantly longer.
Reflection:
- What happens to the VM if the network cable is unplugged during the RAM copy phase of a migration?
- Why must the CPU Type (for example, kvm64 versus host) be chosen carefully in heterogeneous clusters?
Resources:
- QEMU Migration Documentation
6. Enterprise Shared Storage Architectures
6.1 Distributed Storage: Ceph (Advanced)
While ZFS is the gold standard for local storage, modern data centers often span multiple servers. Ceph is a massively scalable, distributed, self-healing storage system that runs across a cluster of Proxmox nodes.
6.1.1 Architecture Components
Ceph is not just software; it is a living ecosystem made of daemons:
- OSD (Object Storage Daemon): The workhorse. One OSD runs per physical disk. It handles reading, writing, and replicating data.
- MON (Monitor): The brain. It maintains the "Cluster Map"—the master list of which nodes are alive and where data lives. You usually need at least 3 MONs for quorum.
- MGR (Manager): Collects metrics and state for the GUI dashboard.
6.1.2 Implementation in Proxmox (HCI)
Proxmox VE is unique because it integrates Ceph directly into the hypervisor (Hyper-Converged Infrastructure). You do not need external storage servers. The architecture diagram below shows how OSDs, MONs, and MGRs work together across a Ceph cluster:
Figure 8: Ceph Distributed Storage - OSDs manage disks, MONs maintain cluster maps, and MGRs collect metrics across multiple nodes
6.2 External Shared Storage (SAN & NAS)
While Ceph is great for internal storage, many enterprises already have massive external storage arrays (SANs). Proxmox connects to these using standard protocols.
- Network Attached Storage (NAS): Uses NFS or SMB. The storage array manages the filesystem. Proxmox simply mounts a folder. It's easy, but effectively "Serial" (files are locked individually).
- Storage Area Network (SAN): Uses iSCSI or Fibre Channel. Proxmox sees a raw block device over the network.
- Parallel / Cluster File Systems: To allow multiple Proxmox nodes to mount the same SAN LUN simultaneously and write to it without corrupting data, we use a Clustered File System like GFS2 (Global File System 2) or OCFS2.
- Locking: These systems use a specialized Distributed Lock Manager (DLM) to ensure that if Node A is writing to a file, Node B knows about it instantly.
- LVM-Shared: Alternatively, Proxmox often uses LVM on top of iSCSI in "Shared Mode" to manage raw disk volumes for VMs without a full filesystem layer.
Section 6 Checkpoint
Summary:
- Ceph (HCI): Distributed, self-healing storage on compute nodes (3+ nodes required).
- SAN/NAS: External storage arrays. Block (iSCSI/FC) vs File (NFS).
- Cluster FS: GFS2/OCFS2 needed for simultaneous shared writes.
Reflection:
- Why is a fast network (10GbE or better) considered essential for Ceph?
- What happens if two servers write to a standard ext4 non-clustered disk at the same time?
7. Proxmox Backups (VZDump)
7.1 Backups (VZDump) vs Snapshots
- Snapshot: A point-in-time "difference file" linked to the original disk. It depends on that disk, so it is lost if the original storage is lost.
- Backup (VZDump): A comprehensive, independent archive (config + compressed disk data, e.g., .vma.zst). It can be moved offsite for disaster recovery.
Figure 9: Snapshot vs. Backup - Snapshots are dependent save points for testing; Backups are independent archives for disaster recovery
7.2 Proxmox Backup Modes
When performing a backup, the state of the VM determines the consistency of the data.
Figure 10: Proxmox Backup Modes - Live (Snapshot), Suspend (Frozen), and Stop (Consistent) modes balance uptime vs. data consistency
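The three modes map onto the --mode option of the vzdump command. The helper below is a minimal sketch that only assembles the argument list; build_vzdump_command, the VM ID 100, and the storage name backup-nfs are hypothetical, and nothing is executed.

```python
VALID_MODES = ("snapshot", "suspend", "stop")


def build_vzdump_command(vmid, mode="snapshot", storage=None, compress="zstd"):
    """Build the argument list for a single-guest backup."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    cmd = ["vzdump", str(vmid), "--mode", mode, "--compress", compress]
    if storage:
        cmd += ["--storage", storage]  # target storage ID for the archive
    return cmd


print(" ".join(build_vzdump_command(100, mode="stop", storage="backup-nfs")))
# vzdump 100 --mode stop --compress zstd --storage backup-nfs
```

Rejecting unknown modes up front makes the trade-off explicit: the caller has to choose between uptime (snapshot) and the strongest consistency (stop).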
Section 7 Checkpoint
Summary:
- Backup: Independent Archive, essential for DR.
- Modes: Snapshot (Live), Suspend (Frozen), Stop (Consistent).
8. Managing Storage via CLI (pvesm)
Section 8 Checkpoint
Summary:
- pvesm helps when the GUI is unavailable.
- storage.cfg is the cluster-wide storage definition file.
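To show what the storage definition file holds, the sketch below parses a small illustrative sample in the storage.cfg layout (a "type: id" header followed by indented "key value" lines). The sample entries, addresses, and the function parse_storage_cfg are assumptions for illustration; on a real node the file lives at /etc/pve/storage.cfg and pvesm is the supported way to read and change it.

```python
# Illustrative sample, not taken from a real cluster.
SAMPLE_STORAGE_CFG = """\
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

nfs: shared-nfs
        server 10.0.0.50
        export /export/vms
        path /mnt/pve/shared-nfs
        content images
"""


def parse_storage_cfg(text):
    """Return {storage_id: {"type": ..., option: value, ...}}."""
    storages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace():
            # Unindented line: section header of the form "type: id".
            stype, _, sid = line.partition(":")
            current = {"type": stype.strip()}
            storages[sid.strip()] = current
        elif current is not None:
            # Indented line: "key value" option of the current storage.
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return storages


for sid, opts in parse_storage_cfg(SAMPLE_STORAGE_CFG).items():
    print(sid, opts["type"], opts.get("content"))
```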
9. Additional Resources
10. Lab Exercises
Summary
Review the key concepts covered in this week's material
Questions?