# How to Configure Pacemaker on Linux

Pacemaker is a powerful open-source cluster resource manager that provides high availability (HA) for Linux systems. It ensures that critical services remain operational even when individual nodes fail, making it an essential component for mission-critical environments. This guide walks through the full process of configuring Pacemaker on Linux, from initial setup to advanced resource management.

## Table of Contents

1. [Introduction to Pacemaker](#introduction-to-pacemaker)
2. [Prerequisites and Requirements](#prerequisites-and-requirements)
3. [Installation Process](#installation-process)
4. [Initial Cluster Configuration](#initial-cluster-configuration)
5. [Resource Configuration](#resource-configuration)
6. [Advanced Configuration Options](#advanced-configuration-options)
7. [Monitoring and Management](#monitoring-and-management)
8. [Troubleshooting Common Issues](#troubleshooting-common-issues)
9. [Best Practices and Tips](#best-practices-and-tips)
10. [Conclusion and Next Steps](#conclusion-and-next-steps)

## Introduction to Pacemaker

Pacemaker is the brain of a Linux high availability cluster. Working in conjunction with Corosync (the messaging layer), it provides automatic failover: it manages cluster resources such as IP addresses, file systems, databases, and applications, ensuring they remain available even when hardware or software failures occur.

### Key Components

- Pacemaker: the cluster resource manager
- Corosync: the cluster communication layer
- Resource agents: scripts that manage specific services
- Fencing agents: tools for isolating failed nodes

### Benefits of Using Pacemaker

- High availability: automatic failover reduces downtime
- Scalability: support for multiple nodes and complex configurations
- Flexibility: extensive resource agent library
- Monitoring: built-in health checking and alerting
- Standards compliance: follows industry best practices

## Prerequisites and Requirements

Before configuring Pacemaker, ensure your environment meets the following requirements.

### System Requirements

- Operating system: RHEL/CentOS 7+, Ubuntu 18.04+, SLES 12+, or Debian 9+
- Memory: minimum 2 GB RAM per node (4 GB+ recommended)
- Storage: at least 20 GB of available disk space
- Network: dedicated network interfaces for cluster communication
- Time synchronization: NTP or Chrony configured on all nodes (see the sketch at the end of this section)

### Network Configuration

```bash
# Example network layout for a two-node cluster
# Node 1:     cluster-node1 (192.168.1.10)
# Node 2:     cluster-node2 (192.168.1.11)
# Virtual IP: 192.168.1.100
```

### User Permissions

- Root access or sudo privileges on all cluster nodes
- SSH key-based authentication between nodes (recommended)

### Firewall Configuration

Ensure the following ports are open between cluster nodes:

```bash
# Corosync communication
sudo firewall-cmd --permanent --add-port=5404-5406/udp

# firewalld's high-availability service (pcsd on 2224/tcp, Pacemaker Remote, and related ports)
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload
```
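Name resolution and time synchronization are the prerequisites most often skipped. The sketch below covers both; it uses the example addresses from the network layout above and assumes chrony as the NTP implementation, so substitute your own values. Run it on every node:

```bash
# Make cluster hostnames resolvable even if DNS fails
# (example addresses from the layout above)
echo "192.168.1.10 cluster-node1" | sudo tee -a /etc/hosts
echo "192.168.1.11 cluster-node2" | sudo tee -a /etc/hosts

# Install and enable chrony
sudo dnf install -y chrony          # RHEL/CentOS
sudo apt install -y chrony          # Ubuntu/Debian (unit is named "chrony" there)
sudo systemctl enable --now chronyd

# Confirm the clock is actually synchronized
chronyc tracking
```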
## Installation Process

### Installing Pacemaker on RHEL/CentOS

```bash
# Enable the High Availability repository first
# (RHEL: the HA add-on via subscription-manager; CentOS Stream: the repo
# below, whose name varies by release; CentOS 7 ships the packages in base)
sudo dnf config-manager --set-enabled highavailability

# Install Pacemaker and related packages
sudo dnf install -y pacemaker corosync pcs fence-agents-all

# Start and enable the pcsd daemon
sudo systemctl enable --now pcsd

# Set a password for the hacluster user
sudo passwd hacluster
```

### Installing Pacemaker on Ubuntu/Debian

```bash
# Update the package index
sudo apt update

# Install Pacemaker, Corosync, and pcs; this guide manages the cluster with
# pcs (crmsh is an alternative shell). The pcs commands shown here assume
# pcs 0.10+, available in Ubuntu 20.04+ and Debian 11+.
sudo apt install -y pacemaker corosync pcs fence-agents

# Start and enable pcsd, and set the hacluster password
sudo systemctl enable --now pcsd
sudo passwd hacluster
```

Do not enable the corosync and pacemaker units by hand; `pcs cluster start` and `pcs cluster enable` manage them once the cluster exists.

### Post-Installation Verification

```bash
# Check that pcsd is running (pacemaker and corosync stay inactive
# until the cluster is created and started)
sudo systemctl status pcsd

# Confirm the tooling works; this reports that no cluster is running yet
sudo pcs status
```

## Initial Cluster Configuration

### Setting Up Authentication

First, configure authentication between cluster nodes:

```bash
# Authenticate the nodes with pcs, using the hacluster password
sudo pcs host auth cluster-node1 cluster-node2 -u hacluster -p your_password

# Verify that pcsd is reachable on all nodes
sudo pcs status pcsd
```

### Creating the Cluster

```bash
# Create the cluster (run on one node only)
sudo pcs cluster setup mycluster cluster-node1 cluster-node2

# Start cluster services on all nodes
sudo pcs cluster start --all

# Enable cluster services to start on boot
sudo pcs cluster enable --all

# Check cluster status
sudo pcs status
```

### Basic Cluster Properties

Configure essential cluster properties:

```bash
# Disable STONITH for initial testing only; always enable it in production
sudo pcs property set stonith-enabled=false

# Relax the quorum policy for a two-node cluster
# (often unnecessary: pcs configures corosync with two_node: 1 for
# two-node clusters, which already handles quorum sensibly)
sudo pcs property set no-quorum-policy=ignore

# Set a default resource stickiness so healthy resources stay put
# (the cluster name itself was already set by 'pcs cluster setup')
sudo pcs resource defaults resource-stickiness=100
```

## Resource Configuration

### Creating a Virtual IP Resource

A virtual IP address is one of the most common cluster resources:

```bash
# Create the virtual IP resource
sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s

# Check resource status
sudo pcs status resources
```

### Configuring a Web Server Resource

Example of creating an Apache web server resource:

```bash
# Install Apache on all nodes
sudo yum install -y httpd      # RHEL/CentOS
sudo apt install -y apache2    # Ubuntu/Debian

# Create the Apache resource
sudo pcs resource create WebServer ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf op monitor interval=1min

# Group it with the virtual IP so they stay together
sudo pcs resource group add WebGroup VirtualIP WebServer

# Verify the configuration
sudo pcs resource config
```

### Database Resource Configuration

Setting up a MySQL/MariaDB cluster resource:

```bash
# Create the MySQL resource
sudo pcs resource create MySQL ocf:heartbeat:mysql \
    binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
    datadir="/var/lib/mysql" pid="/var/lib/mysql/mysql.pid" \
    socket="/var/lib/mysql/mysql.sock" \
    op start timeout=60s op stop timeout=60s \
    op monitor interval=20s timeout=30s

# A resource can belong to only one group, and VirtualIP is already in
# WebGroup, so give the database its own group (and its own IP if needed)
sudo pcs resource group add DBGroup MySQL
```

### File System Resources

Configuring shared file system resources:

```bash
# Create the file system resource
sudo pcs resource create SharedFS ocf:heartbeat:Filesystem \
    device="/dev/sdb1" directory="/shared" fstype="ext4" \
    op monitor interval=20s

# Keep the file system with the virtual IP, and mount it after the IP is up
sudo pcs constraint colocation add SharedFS with VirtualIP INFINITY
sudo pcs constraint order VirtualIP then SharedFS
```

## Advanced Configuration Options

### Resource Constraints

#### Location Constraints

Control where resources can run:

```bash
# Prefer cluster-node1 for WebServer (score 50)
sudo pcs constraint location WebServer prefers cluster-node1=50

# Prevent a resource from running on a specific node
sudo pcs constraint location MySQL avoids cluster-node2
```

#### Colocation Constraints

Ensure resources run together:

```bash
# Keep VirtualIP and WebServer on the same node
sudo pcs constraint colocation add WebServer with VirtualIP INFINITY
```

#### Order Constraints

Define startup and shutdown order:

```bash
# Start VirtualIP before WebServer
sudo pcs constraint order VirtualIP then WebServer
```
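Once several constraints are in place, their scores combine with resource stickiness, and it is not always obvious where a resource will land. Two standard Pacemaker tools let you check the scheduler's reasoning:

```bash
# Show which node currently hosts a resource
sudo crm_resource --resource WebServer --locate

# Show the allocation scores the scheduler computes for every
# resource/node pair, based on the live cluster state
sudo crm_simulate --live-check --show-scores
```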
### Resource Groups vs. Clones

#### Resource Groups

```bash
# Create a resource group (members start in order and run on the same node;
# note that this moves the listed resources out of any previous group)
sudo pcs resource group add WebCluster VirtualIP WebServer SharedFS

# View the group configuration
sudo pcs resource config WebCluster
```

#### Clone Resources

For services that can run on multiple nodes:

```bash
# Create a clone resource (DLM is managed by the controld agent)
sudo pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

# Create a promotable (master/slave) resource, e.g. for DRBD
sudo pcs resource create DRBD ocf:linbit:drbd drbd_resource=r0 \
    op monitor interval=60s role=Master \
    op monitor interval=59s role=Slave \
    promotable promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```

### STONITH Configuration

STONITH (Shoot The Other Node In The Head) is crucial for production clusters:

```bash
# List available fence agents
sudo pcs stonith list

# Configure IPMI fencing: one device per node, pointed at that node's BMC
sudo pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list="cluster-node1" ip="192.168.1.101" \
    username="admin" password="password" lanplus=1
sudo pcs stonith create fence-node2 fence_ipmilan \
    pcmk_host_list="cluster-node2" ip="192.168.1.102" \
    username="admin" password="password" lanplus=1

# Enable STONITH
sudo pcs property set stonith-enabled=true

# Test fencing (this really power-cycles the node)
sudo pcs stonith fence cluster-node2
```

## Monitoring and Management

### Cluster Status Commands

```bash
# Overall cluster status
sudo pcs status

# Detailed status
sudo pcs status --full

# Resource-specific configuration
sudo pcs resource config WebServer

# Node status
sudo pcs status nodes

# Constraint information
sudo pcs constraint config
```

### Log Management

```bash
# View cluster logs
sudo journalctl -u pacemaker
sudo journalctl -u corosync

# Real-time log monitoring (paths apply where file logging is enabled)
sudo tail -f /var/log/cluster/corosync.log
sudo tail -f /var/log/pacemaker/pacemaker.log
```

### Performance Monitoring

```bash
# One-shot snapshot of cluster state
sudo crm_mon -1

# Resource utilization attributes
sudo pcs resource utilization

# Quorum and membership statistics
sudo corosync-quorumtool -s
```

## Troubleshooting Common Issues

### Split-Brain Prevention

Split-brain occurs when cluster nodes can't communicate but continue operating:

```bash
# View the quorum configuration
sudo pcs quorum config

# For two-node clusters (see the caveat under "Basic Cluster Properties")
sudo pcs property set no-quorum-policy=ignore

# Add a quorum device for better split-brain protection
sudo pcs quorum device add model net host=qnetd-server algorithm=ffsplit
```

### Resource Failures

When resources fail to start or stop:

```bash
# Check for resource failures
sudo pcs status

# Clear recorded failures
sudo pcs resource cleanup WebServer

# Force a resource to a specific node
sudo pcs resource move WebServer cluster-node1

# Remove the move constraint afterwards
sudo pcs resource clear WebServer
```

### Network Issues

Diagnosing cluster communication problems:

```bash
# Check quorum state
sudo corosync-quorumtool

# Check ring/link status
sudo corosync-cfgtool -s

# Verify cluster membership
sudo pcs status corosync
```

### Node Issues

Handling problematic cluster nodes:

```bash
# Put a node in standby mode
sudo pcs node standby cluster-node1

# Take a node out of standby
sudo pcs node unstandby cluster-node1

# Remove a failed node from the cluster
sudo pcs cluster node remove cluster-node1
```

### Configuration Errors

```bash
# Validate the cluster configuration
sudo crm_verify -L -V

# Review the full configuration
sudo pcs config show

# Back up and restore the configuration
sudo pcs config backup mycluster-backup
sudo pcs config restore mycluster-backup.tar.bz2
```
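When a resource keeps failing on one node, repeated manual cleanups get tedious. Pacemaker can evict a flapping resource automatically through the migration-threshold and failure-timeout meta attributes; the values below are illustrative starting points rather than recommendations:

```bash
# Move WebServer away from a node after 3 failures there,
# and expire recorded failures after 10 minutes (600 seconds)
sudo pcs resource meta WebServer migration-threshold=3 failure-timeout=600

# Inspect the current failure count
sudo pcs resource failcount show WebServer
```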
## Best Practices and Tips

### Security Best Practices

1. Enable STONITH: always configure fencing in production environments
2. Network isolation: use dedicated network interfaces for cluster traffic
3. Authentication: implement strong authentication between nodes
4. Firewall configuration: properly configure firewalls to allow cluster traffic

```bash
# Example secure cluster configuration
sudo pcs property set stonith-enabled=true
sudo pcs property set stonith-action=off
sudo pcs property set stonith-timeout=60s
```

### Performance Optimization

1. Resource stickiness: configure appropriate stickiness values
2. Monitoring intervals: balance responsiveness against system load
3. Timeout values: set realistic timeouts for resource operations

```bash
# Raise the default operation timeout and increase stickiness
sudo pcs resource op defaults timeout=60s
sudo pcs resource defaults resource-stickiness=1000
```

### Maintenance Procedures

```bash
# Put the cluster in maintenance mode (resources keep running but are unmanaged)
sudo pcs property set maintenance-mode=true

# ... perform maintenance tasks ...

# Exit maintenance mode
sudo pcs property set maintenance-mode=false
```

### Backup and Recovery

```bash
# Regular configuration backup
sudo pcs config backup /backup/cluster-config-$(date +%Y%m%d)

# Export the resource configuration
sudo pcs resource config > /backup/resources.cfg

# Document the cluster topology
sudo pcs status > /backup/cluster-status-$(date +%Y%m%d).txt
```

### Testing Procedures

1. Planned failover testing: regularly test resource migration
2. Unplanned failure simulation: test node failures and recovery
3. STONITH testing: verify fencing mechanisms work correctly

```bash
# Test resource migration
sudo pcs resource move WebServer cluster-node2
sudo pcs resource clear WebServer

# Simulate a node failure and verify that resources migrate
sudo pcs node standby cluster-node1
sudo pcs status
sudo pcs node unstandby cluster-node1
```
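To make failover drills repeatable, the individual test commands above can be wrapped in a small script. The sketch below is illustrative only; it assumes the example resource and node names used throughout this guide:

```bash
#!/usr/bin/env bash
# failover-drill.sh: planned-failover drill for a single resource
set -euo pipefail

RESOURCE=WebServer      # resource to migrate (example name from this guide)
TARGET=cluster-node2    # node to migrate it to

echo "Location before the drill:"
sudo crm_resource --resource "$RESOURCE" --locate

# Move the resource, give the cluster time to settle, then verify
sudo pcs resource move "$RESOURCE" "$TARGET"
sleep 30
echo "Location after the drill:"
sudo crm_resource --resource "$RESOURCE" --locate

# Remove the temporary move constraint so normal placement policy applies again
sudo pcs resource clear "$RESOURCE"
```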
## Conclusion and Next Steps

Configuring Pacemaker on Linux provides a robust foundation for high availability clustering. This guide has covered the essential aspects of Pacemaker configuration, from basic setup to advanced resource management and troubleshooting.

### Key Takeaways

- Proper planning: successful cluster deployment requires careful planning of network, storage, and application architecture
- Incremental implementation: start with basic configurations and gradually add complexity
- Testing is critical: regular testing ensures the cluster behaves correctly when failures occur
- Documentation: maintain detailed documentation of cluster configuration and procedures

### Next Steps

1. Advanced features: explore multi-site clustering and disaster recovery configurations
2. Integration: integrate with monitoring systems like Nagios or Zabbix
3. Automation: consider configuration management tools like Ansible for cluster deployment
4. Training: invest in team training for ongoing cluster management

### Additional Resources

- Official documentation: refer to the Pacemaker project documentation for detailed technical information
- Community support: engage with the Pacemaker community through mailing lists and forums
- Professional services: consider professional support for mission-critical deployments

You now have the knowledge and tools to configure and manage Pacemaker clusters on Linux. Remember that high availability is not just about technology: it requires ongoing monitoring, maintenance, and testing to keep critical services available when your organization needs them most. The investment in a properly configured Pacemaker cluster pays dividends in reduced downtime, improved service reliability, and enhanced business continuity.

As you gain experience with Pacemaker, you'll discover additional features and optimizations that can further improve your cluster's performance and reliability.