How to test failover on Linux clusters - High Availability & Clustering Guide

How to Test Failover on Linux Clusters Introduction High availability is a critical requirement for modern enterprise applications, and Linux clusters provide the foundation for achieving this goal. Failover testing is an essential practice that ensures your cluster can automatically switch services from a failed node to a healthy one without significant downtime or data loss. This comprehensive guide will walk you through the complete process of testing failover mechanisms on Linux clusters, from basic concepts to advanced testing scenarios. By the end of this article, you'll understand how to plan, execute, and validate failover tests across different cluster technologies including Pacemaker, Corosync, and various application-specific clustering solutions. Whether you're managing database clusters, web services, or critical business applications, proper failover testing ensures your systems can withstand hardware failures, network issues, and planned maintenance windows. Prerequisites and Requirements Before diving into failover testing procedures, ensure you have the following prerequisites in place: System Requirements - Linux Distribution: RHEL/CentOS 7+, Ubuntu 18.04+, or SUSE Linux Enterprise Server 12+ - Minimum Hardware: At least 2 cluster nodes with adequate CPU, memory, and storage - Network Configuration: Dedicated cluster network with proper VLAN configuration - Storage: Shared storage accessible by all cluster nodes (SAN, NFS, or distributed storage) Software Components - Cluster Stack: Pacemaker and Corosync installed and configured - Resource Agents: Appropriate resource agents for your applications - Monitoring Tools: Cluster monitoring software (optional but recommended) - Administrative Access: Root or sudo privileges on all cluster nodes Knowledge Prerequisites - Basic understanding of Linux system administration - Familiarity with cluster concepts and terminology - Understanding of your specific application architecture - Knowledge of network troubleshooting basics Understanding Cluster Failover Mechanisms Types of Failover Active-Passive Failover: Only one node actively serves requests while others remain on standby. When the active node fails, a passive node takes over. Active-Active Failover: Multiple nodes simultaneously serve requests. When one node fails, the remaining nodes absorb the additional load. N+1 Redundancy: N nodes handle normal load, with one additional node serving as a hot spare. Failover Triggers Common scenarios that trigger failover include: - Hardware failures (CPU, memory, disk, network interface) - Software crashes or hangs - Network connectivity issues - Resource exhaustion - Manual failover for maintenance Step-by-Step Failover Testing Procedures Phase 1: Pre-Test Preparation 1.1 Document Current Cluster State Before beginning any failover tests, document the current cluster configuration and status: ```bash Check cluster status pcs status View cluster configuration pcs config Check resource constraints pcs constraint list Verify node status crm_mon -1 ``` 1.2 Create Test Plan Develop a comprehensive test plan that includes: - Test objectives and success criteria - Specific failure scenarios to simulate - Expected recovery times (RTO - Recovery Time Objective) - Data loss tolerances (RPO - Recovery Point Objective) - Rollback procedures if tests fail 1.3 Establish Baseline Metrics Collect baseline performance metrics: ```bash Monitor cluster resources watch -n 2 'pcs status' Check application response times curl -w "@curl-format.txt" -o /dev/null -s "http://cluster-vip/health" Monitor system resources iostat -x 1 vmstat 1 ``` Phase 2: Basic Failover Testing 2.1 Graceful Node Shutdown Test Start with the least disruptive test by gracefully shutting down a cluster node: ```bash Put node in standby mode pcs node standby node1 Monitor failover process watch 'pcs status' Verify resources moved to other nodes pcs resource show Bring node back online pcs node unstandby node1 ``` Expected Behavior: Resources should migrate cleanly to available nodes within the configured timeout period. 2.2 Service Stop Test Test application-level failover by stopping specific services: ```bash Stop a cluster resource pcs resource disable web-server Monitor cluster response crm_mon -1 Verify failover completed pcs status resources Re-enable the resource pcs resource enable web-server ``` 2.3 Network Interface Failover Test Simulate network failures by disabling network interfaces: ```bash Disable primary network interface sudo ip link set eth0 down Monitor cluster communication tail -f /var/log/cluster/corosync.log Verify cluster maintains quorum pcs status Re-enable interface sudo ip link set eth0 up ``` Phase 3: Advanced Failover Scenarios 3.1 Simulated Hardware Failure Test more realistic failure scenarios using system tools: ```bash Simulate high CPU load stress --cpu 8 --timeout 300s Simulate memory pressure stress --vm 2 --vm-bytes 1G --timeout 300s Simulate disk I/O issues stress --io 4 --hdd 2 --timeout 300s ``` 3.2 Network Partition Testing Create network partitions to test split-brain scenarios: ```bash Block communication between specific nodes iptables -A INPUT -s node2_ip -j DROP iptables -A OUTPUT -d node2_ip -j DROP Monitor cluster behavior watch 'pcs status' Restore communication iptables -D INPUT -s node2_ip -j DROP iptables -D OUTPUT -d node2_ip -j DROP ``` 3.3 Storage Failover Testing Test storage-related failures: ```bash Simulate storage unavailability sudo umount /shared/storage Monitor cluster response to storage failure tail -f /var/log/messages Verify resource behavior pcs status resources Restore storage access sudo mount /shared/storage ``` Phase 4: Database Cluster Failover Testing For database clusters, additional testing considerations apply: 4.1 MySQL/MariaDB Cluster Testing ```bash Check current master status mysql -e "SHOW MASTER STATUS;" Stop MySQL service on master systemctl stop mysql Verify failover to slave mysql -h cluster-vip -e "SHOW MASTER STATUS;" Check replication status mysql -e "SHOW SLAVE STATUS\G" ``` 4.2 PostgreSQL Cluster Testing ```bash Check current primary server patronictl list Simulate primary failure systemctl stop postgresql Monitor failover process patronictl list Verify new primary is functional psql -h cluster-vip -c "SELECT pg_is_in_recovery();" ``` Phase 5: Application-Level Failover Validation 5.1 Web Application Testing ```bash Continuous application monitoring during failover while true; do curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://cluster-vip/ sleep 1 done Load testing during failover ab -n 1000 -c 10 http://cluster-vip/ ``` 5.2 Session Persistence Testing Verify that user sessions are maintained during failover: ```bash Create test session curl -c cookies.txt -b cookies.txt http://cluster-vip/login Trigger failover while maintaining session curl -c cookies.txt -b cookies.txt http://cluster-vip/protected-page ``` Monitoring and Validation Real-Time Monitoring During Tests Set up comprehensive monitoring to track failover behavior: ```bash Terminal 1: Cluster status monitoring watch -n 1 'pcs status' Terminal 2: System resource monitoring watch -n 1 'free -h && uptime' Terminal 3: Application monitoring watch -n 1 'curl -s -o /dev/null -w "Response: %{http_code} Time: %{time_total}s\n" http://cluster-vip/' Terminal 4: Log monitoring tail -f /var/log/cluster/corosync.log /var/log/pacemaker.log ``` Post-Failover Validation Checklist After each failover test, validate the following: 1. Cluster Status: All nodes show correct status 2. Resource Location: Resources are running on expected nodes 3. Application Functionality: All application features work correctly 4. Data Integrity: No data corruption or loss occurred 5. Performance: Response times within acceptable ranges 6. Logs: No critical errors in cluster or application logs Common Issues and Troubleshooting Split-Brain Scenarios Problem: Cluster nodes cannot communicate, leading to multiple nodes believing they're the primary. Solution: ```bash Configure proper quorum policies pcs property set no-quorum-policy=stop Implement STONITH (Shoot The Other Node In The Head) pcs stonith create fence_node1 fence_ipmilan \ pcmk_host_list="node1" ipaddr="node1-ipmi" \ login="admin" passwd="password" ``` Resource Start Failures Problem: Resources fail to start on failover target nodes. Troubleshooting Steps: ```bash Check resource configuration pcs resource show resource_name Verify resource agent logs grep resource_name /var/log/pacemaker.log Test resource agent manually /usr/lib/ocf/resource.d/provider/agent start Check resource dependencies pcs constraint list ``` Slow Failover Times Problem: Failover takes longer than expected. Solutions: ```bash Adjust failure detection timeouts pcs resource update resource_name op monitor interval=10s timeout=30s Configure faster failure detection pcs property set stonith-timeout=60s Optimize resource agent parameters pcs resource update resource_name op start timeout=60s ``` Network Communication Issues Problem: Cluster nodes lose communication intermittently. Diagnostic Commands: ```bash Check network connectivity ping -c 5 other_node_ip Verify cluster network configuration corosync-cfgtool -s Check firewall rules iptables -L -n Monitor network traffic tcpdump -i eth0 port 5405 ``` Storage-Related Failures Problem: Shared storage becomes unavailable during failover. Resolution Steps: ```bash Check storage connectivity ls -la /shared/storage/ Verify mount points mount | grep shared Check storage cluster status (if using clustered storage) pcs status Test storage performance dd if=/dev/zero of=/shared/storage/test bs=1M count=100 ``` Best Practices and Professional Tips Testing Frequency and Scheduling - Regular Testing: Conduct basic failover tests monthly - Comprehensive Testing: Perform full failover scenarios quarterly - Maintenance Windows: Schedule disruptive tests during planned maintenance - Documentation: Keep detailed records of all test results Test Environment Considerations ```bash Create isolated test environments Use network namespaces for isolation ip netns add test-cluster ip netns exec test-cluster bash Implement proper change control git init /etc/cluster/ git add /etc/cluster/ git commit -m "Baseline cluster configuration" ``` Automation and Scripting Develop automated testing scripts for consistent results: ```bash #!/bin/bash Automated failover test script LOG_FILE="/var/log/failover-test-$(date +%Y%m%d-%H%M%S).log" log_message() { echo "$(date): $1" | tee -a $LOG_FILE } test_failover() { local node_to_fail=$1 log_message "Starting failover test for $node_to_fail" # Record baseline pcs status > /tmp/baseline-status.txt # Trigger failover pcs node standby $node_to_fail # Monitor failover completion timeout=300 while [ $timeout -gt 0 ]; do if pcs status | grep -q "Started.*$node_to_fail" == false; then log_message "Failover completed successfully" break fi sleep 5 timeout=$((timeout-5)) done # Validate results validate_cluster_health # Restore node pcs node unstandby $node_to_fail } validate_cluster_health() { # Add validation logic here log_message "Validating cluster health" } Execute tests test_failover "node1" ``` Security Considerations - Access Control: Limit who can perform failover tests - Audit Logging: Enable comprehensive audit logging - Network Security: Secure cluster communication channels - Authentication: Use strong authentication for cluster management Performance Optimization ```bash Optimize cluster timing parameters pcs property set cluster-recheck-interval=60s pcs property set migration-threshold=3 pcs property set failure-timeout=300s Configure resource stickiness pcs resource update web-server resource-stickiness=100 Implement proper resource ordering pcs constraint order storage-service then web-service ``` Documentation and Reporting Maintain comprehensive documentation including: - Test procedures and results - Configuration changes made during testing - Performance metrics and trends - Lessons learned and improvements - Emergency contact information Advanced Testing Scenarios Disaster Recovery Testing Test complete site failures: ```bash Simulate entire datacenter failure for node in node1 node2 node3; do pcs node standby $node done Verify DR site activation (Implementation depends on DR architecture) ``` Cascading Failure Testing Test multiple simultaneous failures: ```bash Script to simulate cascading failures #!/bin/bash simulate_cascade() { # Fail primary storage systemctl stop storage-service sleep 30 # Fail primary network ip link set eth0 down sleep 30 # Fail backup node pcs node standby backup-node # Monitor cluster response watch 'pcs status' } ``` Load Testing During Failover Combine failover testing with load testing: ```bash Start load test siege -c 50 -t 300s http://cluster-vip/ & Trigger failover during load test sleep 60 pcs node standby primary-node Monitor performance impact iostat -x 1 & vmstat 1 & ``` Conclusion and Next Steps Proper failover testing is crucial for maintaining high availability in Linux cluster environments. By following the comprehensive procedures outlined in this guide, you can ensure your cluster systems will respond appropriately to various failure scenarios. Key Takeaways 1. Start Simple: Begin with basic graceful failover tests before attempting complex scenarios 2. Monitor Everything: Comprehensive monitoring during tests provides valuable insights 3. Document Results: Maintain detailed records of all test activities and outcomes 4. Automate When Possible: Develop scripts for consistent and repeatable testing 5. Test Regularly: Regular testing ensures your cluster remains resilient over time Next Steps After mastering basic failover testing, consider exploring: - Advanced cluster architectures like geo-distributed clusters - Integration with cloud platforms for hybrid failover scenarios - Container orchestration failover using Kubernetes or similar platforms - Database-specific clustering solutions like MySQL Group Replication or PostgreSQL Patroni - Monitoring and alerting integration with tools like Prometheus and Grafana Continuous Improvement Failover testing should be an ongoing process that evolves with your infrastructure. Regular review and refinement of your testing procedures will help ensure your Linux clusters continue to provide the high availability your applications require. Remember that each cluster environment is unique, and testing procedures should be tailored to your specific applications, infrastructure, and business requirements. Start with the fundamentals covered in this guide, then adapt and expand your testing strategy based on your organization's needs and lessons learned from actual testing experiences. By implementing comprehensive failover testing practices, you'll build confidence in your cluster's reliability and ensure your critical applications remain available when your users need them most.