How to Test Failover on Linux Clusters
Introduction
High availability is a critical requirement for modern enterprise applications, and Linux clusters provide the foundation for achieving this goal. Failover testing is an essential practice that ensures your cluster can automatically switch services from a failed node to a healthy one without significant downtime or data loss. This comprehensive guide will walk you through the complete process of testing failover mechanisms on Linux clusters, from basic concepts to advanced testing scenarios.
By the end of this article, you'll understand how to plan, execute, and validate failover tests across different cluster technologies including Pacemaker, Corosync, and various application-specific clustering solutions. Whether you're managing database clusters, web services, or critical business applications, proper failover testing ensures your systems can withstand hardware failures, network issues, and planned maintenance windows.
Prerequisites and Requirements
Before diving into failover testing procedures, ensure you have the following prerequisites in place:
System Requirements
- Linux Distribution: RHEL/CentOS 7+, Ubuntu 18.04+, or SUSE Linux Enterprise Server 12+
- Minimum Hardware: At least 2 cluster nodes with adequate CPU, memory, and storage
- Network Configuration: Dedicated cluster network with proper VLAN configuration
- Storage: Shared storage accessible by all cluster nodes (SAN, NFS, or distributed storage)
Software Components
- Cluster Stack: Pacemaker and Corosync installed and configured
- Resource Agents: Appropriate resource agents for your applications
- Monitoring Tools: Cluster monitoring software (optional but recommended)
- Administrative Access: Root or sudo privileges on all cluster nodes
Knowledge Prerequisites
- Basic understanding of Linux system administration
- Familiarity with cluster concepts and terminology
- Understanding of your specific application architecture
- Knowledge of network troubleshooting basics
Understanding Cluster Failover Mechanisms
Types of Failover
Active-Passive Failover: Only one node actively serves requests while others remain on standby. When the active node fails, a passive node takes over.
Active-Active Failover: Multiple nodes simultaneously serve requests. When one node fails, the remaining nodes absorb the additional load.
N+1 Redundancy: N nodes handle normal load, with one additional node serving as a hot spare.
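In Pacemaker terms, the difference between these models often comes down to how a resource is defined. The commands below are a minimal sketch, not a production configuration: the web-server resource name, the Apache agent, and its parameters are illustrative and should be replaced with the agents and options used in your cluster.
```bash
# Active-passive (illustrative): a single resource instance that Pacemaker runs
# on one node and relocates to a surviving node on failure.
pcs resource create web-server ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf \
    op monitor interval=30s

# Active-active (illustrative): clone the resource so two nodes serve it at once.
pcs resource clone web-server clone-max=2 clone-node-max=1
```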
Failover Triggers
Common scenarios that trigger failover include:
- Hardware failures (CPU, memory, disk, network interface)
- Software crashes or hangs
- Network connectivity issues
- Resource exhaustion
- Manual failover for maintenance (see the sketch after this list)
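For the manual case, Pacemaker can relocate a resource without taking anything offline. The commands below are a sketch using the hypothetical web-server resource from later examples:
```bash
# Move the resource off its current node (this creates a temporary location constraint)
pcs resource move web-server node2

# After maintenance, remove that constraint so normal placement rules apply again
pcs resource clear web-server
```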
Step-by-Step Failover Testing Procedures
Phase 1: Pre-Test Preparation
1.1 Document Current Cluster State
Before beginning any failover tests, document the current cluster configuration and status:
```bash
# Check cluster status
pcs status

# View cluster configuration
pcs config

# Check resource constraints
pcs constraint list

# Verify node status
crm_mon -1
```
1.2 Create Test Plan
Develop a comprehensive test plan that includes:
- Test objectives and success criteria
- Specific failure scenarios to simulate
- Expected recovery times (RTO - Recovery Time Objective); a simple way to measure this is sketched after this list
- Data loss tolerances (RPO - Recovery Point Objective)
- Rollback procedures if tests fail
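One straightforward way to measure RTO during a test is to poll the clustered service and log when it stops and resumes responding. The sketch below assumes an HTTP health endpoint behind the cluster's virtual IP; cluster-vip and /health are placeholders for your environment.
```bash
#!/bin/bash
# Poll the service once per second and log state changes; the gap between the
# last "DOWN" line and the next "UP" line approximates the observed RTO.
last_state=""
while true; do
    if curl -sf -o /dev/null --max-time 2 http://cluster-vip/health; then
        state="UP"
    else
        state="DOWN"
    fi
    if [ "$state" != "$last_state" ]; then
        echo "$(date +%T) service is $state"
        last_state=$state
    fi
    sleep 1
done
```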
1.3 Establish Baseline Metrics
Collect baseline performance metrics:
```bash
# Monitor cluster resources
watch -n 2 'pcs status'

# Check application response times (curl-format.txt template shown after this block)
curl -w "@curl-format.txt" -o /dev/null -s "http://cluster-vip/health"

# Monitor system resources
iostat -x 1
vmstat 1
```
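The response-time check above reads its output format from a local curl-format.txt file, which you need to create first. A minimal example using standard curl write-out variables:
```bash
# Write a simple curl -w template; all variables are standard curl write-out fields
cat > curl-format.txt <<'EOF'
http_code:          %{http_code}
time_connect:       %{time_connect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
```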
Phase 2: Basic Failover Testing
2.1 Graceful Node Shutdown Test
Start with the least disruptive test by gracefully shutting down a cluster node:
```bash
# Put node in standby mode
pcs node standby node1

# Monitor the failover process
watch 'pcs status'

# Verify resources moved to the other nodes
pcs resource show

# Bring the node back online
pcs node unstandby node1
```
Expected Behavior: Resources should migrate cleanly to available nodes within the configured timeout period.
2.2 Service Stop Test
Test application-level failover by stopping specific services:
```bash
# Stop a cluster resource
pcs resource disable web-server

# Monitor cluster response
crm_mon -1

# Verify failover completed
pcs status resources

# Re-enable the resource
pcs resource enable web-server
```
2.3 Network Interface Failover Test
Simulate network failures by disabling network interfaces:
```bash
# Disable the primary network interface
sudo ip link set eth0 down

# Monitor cluster communication
tail -f /var/log/cluster/corosync.log

# Verify the cluster maintains quorum
pcs status

# Re-enable the interface
sudo ip link set eth0 up
```
Phase 3: Advanced Failover Scenarios
3.1 Simulated Hardware Failure
Test more realistic failure scenarios using system tools:
```bash
# Simulate high CPU load
stress --cpu 8 --timeout 300s

# Simulate memory pressure
stress --vm 2 --vm-bytes 1G --timeout 300s

# Simulate disk I/O issues
stress --io 4 --hdd 2 --timeout 300s
```
3.2 Network Partition Testing
Create network partitions to test split-brain scenarios:
```bash
# Block communication with a specific node (replace node2_ip with its address)
iptables -A INPUT -s node2_ip -j DROP
iptables -A OUTPUT -d node2_ip -j DROP

# Monitor cluster behavior
watch 'pcs status'

# Restore communication
iptables -D INPUT -s node2_ip -j DROP
iptables -D OUTPUT -d node2_ip -j DROP
```
3.3 Storage Failover Testing
Test storage-related failures:
```bash
# Simulate storage unavailability
sudo umount /shared/storage

# Monitor cluster response to the storage failure
tail -f /var/log/messages

# Verify resource behavior
pcs status resources

# Restore storage access
sudo mount /shared/storage
```
Phase 4: Database Cluster Failover Testing
For database clusters, additional testing considerations apply:
4.1 MySQL/MariaDB Cluster Testing
```bash
# Check current master status
mysql -e "SHOW MASTER STATUS;"

# Stop the MySQL service on the master
systemctl stop mysql

# Verify failover by querying through the cluster VIP
mysql -h cluster-vip -e "SHOW MASTER STATUS;"

# Check replication status
mysql -e "SHOW SLAVE STATUS\G"
```
4.2 PostgreSQL Cluster Testing
```bash
# Check the current primary server
patronictl list

# Simulate primary failure (with Patroni, stop the Patroni service rather than
# PostgreSQL directly, otherwise Patroni will simply restart the database)
sudo systemctl stop patroni

# Monitor the failover process
patronictl list

# Verify the new primary is functional
psql -h cluster-vip -c "SELECT pg_is_in_recovery();"
```
Phase 5: Application-Level Failover Validation
5.1 Web Application Testing
```bash
# Continuous application monitoring during failover
while true; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://cluster-vip/
    sleep 1
done

# Load testing during failover
ab -n 1000 -c 10 http://cluster-vip/
```
5.2 Session Persistence Testing
Verify that user sessions are maintained during failover:
```bash
# Create a test session
curl -c cookies.txt -b cookies.txt http://cluster-vip/login

# Trigger a failover (e.g. pcs node standby), then confirm the session still works
curl -c cookies.txt -b cookies.txt http://cluster-vip/protected-page
```
Monitoring and Validation
Real-Time Monitoring During Tests
Set up comprehensive monitoring to track failover behavior:
```bash
# Terminal 1: Cluster status monitoring
watch -n 1 'pcs status'

# Terminal 2: System resource monitoring
watch -n 1 'free -h && uptime'

# Terminal 3: Application monitoring
watch -n 1 'curl -s -o /dev/null -w "Response: %{http_code} Time: %{time_total}s\n" http://cluster-vip/'

# Terminal 4: Log monitoring
tail -f /var/log/cluster/corosync.log /var/log/pacemaker.log
```
Post-Failover Validation Checklist
After each failover test, validate the following (a scripted version of these checks is sketched after the list):
1. Cluster Status: All nodes show correct status
2. Resource Location: Resources are running on expected nodes
3. Application Functionality: All application features work correctly
4. Data Integrity: No data corruption or loss occurred
5. Performance: Response times within acceptable ranges
6. Logs: No critical errors in cluster or application logs
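Much of this checklist can be scripted so every test is validated the same way. The following is a rough sketch rather than a complete health check; the virtual IP URL and log path are placeholders to adapt to your environment.
```bash
#!/bin/bash
# Basic post-failover sanity checks; prints FAIL lines and returns non-zero on problems.
fail=0

# 1-2. Cluster status and resource location: no offline nodes, no failed actions
pcs status | grep -qi "offline" && { echo "FAIL: one or more nodes offline"; fail=1; }
pcs status | grep -qi "failed"  && { echo "FAIL: failed resource actions reported"; fail=1; }

# 3. Application functionality: the service answers on the virtual IP
curl -sf -o /dev/null http://cluster-vip/ || { echo "FAIL: application check"; fail=1; }

# 6. Logs: surface recent error messages for manual review
grep -iE "error|crit" /var/log/pacemaker.log | tail -n 5

[ "$fail" -eq 0 ] && echo "PASS: basic post-failover checks"
exit "$fail"
```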
Common Issues and Troubleshooting
Split-Brain Scenarios
Problem: Cluster nodes cannot communicate, leading to multiple nodes believing they're the primary.
Solution:
```bash
# Configure proper quorum policies
pcs property set no-quorum-policy=stop

# Implement STONITH (Shoot The Other Node In The Head)
pcs stonith create fence_node1 fence_ipmilan \
    pcmk_host_list="node1" ipaddr="node1-ipmi" \
    login="admin" passwd="password"
```
Resource Start Failures
Problem: Resources fail to start on failover target nodes.
Troubleshooting Steps:
```bash
# Check resource configuration
pcs resource show resource_name

# Check the resource agent logs
grep resource_name /var/log/pacemaker.log

# Test the resource agent manually
/usr/lib/ocf/resource.d/provider/agent start

# Check resource dependencies
pcs constraint list
```
Slow Failover Times
Problem: Failover takes longer than expected.
Solutions:
```bash
# Shorten failure detection by monitoring more frequently
pcs resource update resource_name op monitor interval=10s timeout=30s

# Tune the fencing (STONITH) timeout
pcs property set stonith-timeout=60s

# Adjust resource start timeouts
pcs resource update resource_name op start timeout=60s
```
Network Communication Issues
Problem: Cluster nodes lose communication intermittently.
Diagnostic Commands:
```bash
# Check network connectivity
ping -c 5 other_node_ip

# Verify cluster network configuration
corosync-cfgtool -s

# Check firewall rules
iptables -L -n

# Monitor cluster network traffic
tcpdump -i eth0 port 5405
```
Storage-Related Failures
Problem: Shared storage becomes unavailable during failover.
Resolution Steps:
```bash
# Check storage connectivity
ls -la /shared/storage/

# Verify mount points
mount | grep shared

# Check storage cluster status (if using clustered storage)
pcs status

# Test storage performance
dd if=/dev/zero of=/shared/storage/test bs=1M count=100
```
Best Practices and Professional Tips
Testing Frequency and Scheduling
- Regular Testing: Conduct basic failover tests monthly (a scheduling sketch follows this list)
- Comprehensive Testing: Perform full failover scenarios quarterly
- Maintenance Windows: Schedule disruptive tests during planned maintenance
- Documentation: Keep detailed records of all test results
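If your basic tests are wrapped in a script (see the automation section below), scheduling them is a one-liner. The script path and email address here are placeholders:
```bash
# Run a non-disruptive failover check at 02:00 on the first day of each month and
# mail the log; /usr/local/bin/failover-test.sh and the address are examples only.
echo '0 2 1 * * root /usr/local/bin/failover-test.sh 2>&1 | mail -s "Monthly failover test" ops@example.com' \
    | sudo tee /etc/cron.d/failover-test
```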
Test Environment Considerations
```bash
# Create isolated test environments using network namespaces
sudo ip netns add test-cluster
sudo ip netns exec test-cluster bash

# Implement proper change control by versioning the cluster configuration
cd /etc/cluster/
git init
git add .
git commit -m "Baseline cluster configuration"
```
Automation and Scripting
Develop automated testing scripts for consistent results:
```bash
#!/bin/bash
# Automated failover test script
LOG_FILE="/var/log/failover-test-$(date +%Y%m%d-%H%M%S).log"
log_message() {
    echo "$(date): $1" | tee -a "$LOG_FILE"
}

test_failover() {
    local node_to_fail=$1
    log_message "Starting failover test for $node_to_fail"

    # Record baseline
    pcs status > /tmp/baseline-status.txt

    # Trigger failover
    pcs node standby "$node_to_fail"

    # Wait until no resources are reported as started on the standby node
    timeout=300
    while [ $timeout -gt 0 ]; do
        if ! pcs status | grep -q "Started.*$node_to_fail"; then
            log_message "Failover completed successfully"
            break
        fi
        sleep 5
        timeout=$((timeout-5))
    done

    # Validate results
    validate_cluster_health

    # Restore node
    pcs node unstandby "$node_to_fail"
}

validate_cluster_health() {
    # Add validation logic here
    log_message "Validating cluster health"
}

# Execute tests
test_failover "node1"
```
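The validate_cluster_health function above is intentionally left as a stub. One possible way to flesh it out, reusing the log_message helper from the same script (the virtual IP URL is a placeholder):
```bash
validate_cluster_health() {
    log_message "Validating cluster health"

    # No nodes should be reported offline after the failover
    if pcs status | grep -qi "offline"; then
        log_message "WARNING: one or more nodes are offline"
    fi

    # The application should still answer on the virtual IP
    if curl -sf -o /dev/null http://cluster-vip/; then
        log_message "Application check passed"
    else
        log_message "WARNING: application check failed"
    fi
}
```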
Security Considerations
- Access Control: Limit who can perform failover tests (see the ACL sketch after this list)
- Audit Logging: Enable comprehensive audit logging
- Network Security: Secure cluster communication channels
- Authentication: Use strong authentication for cluster management
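For access control, pcs can enforce CIB-level ACLs so that most operators get read-only visibility while only a smaller group can move resources or put nodes in standby. This is a rough sketch: the role and user names are examples, and the user must already exist on the cluster nodes (typically as a member of the haclient group).
```bash
# Create a read-only role covering the whole CIB and assign it to an example user
pcs acl role create monitor-only description="Read-only cluster access" read xpath /cib
pcs acl user create observer1 monitor-only

# Turn ACL enforcement on
pcs acl enable
```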
Performance Optimization
```bash
# Optimize cluster timing parameters
pcs property set cluster-recheck-interval=60s

# Set resource defaults for failure handling
pcs resource defaults migration-threshold=3
pcs resource defaults failure-timeout=300s

# Configure resource stickiness
pcs resource meta web-server resource-stickiness=100

# Implement proper resource ordering
pcs constraint order storage-service then web-service
```
Documentation and Reporting
Maintain comprehensive documentation including:
- Test procedures and results
- Configuration changes made during testing
- Performance metrics and trends
- Lessons learned and improvements
- Emergency contact information
Advanced Testing Scenarios
Disaster Recovery Testing
Test complete site failures:
```bash
# Simulate an entire datacenter failure by standing down all local nodes
for node in node1 node2 node3; do
    pcs node standby $node
done

# Verify DR site activation (implementation depends on your DR architecture)
```
Cascading Failure Testing
Test multiple simultaneous failures:
```bash
#!/bin/bash
# Script to simulate cascading failures
simulate_cascade() {
    # Fail primary storage
    systemctl stop storage-service
    sleep 30

    # Fail the primary network interface
    ip link set eth0 down
    sleep 30

    # Fail the backup node
    pcs node standby backup-node

    # Monitor the cluster response
    watch 'pcs status'
}

simulate_cascade
```
Load Testing During Failover
Combine failover testing with load testing:
```bash
# Start the load test in the background
siege -c 50 -t 300s http://cluster-vip/ &

# Trigger failover during the load test
sleep 60
pcs node standby primary-node

# Monitor the performance impact
iostat -x 1 &
vmstat 1 &
```
Conclusion and Next Steps
Proper failover testing is crucial for maintaining high availability in Linux cluster environments. By following the comprehensive procedures outlined in this guide, you can ensure your cluster systems will respond appropriately to various failure scenarios.
Key Takeaways
1. Start Simple: Begin with basic graceful failover tests before attempting complex scenarios
2. Monitor Everything: Comprehensive monitoring during tests provides valuable insights
3. Document Results: Maintain detailed records of all test activities and outcomes
4. Automate When Possible: Develop scripts for consistent and repeatable testing
5. Test Regularly: Regular testing ensures your cluster remains resilient over time
Next Steps
After mastering basic failover testing, consider exploring:
- Advanced cluster architectures like geo-distributed clusters
- Integration with cloud platforms for hybrid failover scenarios
- Container orchestration failover using Kubernetes or similar platforms
- Database-specific clustering solutions like MySQL Group Replication or PostgreSQL Patroni
- Monitoring and alerting integration with tools like Prometheus and Grafana
Continuous Improvement
Failover testing should be an ongoing process that evolves with your infrastructure. Regular review and refinement of your testing procedures will help ensure your Linux clusters continue to provide the high availability your applications require.
Remember that each cluster environment is unique, and testing procedures should be tailored to your specific applications, infrastructure, and business requirements. Start with the fundamentals covered in this guide, then adapt and expand your testing strategy based on your organization's needs and lessons learned from actual testing experiences.
By implementing comprehensive failover testing practices, you'll build confidence in your cluster's reliability and ensure your critical applications remain available when your users need them most.