# How to Mark Disk Failed, Remove, and Re-add in mdadm RAID Arrays

## Table of Contents

- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Understanding mdadm Operations](#understanding-mdadm-operations)
- [Step-by-Step Guide](#step-by-step-guide)
- [Practical Examples](#practical-examples)
- [Common Use Cases](#common-use-cases)
- [Troubleshooting](#troubleshooting)
- [Best Practices](#best-practices)
- [Advanced Scenarios](#advanced-scenarios)
- [Monitoring and Verification](#monitoring-and-verification)
- [Conclusion](#conclusion)

## Introduction

Managing RAID arrays with mdadm (multiple device administration) is a critical skill for Linux system administrators. One of the most important maintenance tasks is marking a disk as failed, removing it from the array, and re-adding it when necessary. This guide walks through the complete process using the mdadm command sequence `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1`.

This operation is commonly performed during disk replacement, in testing scenarios, or when troubleshooting RAID array issues. Understanding it is essential for maintaining data integrity in production environments, because the array runs degraded from the moment the disk is failed until the rebuild completes.
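Because the command sequence above deliberately degrades the array, it can help to gate it behind a quick pre-flight check. The following is a minimal sketch rather than anything mdadm provides: `_md_field` and `md_safe_to_degrade` are hypothetical helper names, and the parsing assumes the usual `Label : value` layout of an `mdadm --detail` report. The report is read from stdin so the check can be exercised without a real array.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of mdadm). Assumption: the report on stdin
# uses the "Label : value" layout printed by `mdadm --detail`.

# Print the value of one "Label : value" line from the report on stdin.
_md_field() {
    awk -F' : ' -v label="$1" '
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == label { val = $2; gsub(/[ \t]+$/, "", val); print val; exit }'
}

# Return 0 only if the array has redundancy, is not degraded or rebuilding,
# and reports zero failed devices.
md_safe_to_degrade() {
    local detail level state failed
    detail=$(cat)                                  # whole report from stdin
    level=$(printf '%s\n' "$detail" | _md_field "Raid Level")
    state=$(printf '%s\n' "$detail" | _md_field "State")
    failed=$(printf '%s\n' "$detail" | _md_field "Failed Devices")

    case "$level" in
        raid1|raid4|raid5|raid6|raid10) ;;         # levels with redundancy
        *) echo "refusing: level '${level:-unknown}' has no redundancy" >&2
           return 1 ;;
    esac
    case "$state" in
        *degraded*|*recover*|*resync*)
           echo "refusing: array state is '$state'" >&2
           return 1 ;;
    esac
    if [ "${failed:-unknown}" != "0" ]; then
        echo "refusing: failed device count is '${failed:-unknown}'" >&2
        return 1
    fi
    return 0
}
```

One possible use is `sudo mdadm --detail /dev/md0 | md_safe_to_degrade && sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1`, so the sequence only runs when the check passes. The check is intentionally conservative: anything it cannot parse is treated as unsafe.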
## Prerequisites

Before proceeding with mdadm disk operations, ensure you have the following:

### System Requirements

- Linux system with mdadm installed
- Root or sudo privileges
- Active RAID array (md device)
- Basic understanding of RAID concepts

### Software Requirements

```bash
# Verify mdadm installation
which mdadm
mdadm --version

# Install mdadm if not present (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install mdadm

# Install mdadm if not present (CentOS/RHEL)
sudo yum install mdadm
# or for newer versions
sudo dnf install mdadm
```

### Safety Considerations

- Always back up critical data before performing disk operations
- Ensure the RAID array has redundancy (not RAID 0)
- Verify array status before making changes
- Have replacement hardware ready if needed

## Understanding mdadm Operations

### The Three-Step Process

The command `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1` performs three distinct operations in sequence:

1. `--fail`: Marks the specified device as failed
2. `--remove`: Removes the failed device from the array
3. `--add`: Adds the device back to the array

### Why This Sequence Matters

This sequence is crucial because:

- mdadm will not hot-remove an active member; the device must be marked as failed (or be a spare) first
- Marking the device failed makes the kernel stop using it, so the array is already running degraded on the remaining members when the device is detached
- Re-adding triggers the rebuild process for data synchronization

### RAID Level Compatibility

This operation works with:

- RAID 1 (Mirror): Fully supported; note that a two-disk mirror has no redundancy until the resync finishes
- RAID 5: Supported; tolerates one failed disk, so no further failure can be absorbed during the rebuild
- RAID 6: Supported; tolerates two failed disks
- RAID 10: Supported; tolerance depends on which disks fail
- RAID 0: Not suitable (no redundancy, so there is nothing to rebuild from)

## Step-by-Step Guide

### Step 1: Check Array Status

Before making any changes, examine the current array status:

```bash
# Check detailed array information
sudo mdadm --detail /dev/md0

# Quick status check
cat /proc/mdstat

# Check all arrays
sudo mdadm --detail --scan
```

Expected output example:

```
/dev/md0:
           Version : 1.2
     Creation Time : Mon Jan 15 10:30:25 2024
        Raid Level : raid1
        Array Size : 1048576 (1024.00 MiB 1073.74 MB)
     Used Dev Size : 1048576 (1024.00 MiB 1073.74 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Jan 15 11:45:30 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

              Name : server1:0
              UUID : 12345678:abcdefgh:ijklmnop:qrstuvwx
            Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
```

### Step 2: Execute the Fail Operation

Mark the target disk as failed:

```bash
sudo mdadm /dev/md0 --fail /dev/sdb1
```

Verify the operation:

```bash
sudo mdadm --detail /dev/md0 | grep -E "State :|Failed Devices|faulty"
```

You should see:

```
             State : clean, degraded
    Failed Devices : 1
       0       8       17        -      faulty   /dev/sdb1
```

### Step 3: Remove the Failed Disk

Remove the failed disk from the array:

```bash
sudo mdadm /dev/md0 --remove /dev/sdb1
```

Confirmation output:

```
mdadm: hot removed /dev/sdb1 from /dev/md0
```

Verify removal:

```bash
sudo mdadm --detail /dev/md0
```

The device should no longer appear in the device list.

### Step 4: Re-add the Disk

Add the disk back to the array:

```bash
sudo mdadm /dev/md0 --add /dev/sdb1
```

Confirmation output:

```
mdadm: added /dev/sdb1
```

### Step 5: Monitor the Rebuild Process

After re-adding, the array will begin rebuilding:

```bash
# Watch the rebuild progress
watch cat /proc/mdstat

# Or check periodically
sudo mdadm --detail /dev/md0
```

During rebuild, you'll see:

```
Personalities : [raid1]
md0 : active raid1 sdb1[2] sdc1[1]
      1048576 blocks super 1.2 [2/1] [_U]
      [==>..................]  recovery = 12.5% (131072/1048576) finish=0.8min speed=131072K/sec
```

## Practical Examples

### Example 1: Complete Single Command Execution

Execute all three operations in one command:

```bash
# Single command execution
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1

# Verify the operation
sudo mdadm --detail /dev/md0
```

### Example 2: Scripted Approach with Verification

Run this script as root:

```bash
#!/bin/bash
ARRAY="/dev/md0"
DEVICE="/dev/sdb1"

echo "Starting disk maintenance on $DEVICE in array $ARRAY"

# Check initial status
echo "Initial array status:"
mdadm --detail "$ARRAY" | grep -E "(State|Active|Failed)"

# Fail the device
echo "Marking $DEVICE as failed..."
mdadm "$ARRAY" --fail "$DEVICE"
sleep 2

# Remove the device
echo "Removing $DEVICE..."
mdadm "$ARRAY" --remove "$DEVICE"
sleep 2

# Re-add the device
echo "Re-adding $DEVICE..."
mdadm "$ARRAY" --add "$DEVICE"

echo "Operation complete. Monitoring rebuild..."
watch -n 5 'cat /proc/mdstat'
```

### Example 3: Multiple Disk Operations

For arrays with multiple disks needing attention, process one disk at a time and let each rebuild finish before touching the next. Failing a second member while the first is still rebuilding can destroy the array.

```bash
# Define array and devices
ARRAY="/dev/md0"
DEVICES=("/dev/sdb1" "/dev/sdc1")

for DEVICE in "${DEVICES[@]}"; do
    echo "Processing $DEVICE..."
    sudo mdadm "$ARRAY" --fail "$DEVICE" --remove "$DEVICE" --add "$DEVICE"
    # Give the recovery a moment to start, then block until it has finished
    sleep 5
    sudo mdadm --wait "$ARRAY"
done
```

## Common Use Cases

### Disk Testing and Validation

Testing disk reliability by simulating failures:

```bash
# Test scenario: simulate disk failure and recovery
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Perform disk diagnostics (read-only scan)
sudo badblocks -v /dev/sdb1

# Re-add if disk passes tests
sudo mdadm /dev/md0 --add /dev/sdb1
```

### Forced Rebuild for Performance

A forced rebuild rewrites the whole member from the remaining disks. It is sometimes used when a member is suspected of holding stale or inconsistent data; it does not normally improve throughput, and the array has reduced redundancy until it completes:

```bash
# Force rebuild of a specific disk
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1

# Monitor rebuild performance
iostat -x 1 | grep -E "(Device|md0|sdb1)"
```

### Disk Replacement Preparation

Preparing for physical disk replacement:

```bash
# Step 1: Mark as failed and remove
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Step 2: Physical replacement (power down if necessary)
# Replace physical disk

# Step 3: Partition new disk to match (copy the layout of healthy /dev/sdc to new /dev/sdb)
sudo sfdisk -d /dev/sdc | sudo sfdisk /dev/sdb

# Step 4: Add new disk to array
sudo mdadm /dev/md0 --add /dev/sdb1
```

## Troubleshooting

### Common Error Messages and Solutions

#### Error: "Device or resource busy"

This usually means the member is still active in the array. The md driver itself holds the device, so `lsof` and `fuser` often show nothing.

```bash
# Error message
# mdadm: hot remove failed for /dev/sdb1: Device or resource busy

# Solution: Check what's using the device
lsof | grep sdb1
fuser -v /dev/sdb1

# Mark the member as failed first, then remove it
sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1
```

#### Error: "No such device"

```bash
# Error message
# mdadm: cannot find /dev/sdb1: No such device

# Solution: Verify device exists and check naming
ls -la /dev/sd*
lsblk

# Use correct device path (here the member is the whole disk, not a partition)
sudo mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb --add /dev/sdb
```

#### Error: "Device already exists in array"

```bash
# Error message
# mdadm: /dev/sdb1 already exists in array

# Solution: Check current array status
sudo mdadm --detail /dev/md0

# If the device is listed as faulty or spare, remove it first, then re-add
sudo mdadm /dev/md0 --remove /dev/sdb1
sudo mdadm /dev/md0 --add /dev/sdb1
```

### Rebuild Issues

#### Slow Rebuild Performance

```bash
# Check current rebuild speed limits (KiB/s per device)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# Increase rebuild speed (temporarily, until reboot)
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 200000 | sudo tee /proc/sys/dev/raid/speed_limit_max
```

#### Rebuild Stuck or Failed

```bash
# Check for errors
dmesg | grep -i raid
journalctl -u mdmonitor

# Force rebuild restart (unmount any filesystem on the array before stopping it)
sudo mdadm --stop /dev/md0
sudo mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
```

### Array Won't Start After Operations

```bash
# Examine array components
sudo mdadm --examine /dev/sdb1
sudo mdadm --examine /dev/sdc1

# Force assembly if metadata is intact
sudo mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1

# Scan and assemble all arrays
sudo mdadm --assemble --scan
```

## Best Practices

### Pre-Operation Checks

1. Verify Array Health

   ```bash
   # Comprehensive health check
   sudo mdadm --detail /dev/md0
   sudo mdadm --examine /dev/sdb1
   cat /proc/mdstat
   ```

2. Check System Load

   ```bash
   # Ensure system isn't under heavy load
   uptime
   iostat 1 5
   ```

3. Backup Critical Data

   ```bash
   # Create backup before operations
   rsync -av /mnt/raid/ /backup/location/
   ```

### During Operations

1. Monitor Progress Actively

   ```bash
   # Real-time monitoring
   sudo watch -n 2 'cat /proc/mdstat; echo "---"; mdadm --detail /dev/md0 | grep -E "(State :|Rebuild Status)"'
   ```

2. Log Operations

   ```bash
   # Log all operations (mdadm reports on stderr, so capture both streams)
   sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1 2>&1 | sudo tee -a /var/log/mdadm-operations.log
   ```

### Post-Operation Verification

1. Verify Array Integrity

   ```bash
   # Start a consistency check (run after the rebuild has finished)
   echo check | sudo tee /sys/block/md0/md/sync_action
   # Read the mismatch count once the check has completed
   cat /sys/block/md0/md/mismatch_cnt
   ```

2. Update Configuration

   ```bash
   # Update mdadm configuration (remove stale ARRAY lines first to avoid duplicates)
   sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
   # Debian/Ubuntu: rebuild the initramfs so it picks up the new configuration
   sudo update-initramfs -u
   ```

### Performance Optimization

1. Optimize Rebuild Speed

   ```bash
   # Set appropriate rebuild speeds
   echo 10000 | sudo tee /proc/sys/dev/raid/speed_limit_min
   echo 100000 | sudo tee /proc/sys/dev/raid/speed_limit_max
   ```

2. Schedule Operations During Low Usage

   ```bash
   # Use cron for scheduled maintenance (example: 02:00 every Sunday)
   # 0 2 * * 0 /usr/local/bin/raid-maintenance.sh
   ```

## Advanced Scenarios

### Hot Spare Configuration

Configure automatic failover with hot spares:

```bash
# Add hot spare
sudo mdadm /dev/md0 --add /dev/sdd1

# Verify spare is available
sudo mdadm --detail /dev/md0 | grep spare

# Test automatic failover
sudo mdadm /dev/md0 --fail /dev/sdb1
# Hot spare should automatically activate
```

### Multi-Array Operations

Managing multiple arrays one after another (run as root):

```bash
#!/bin/bash
ARRAYS=("/dev/md0" "/dev/md1" "/dev/md2")
DEVICES=("/dev/sdb1" "/dev/sdc1" "/dev/sdd1")

for i in "${!ARRAYS[@]}"; do
    ARRAY="${ARRAYS[$i]}"
    DEVICE="${DEVICES[$i]}"
    echo "Processing $ARRAY with device $DEVICE"
    mdadm "$ARRAY" --fail "$DEVICE" --remove "$DEVICE" --add "$DEVICE"
    # Wait between arrays to manage system load
    sleep 30
done
```

### Integration with Monitoring Systems

Set up automated monitoring and alerting:

```bash
# Create monitoring script
sudo tee /usr/local/bin/mdadm-monitor.sh > /dev/null << 'EOF'
#!/bin/bash
# Array names are the first field of /proc/mdstat lines starting with "md"
ARRAYS=$(awk '/^md/ {print $1}' /proc/mdstat)
for ARRAY in $ARRAYS; do
    # Full state string, e.g. "clean", "active" or "clean, degraded"
    STATUS=$(mdadm --detail "/dev/$ARRAY" | awk '/^ *State :/ {sub(/^ *State : */, ""); sub(/ *$/, ""); print}')
    if [[ "$STATUS" != "clean" && "$STATUS" != "active" ]]; then
        echo "ALERT: Array /dev/$ARRAY status is $STATUS"
        # Send notification (email, slack, etc.)
    fi
done
EOF
sudo chmod +x /usr/local/bin/mdadm-monitor.sh

# Add to root's crontab for regular monitoring (appends instead of replacing existing entries)
(sudo crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/mdadm-monitor.sh") | sudo crontab -
```

## Monitoring and Verification

### Real-Time Monitoring Commands

```bash
# Continuous monitoring during operations
sudo watch -n 1 'echo "=== /proc/mdstat ==="; cat /proc/mdstat; echo; echo "=== Array Details ==="; mdadm --detail /dev/md0 | grep -E "(State|Recovery|Rebuild)"'

# Monitor I/O performance
iostat -x 1 | grep -E "(Device|md0|sd[bc]1)"

# Check system resources (htop expects a comma-separated PID list)
htop -p "$(pgrep -d, md)"
```

### Logging and Alerting

```bash
# Enable mdadm monitoring daemon
sudo systemctl enable mdmonitor
sudo systemctl start mdmonitor

# Configure email alerts in /etc/mdadm/mdadm.conf
echo "MAILADDR admin@example.com" | sudo tee -a /etc/mdadm/mdadm.conf

# Test alert system (--oneshot sends the test alerts and exits instead of running forever)
sudo mdadm --monitor --scan --test --oneshot
```

### Health Check Scripts

Create comprehensive health check scripts (run as root):

```bash
#!/bin/bash
# mdadm-health-check.sh

echo "=== MDADM Health Check Report ==="
echo "Date: $(date)"
echo

echo "=== Array Status ==="
cat /proc/mdstat
echo

echo "=== Detailed Array Information ==="
for array in $(ls /dev/md* 2>/dev/null | grep -E 'md[0-9]+$'); do
    echo "--- $array ---"
    mdadm --detail "$array" | grep -E "(State|Active|Working|Failed|Spare)"
    echo
done

echo "=== Recent RAID Events ==="
journalctl -u mdmonitor --since "24 hours ago" --no-pager | tail -20

echo "=== Disk Health ==="
for disk in $(lsblk -d -o NAME | grep -E '^sd[a-z]$'); do
    echo "--- /dev/$disk ---"
    smartctl -H "/dev/$disk" 2>/dev/null | grep -E "(SMART|overall-health)" || echo "SMART not available"
done

echo "=== End Report ==="
```

## Conclusion

Successfully managing mdadm RAID arrays requires understanding the fail, remove, and add sequence. The command `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1` is a powerful tool for disk maintenance, testing, and troubleshooting scenarios.

### Key Takeaways

1. Always verify array status before and after operations
2. Monitor rebuild progress actively to catch issues early
3. Maintain backups as a safety net during maintenance
4. Use scripting for consistent and repeatable operations
5. Implement monitoring for proactive array management

### Next Steps

After mastering these basic operations, consider:

- Implementing automated monitoring and alerting systems
- Learning advanced mdadm features like bitmap management
- Exploring integration with configuration management tools
- Setting up comprehensive backup and disaster recovery procedures
- Studying performance tuning for specific workloads

### Additional Resources

- mdadm man page: `man mdadm`
- Linux RAID Wiki: Comprehensive documentation and examples
- System logs: `/var/log/syslog`, `journalctl -u mdmonitor`
- Package documentation: `/usr/share/doc/mdadm/`

Remember that RAID is not a substitute for proper backups, and regular testing of both RAID functionality and backup procedures is essential for maintaining data integrity in production environments.
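As a companion to the takeaway about monitoring rebuild progress, the sketch below pulls the recovery percentage for a single array out of `/proc/mdstat`-style text, which may be handy in scripts that need a number rather than a `watch` screen. `mdstat_progress` is a hypothetical helper name, and the parsing assumes the progress line has the `recovery = NN.N%` (or `resync = NN.N%`) form shown in Step 5. An optional file argument is accepted so the function can be tried against saved output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of mdadm): print the recovery/resync percentage
# of one array. Assumption: input follows the /proc/mdstat layout, where each
# array stanza starts with an unindented "mdN : ..." line.
#
# Usage: mdstat_progress md0 [file]     (file defaults to /proc/mdstat)
# Returns non-zero when no recovery or resync is reported for that array.
mdstat_progress() {
    local array="$1" file="${2:-/proc/mdstat}"
    awk -v arr="$array" '
        # Header line of the array we are interested in
        $1 == arr && $2 == ":" { in_arr = 1; next }
        # Any other unindented line starts a different stanza
        in_arr && /^[^ \t]/ { in_arr = 0 }
        # Progress line inside our stanza: keep only the number
        in_arr && match($0, /(recovery|resync) *= *[0-9.]+%/) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^.*= */, "", s)
            sub(/%$/, "", s)
            print s
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$file"
}
```

Given the sample rebuild output from Step 5, `mdstat_progress md0` would print `12.5`. A loop such as `while p=$(mdstat_progress md0); do echo "rebuild at ${p}%"; sleep 30; done` stops on its own once the array no longer reports a recovery in progress.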