How to Mark Disk Failed, Remove, and Re-add in mdadm RAID Arrays
Table of Contents
- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Understanding mdadm Operations](#understanding-mdadm-operations)
- [Step-by-Step Guide](#step-by-step-guide)
- [Practical Examples](#practical-examples)
- [Common Use Cases](#common-use-cases)
- [Troubleshooting](#troubleshooting)
- [Best Practices](#best-practices)
- [Advanced Scenarios](#advanced-scenarios)
- [Monitoring and Verification](#monitoring-and-verification)
- [Conclusion](#conclusion)
Introduction
Managing RAID arrays with mdadm (Multiple Device Administration) is a critical skill for Linux system administrators. One of the most important maintenance tasks involves marking disks as failed, removing them from the array, and re-adding them when necessary. This comprehensive guide will walk you through the complete process of using the mdadm command sequence: `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1`.
This operation is commonly performed during disk replacement, testing scenarios, or when troubleshooting RAID array issues. Understanding these operations is essential for maintaining data integrity and ensuring optimal RAID performance in production environments.
Prerequisites
Before proceeding with mdadm disk operations, ensure you have the following:
System Requirements
- Linux system with mdadm installed
- Root or sudo privileges
- Active RAID array (md device)
- Basic understanding of RAID concepts
Software Requirements
```bash
# Verify mdadm installation
which mdadm
mdadm --version

# Install mdadm if not present (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install mdadm

# Install mdadm if not present (CentOS/RHEL)
sudo yum install mdadm
# or, on newer releases
sudo dnf install mdadm
```
Safety Considerations
- Always backup critical data before performing disk operations
- Ensure the RAID array has redundancy (not RAID 0)
- Verify array status before making changes
- Have replacement hardware ready if needed
Understanding mdadm Operations
The Three-Step Process
The command `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1` performs three distinct operations in sequence:
1. --fail: Marks the specified device as failed
2. --remove: Removes the failed device from the array
3. --add: Adds the device back to the array
Why This Sequence Matters
This sequence is crucial because:
- mdadm won't remove a functioning disk without marking it as failed first
- The system needs to recognize the disk as problematic before removal
- Re-adding triggers the rebuild process for data synchronization
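Because removal only succeeds for a member that is no longer active, it can help to confirm a device's state before each step. The following is a minimal sketch, not part of mdadm itself: `member_state` is a hypothetical helper name, and it assumes the device table layout shown by `mdadm --detail` (number, major, minor, RAID slot, state words, device path).

```bash
# member_state DEVICE
# Hypothetical helper: reads `mdadm --detail` output on stdin and prints the
# state words (e.g. "active sync" or "faulty") listed for DEVICE.
member_state() {
    awk -v dev="$1" '
        # Device rows start with a number (or "-") and end with the device path
        $NF == dev && $1 ~ /^[0-9-]+$/ {
            state = $5
            for (i = 6; i < NF; i++) state = state " " $i
            print state
        }'
}
```

A typical call would be `sudo mdadm --detail /dev/md0 | member_state /dev/sdb1`; a result of `faulty` suggests the member can be removed, while `active sync` means it still has to be failed first.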
RAID Level Compatibility
This operation works with:
- RAID 1 (Mirror): Fully supported; the array keeps running on the remaining mirror, but it has no redundancy until the rebuild completes
- RAID 5: Supported with one disk failure tolerance
- RAID 6: Supported with two disk failure tolerance
- RAID 10: Supported depending on which disks fail
- RAID 0: Not recommended (no redundancy)
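The tolerance figures in the list above can be encoded as a small lookup to use as a guard in maintenance scripts. This is an illustrative sketch only: `tolerated_failures` is a hypothetical function name, and the RAID 10 value assumes the default two-copy layout, where one failure is always survivable and more may be, depending on which disks fail.

```bash
# tolerated_failures LEVEL NDEVICES
# Hypothetical helper: prints how many member failures the given RAID level
# is guaranteed to survive with NDEVICES active members.
tolerated_failures() {
    local level=$1 n=$2
    case "$level" in
        raid0)       echo 0 ;;            # no redundancy
        raid1)       echo $(( n - 1 )) ;; # every member holds a full copy
        raid5)       echo 1 ;;
        raid6)       echo 2 ;;
        raid10)      echo 1 ;;            # guaranteed minimum, assuming 2 copies
        *) echo "unknown level: $level" >&2; return 1 ;;
    esac
}
```

For example, `tolerated_failures raid1 2` prints `1`, so a script could refuse to fail a member when the array's failed-device count already equals that number.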
Step-by-Step Guide
Step 1: Check Array Status
Before making any changes, examine the current array status:
```bash
# Check detailed array information
sudo mdadm --detail /dev/md0

# Quick status check
cat /proc/mdstat

# Check all arrays
sudo mdadm --detail --scan
```
Expected output example:
```
/dev/md0:
Version : 1.2
Creation Time : Mon Jan 15 10:30:25 2024
Raid Level : raid1
Array Size : 1048576 (1024.00 MiB 1073.74 MB)
Used Dev Size : 1048576 (1024.00 MiB 1073.74 MB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Jan 15 11:45:30 2024
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : server1:0
UUID : 12345678:abcdefgh:ijklmnop:qrstuvwx
Events : 17
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
```
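Before failing a member on purpose, it is worth confirming that the array is not already degraded. The sketch below is one way to check this from `/proc/mdstat`-style text; `array_degraded` is a hypothetical helper name, and it relies on the `[configured/working]` counter (for example `[2/1]`) that mdstat prints for redundant arrays.

```bash
# array_degraded ARRAY
# Hypothetical helper: reads /proc/mdstat-style text on stdin and exits 0 if
# ARRAY (e.g. "md0") has fewer working members than configured, else exits 1.
# An array that is not found is also reported as "not degraded" (exit 1).
array_degraded() {
    awk -v md="$1" '
        /^md/ { current = $1 }
        current == md && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "configured/working"
            split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")
            found = 1
            degraded = (c[2] + 0 < c[1] + 0)
            exit
        }
        END { exit !(found && degraded) }'
}
```

A guard such as `if array_degraded md0 < /proc/mdstat; then echo "md0 is already degraded; aborting"; fi` could then precede any `--fail` call.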
Step 2: Execute the Fail Operation
Mark the target disk as failed:
```bash
sudo mdadm /dev/md0 --fail /dev/sdb1
```
Verify the operation:
```bash
sudo mdadm --detail /dev/md0 | grep -E "(State :|Failed Devices|faulty)"
```
You should see:
```
State : clean, degraded
Failed Devices : 1
0 8 17 - faulty /dev/sdb1
```
Step 3: Remove the Failed Disk
Remove the failed disk from the array:
```bash
sudo mdadm /dev/md0 --remove /dev/sdb1
```
Confirmation output:
```
mdadm: hot removed /dev/sdb1 from /dev/md0
```
Verify removal:
```bash
sudo mdadm --detail /dev/md0
```
The device should no longer appear in the active devices list.
Step 4: Re-add the Disk
Add the disk back to the array:
```bash
sudo mdadm /dev/md0 --add /dev/sdb1
```
Confirmation output:
```
mdadm: added /dev/sdb1
```
Step 5: Monitor the Rebuild Process
After re-adding, the array will begin rebuilding:
```bash
# Watch the rebuild progress
watch cat /proc/mdstat

# Or check periodically
sudo mdadm --detail /dev/md0
```
During rebuild, you'll see:
```
Personalities : [raid1]
md0 : active raid1 sdb1[2] sdc1[1]
1048576 blocks super 1.2 [2/1] [_U]
[==>..................] recovery = 12.5% (131072/1048576) finish=0.8min speed=131072K/sec
```
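For scripted monitoring, the progress figure can be pulled out of that output instead of being read by eye. The following is a rough sketch under the assumption that the progress line has the `recovery = NN.N%` (or `resync = NN.N%`) form shown above; `recovery_percent` is a hypothetical helper name.

```bash
# recovery_percent ARRAY
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints the
# recovery/resync percentage for ARRAY, or nothing if no sync is running.
recovery_percent() {
    awk -v md="$1" '
        /^md/ { current = $1 }
        current == md && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) {
                    gsub(/^=|%$/, "", $i)   # drop a fused "=" and the trailing "%"
                    print $i
                    exit
                }
        }'
}
```

Called as `recovery_percent md0 < /proc/mdstat`, it prints only the number, which makes it easy to log or compare in a loop.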
Practical Examples
Example 1: Complete Single Command Execution
Execute all three operations in one command:
```bash
# Single command execution
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1

# Verify the operation
sudo mdadm --detail /dev/md0
```
Example 2: Scripted Approach with Verification
```bash
#!/bin/bash
# Fail, remove, and re-add one member. Run as root.
# Stop at the first command that fails.
set -euo pipefail

ARRAY="/dev/md0"
DEVICE="/dev/sdb1"

echo "Starting disk maintenance on $DEVICE in array $ARRAY"

# Check initial status
echo "Initial array status:"
mdadm --detail "$ARRAY" | grep -E "(State|Active|Failed)"

# Fail the device
echo "Marking $DEVICE as failed..."
mdadm "$ARRAY" --fail "$DEVICE"
sleep 2

# Remove the device
echo "Removing $DEVICE..."
mdadm "$ARRAY" --remove "$DEVICE"
sleep 2

# Re-add the device
echo "Re-adding $DEVICE..."
mdadm "$ARRAY" --add "$DEVICE"

echo "Operation complete. Monitoring rebuild..."
watch -n 5 'cat /proc/mdstat'
```
Example 3: Multiple Disk Operations
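For arrays with multiple disks needing attention, process one disk at a time and let each rebuild finish before touching the next. Failing a second member while the first is still rebuilding can take the whole array offline:
```bash
# Define array and devices
ARRAY="/dev/md0"
DEVICES=("/dev/sdb1" "/dev/sdc1")

for DEVICE in "${DEVICES[@]}"; do
    echo "Processing $DEVICE..."
    sudo mdadm "$ARRAY" --fail "$DEVICE" --remove "$DEVICE" --add "$DEVICE"
    # Give the kernel a moment to start recovery, then block until it finishes
    sleep 5
    sudo mdadm --wait "$ARRAY"
done
```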
Common Use Cases
Disk Testing and Validation
Testing disk reliability by simulating failures:
```bash
# Test scenario: simulate disk failure and recovery
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Perform disk diagnostics (badblocks runs a read-only scan by default)
sudo badblocks -v /dev/sdb1

# Re-add if disk passes tests
sudo mdadm /dev/md0 --add /dev/sdb1
```
Forced Rebuild for Performance
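A forced rebuild rewrites an entire member from the other devices. It is occasionally used to resynchronize a member suspected of holding stale data; it is not a general performance fix, and the array runs degraded until the rebuild finishes:
```bash
# Force rebuild of a specific disk
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1

# Monitor rebuild performance
iostat -x 1 | grep -E "(Device|md0|sdb)"
```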
Disk Replacement Preparation
Preparing for physical disk replacement:
```bash
# Step 1: Mark as failed and remove
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Step 2: Physical replacement (power down if necessary)
# Replace physical disk

# Step 3: Partition new disk to match (copies the table from /dev/sdc to /dev/sdb)
sudo sfdisk -d /dev/sdc | sudo sfdisk /dev/sdb

# Step 4: Add new disk to array
sudo mdadm /dev/md0 --add /dev/sdb1
```
Troubleshooting
Common Error Messages and Solutions
Error: "Device or resource busy"
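```bash
# Error message:
#   mdadm: cannot remove /dev/sdb1: Device or resource busy

# Usual cause: the member is still active in the array. Check its state:
sudo mdadm --detail /dev/md0 | grep sdb1

# Also check whether a userspace process holds the device
lsof | grep sdb1
fuser -v /dev/sdb1

# Mark the member as failed first, then remove it
sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1
```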
Error: "No such device"
```bash
# Error message:
#   mdadm: cannot find /dev/sdb1: No such device

# Solution: verify the device exists and check naming
ls -la /dev/sd*
lsblk

# Use the path the array actually lists, e.g. when the member is the whole disk
sudo mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb --add /dev/sdb
```
Error: "Device already exists in array"
```bash
# Error message:
#   mdadm: /dev/sdb1 already exists in array

# Solution: check current array status
sudo mdadm --detail /dev/md0

# Remove first (fail the member beforehand if it is still active), then re-add
sudo mdadm /dev/md0 --remove /dev/sdb1
sudo mdadm /dev/md0 --add /dev/sdb1
```
Rebuild Issues
Slow Rebuild Performance
```bash
# Check current rebuild speed limits (KiB/s per device)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# Increase rebuild speed (temporarily; resets at reboot)
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 200000 | sudo tee /proc/sys/dev/raid/speed_limit_max
```
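To judge whether raising the limits is worthwhile, a back-of-the-envelope estimate of the remaining rebuild time can help. The sketch below is plain arithmetic, not something mdadm provides: `rebuild_eta_seconds` is a hypothetical helper name, and it assumes the speed stays constant, which real rebuilds rarely do.

```bash
# rebuild_eta_seconds TOTAL_KIB DONE_KIB SPEED_KIB_PER_SEC
# Hypothetical helper: prints the estimated seconds left, rounded up,
# assuming a constant rebuild speed.
rebuild_eta_seconds() {
    local total=$1 done_kib=$2 speed=$3
    if [ "$speed" -le 0 ]; then
        echo "speed must be positive" >&2
        return 1
    fi
    # Ceiling division of the remaining KiB by the speed
    echo $(( (total - done_kib + speed - 1) / speed ))
}
```

For instance, `rebuild_eta_seconds 1048576 131072 131072` prints `7`: 917504 KiB remain, and at 131072 KiB/s that is exactly seven seconds. The three inputs correspond to the block counts and the `speed=` value that `/proc/mdstat` reports during recovery.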
Rebuild Stuck or Failed
```bash
# Check for errors
dmesg | grep -i raid
journalctl -u mdmonitor

# Restart the array to retry the rebuild (unmount its filesystems first)
sudo mdadm --stop /dev/md0
sudo mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
```
Array Won't Start After Operations
```bash
# Examine array components
sudo mdadm --examine /dev/sdb1
sudo mdadm --examine /dev/sdc1

# Force assembly if metadata is intact
sudo mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1

# Scan and assemble all arrays
sudo mdadm --assemble --scan
```
Best Practices
Pre-Operation Checks
1. Verify Array Health
```bash
# Comprehensive health check
sudo mdadm --detail /dev/md0
sudo mdadm --examine /dev/sdb1
cat /proc/mdstat
```
2. Check System Load
```bash
# Ensure system isn't under heavy load
uptime
iostat 1 5
```
3. Backup Critical Data
```bash
# Create backup before operations
rsync -av /mnt/raid/ /backup/location/
```
During Operations
1. Monitor Progress Actively
```bash
# Real-time monitoring
watch -n 2 'cat /proc/mdstat; echo "---"; mdadm --detail /dev/md0 | grep -E "(State|Recovery)"'
```
2. Log Operations
```bash
# Log all operations (mdadm reports on stderr, so capture both streams)
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1 2>&1 | sudo tee -a /var/log/mdadm-operations.log
```
Post-Operation Verification
1. Verify Array Integrity
```bash
# Check array consistency (writing to sysfs requires root)
echo check | sudo tee /sys/block/md0/md/sync_action
# Read the mismatch count once the check has finished
cat /sys/block/md0/md/mismatch_cnt
```
2. Update Configuration
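```bash
# Update mdadm configuration (Debian/Ubuntu paths). This appends, so remove
# any stale ARRAY line for the same array from the file first.
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u
```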
Performance Optimization
1. Optimize Rebuild Speed
```bash
# Set appropriate rebuild speeds (KiB/s per device; requires root)
echo 10000 | sudo tee /proc/sys/dev/raid/speed_limit_min
echo 100000 | sudo tee /proc/sys/dev/raid/speed_limit_max
```
2. Schedule Operations During Low Usage
```bash
# Use cron for scheduled maintenance (02:00 every Sunday)
# 0 2 * * 0 /usr/local/bin/raid-maintenance.sh
```
Advanced Scenarios
Hot Spare Configuration
Configure automatic failover with hot spares:
```bash
# Add hot spare
sudo mdadm /dev/md0 --add /dev/sdd1

# Verify spare is available
sudo mdadm --detail /dev/md0 | grep spare

# Test automatic failover
sudo mdadm /dev/md0 --fail /dev/sdb1
# Hot spare should automatically activate
```
Multi-Array Operations
Managing multiple arrays simultaneously:
```bash
#!/bin/bash
# Pair each array with the member to cycle (same index in both lists)
ARRAYS=("/dev/md0" "/dev/md1" "/dev/md2")
DEVICES=("/dev/sdb1" "/dev/sdc1" "/dev/sdd1")

for i in "${!ARRAYS[@]}"; do
    ARRAY="${ARRAYS[$i]}"
    DEVICE="${DEVICES[$i]}"
    echo "Processing $ARRAY with device $DEVICE"
    mdadm "$ARRAY" --fail "$DEVICE" --remove "$DEVICE" --add "$DEVICE"
    # Wait between arrays to manage system load
    sleep 30
done
```
Integration with Monitoring Systems
Set up automated monitoring and alerting:
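```bash
# Create monitoring script (writing to /usr/local/bin requires root)
sudo tee /usr/local/bin/mdadm-monitor.sh > /dev/null << 'EOF'
#!/bin/bash
ARRAYS=$(grep "^md" /proc/mdstat | cut -d' ' -f1)
for ARRAY in $ARRAYS; do
    # Full state string, e.g. "clean" or "clean, degraded"
    STATUS=$(mdadm --detail "/dev/$ARRAY" | awk -F' : ' '/State :/ {print $2}' | sed 's/ *$//')
    # "active" is a normal state while the array is being written to
    if [[ "$STATUS" != "clean" && "$STATUS" != "active" ]]; then
        echo "ALERT: Array /dev/$ARRAY status is $STATUS"
        # Send notification (email, slack, etc.)
    fi
done
EOF
sudo chmod +x /usr/local/bin/mdadm-monitor.sh

# Add to root's crontab (every 5 minutes) without overwriting existing entries
(sudo crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/mdadm-monitor.sh") | sudo crontab -
```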
Monitoring and Verification
Real-Time Monitoring Commands
```bash
# Continuous monitoring during operations
watch -n 1 'echo "=== /proc/mdstat ==="; cat /proc/mdstat; echo; echo "=== Array Details ==="; mdadm --detail /dev/md0 | grep -E "(State|Recovery|Rebuild)"'

# Monitor I/O performance
iostat -x 1 | grep -E "(Device|md0|sd[bc])"

# Check system resources used by the md kernel threads (e.g. md0_raid1);
# htop expects a comma-separated PID list
htop -p "$(pgrep -d, '^md[0-9]')"
```
Logging and Alerting
```bash
# Enable mdadm monitoring daemon
sudo systemctl enable mdmonitor
sudo systemctl start mdmonitor

# Configure email alerts in /etc/mdadm/mdadm.conf
echo "MAILADDR admin@example.com" | sudo tee -a /etc/mdadm/mdadm.conf

# Test alert system (sends one test alert, then exits instead of running forever)
sudo mdadm --monitor --oneshot --test /dev/md0
```
Health Check Scripts
Create comprehensive health check scripts:
```bash
#!/bin/bash
# mdadm-health-check.sh
echo "=== MDADM Health Check Report ==="
echo "Date: $(date)"
echo

echo "=== Array Status ==="
cat /proc/mdstat
echo

echo "=== Detailed Array Information ==="
for array in $(ls /dev/md* 2>/dev/null | grep -E 'md[0-9]+$'); do
    echo "--- $array ---"
    mdadm --detail "$array" | grep -E "(State|Active|Working|Failed|Spare)"
    echo
done

echo "=== Recent RAID Events ==="
journalctl -u mdmonitor --since "24 hours ago" --no-pager | tail -20

echo "=== Disk Health ==="
for disk in $(lsblk -d -o NAME | grep -E '^sd[a-z]$'); do
    echo "--- /dev/$disk ---"
    smartctl -H "/dev/$disk" 2>/dev/null | grep -E "(SMART|overall-health)" || echo "SMART not available"
done

echo "=== End Report ==="
```
Conclusion
Successfully managing mdadm RAID arrays requires understanding the fail, remove, and add operations sequence. The command `mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 --add /dev/sdb1` is a powerful tool for disk maintenance, testing, and troubleshooting scenarios.
Key Takeaways
1. Always verify array status before and after operations
2. Monitor rebuild progress actively to catch issues early
3. Maintain backups as a safety net during maintenance
4. Use scripting for consistent and repeatable operations
5. Implement monitoring for proactive array management
Next Steps
After mastering these basic operations, consider:
- Implementing automated monitoring and alerting systems
- Learning advanced mdadm features like bitmap management
- Exploring integration with configuration management tools
- Setting up comprehensive backup and disaster recovery procedures
- Studying performance tuning for specific workloads
Additional Resources
- mdadm man page: `man mdadm`
- Linux RAID Wiki: Comprehensive documentation and examples
- System logs: `/var/log/syslog`, `journalctl -u mdmonitor`
- Packaged mdadm documentation: `/usr/share/doc/mdadm/`
Remember that RAID is not a substitute for proper backups, and regular testing of both RAID functionality and backup procedures is essential for maintaining data integrity in production environments.