How to Check Disk Health in Linux
Maintaining disk health is crucial for system reliability and data protection. Linux provides numerous built-in tools and utilities to monitor, analyze, and diagnose storage device health. This comprehensive guide will walk you through various methods to check disk health, from basic system commands to advanced SMART monitoring tools.
Why Disk Health Monitoring Matters
Storage devices wear out: HDDs rely on moving mechanical parts, and SSDs use flash cells that tolerate a limited number of write cycles. Regular disk health monitoring helps you:
- Prevent data loss by identifying failing drives early
- Optimize system performance by detecting slow or problematic sectors
- Plan hardware replacements before critical failures occur
- Maintain system uptime through proactive maintenance
Table of Contents
1. [Prerequisites and Preparation](#prerequisites-and-preparation)
2. [Using SMART Tools for Health Monitoring](#using-smart-tools-for-health-monitoring)
3. [File System Checking with fsck](#file-system-checking-with-fsck)
4. [Bad Block Detection with badblocks](#bad-block-detection-with-badblocks)
5. [System Log Analysis](#system-log-analysis)
6. [GUI Tools for Disk Health](#gui-tools-for-disk-health)
7. [Creating Health Monitoring Scripts](#creating-health-monitoring-scripts)
8. [Troubleshooting Common Issues](#troubleshooting-common-issues)
9. [Best Practices for Disk Health Monitoring](#best-practices-for-disk-health-monitoring)
10. [Performance Impact Considerations](#performance-impact-considerations)
11. [Enterprise vs Desktop Monitoring](#enterprise-vs-desktop-monitoring)
12. [Conclusion](#conclusion)
Prerequisites and Preparation
Before checking disk health, ensure you have:
- Root or sudo privileges for most diagnostic commands
- Knowledge of your disk layout using `lsblk` or `fdisk -l`
- Backup of critical data before running invasive tests
Identifying Your Disks
First, list all available storage devices:
```bash
# List block devices
lsblk
# Show detailed disk information
sudo fdisk -l
# Display disk usage
df -h
```
Example output:
```
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 500G 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
└─sda2 8:2 0 499G 0 part /
```
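When several drives are listed, it helps to know which ones are rotational HDDs and which are SSDs, because later checks treat them differently. The following is a minimal sketch, assuming `lsblk` provides the `ROTA` and `TYPE` columns; the function name `classify_disks` is illustrative, and the rotational flag can be misreported behind USB bridges or RAID controllers.

```bash
#!/bin/bash
# classify_disks: read "NAME ROTA TYPE" rows on stdin and label each whole disk.
# ROTA=1 means the kernel reports a rotational device; 0 means non-rotational.
classify_disks() {
    awk '$3 == "disk" {
        kind = ($2 == 1) ? "HDD (rotational)" : "SSD/NVMe (non-rotational)"
        printf "/dev/%s: %s\n", $1, kind
    }'
}

# On a live system (hypothetical invocation):
#   lsblk -d -n -o NAME,ROTA,TYPE | classify_disks

# Demonstration with sample rows instead of real hardware:
classify_disks <<'EOF'
sda     1 disk
nvme0n1 0 disk
sr0     1 rom
EOF
```

Rows whose type is not `disk` (such as the optical drive in the sample) are skipped.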
Using SMART Tools for Health Monitoring
Self-Monitoring, Analysis, and Reporting Technology (SMART) is the most comprehensive method for checking disk health in Linux. SMART provides detailed information about drive performance, error rates, and predictive failure indicators.
Installing smartmontools
Install the smartmontools package on your distribution:
```bash
# Ubuntu/Debian
sudo apt update && sudo apt install smartmontools
# CentOS/RHEL/Fedora
sudo dnf install smartmontools
# or for older versions
sudo yum install smartmontools
# Arch Linux
sudo pacman -S smartmontools
```
Basic SMART Commands
Checking SMART Capability
Verify if your drive supports SMART:
```bash
sudo smartctl -i /dev/sda
```
Example output:
```
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue
Device Model: WDC WD5000AZRX-00A8LB0
Serial Number: WD-WCC2E0123456
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
```
Enabling SMART
Enable SMART monitoring if it's not already active:
```bash
sudo smartctl -s on /dev/sda
```
Quick Health Check
Perform a quick overall health assessment:
```bash
sudo smartctl -H /dev/sda
```
Expected output for a healthy drive:
```
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
```
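Scripts do not have to parse the text output: `smartctl` also reports problems through its exit status, which is a bit mask. The sketch below decodes that mask using the bit meanings documented in the smartctl manual page; the function name `explain_smartctl_status` and the shortened descriptions are illustrative.

```bash
#!/bin/bash
# explain_smartctl_status: decode the bit mask smartctl returns as its exit status.
explain_smartctl_status() {
    local status=$1
    local -a meanings=(
        "command line did not parse"
        "device open failed"
        "a SMART or ATA command failed"
        "SMART status is DISK FAILING"
        "prefail attributes at or below threshold"
        "attributes were at or below threshold in the past"
        "device error log contains errors"
        "self-test log contains errors"
    )
    if (( status == 0 )); then
        echo "no problems reported"
        return 0
    fi
    local bit
    for bit in {0..7}; do
        if (( status & (1 << bit) )); then
            echo "bit $bit: ${meanings[bit]}"
        fi
    done
}

# On a live system (hypothetical invocation):
#   sudo smartctl -H /dev/sda >/dev/null; explain_smartctl_status $?

explain_smartctl_status 72   # 72 = 8 + 64, so bits 3 and 6 are set
```

Testing individual bits is more robust than comparing the status to a single number, because several conditions can be reported at once.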
Detailed SMART Attributes
View comprehensive SMART data:
```bash
sudo smartctl -A /dev/sda
```
Key attributes to monitor:
| Attribute ID | Name | Critical Values | Description |
|-------------|------|----------------|-------------|
| 5 | Reallocated_Sector_Ct | >0 | Bad sectors remapped |
| 9 | Power_On_Hours | Monitor trend | Drive usage time |
| 10 | Spin_Retry_Count | >0 | HDD spin-up failures |
| 187 | Reported_Uncorrect | >0 | Uncorrectable errors |
| 188 | Command_Timeout | >0 | Command timeouts |
| 196 | Reallocated_Event_Count | >0 | Reallocation events |
| 197 | Current_Pending_Sector | >0 | Sectors waiting for reallocation |
| 198 | Offline_Uncorrectable | >0 | Uncorrectable offline errors |
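Besides the raw counts in the table, each row of `smartctl -A` output carries a normalized `VALUE` and a vendor `THRESH`; the drive considers an attribute failed once the normalized value drops to or below a non-zero threshold. A minimal sketch of that comparison follows, assuming the usual ATA column layout (`VALUE` in column 4, `THRESH` in column 6). The function name `below_threshold` and the sample rows are illustrative.

```bash
#!/bin/bash
# below_threshold: read `smartctl -A` output on stdin and print attributes whose
# normalized VALUE (column 4) is at or below the vendor THRESH (column 6).
# Rows with THRESH 0 are skipped because they can never trip.
below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
        print $1, $2, "value=" ($4 + 0), "thresh=" ($6 + 0)
    }'
}

# On a live system (hypothetical invocation):
#   sudo smartctl -A /dev/sda | below_threshold

# Demonstration with made-up sample rows:
below_threshold <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   120   120   140    Pre-fail  Always   FAILING_NOW 1520
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12345
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
EOF
```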
SMART Self-Tests
SMART supports several built-in self-tests:
Short Self-Test (1-2 minutes)
```bash
sudo smartctl -t short /dev/sda
```
Extended Self-Test (hours, depending on drive size)
```bash
sudo smartctl -t long /dev/sda
```
Checking Test Results
Monitor test progress and results:
```bash
# Check current test status
sudo smartctl -c /dev/sda
# View test results
sudo smartctl -l selftest /dev/sda
```
Example test results:
```
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1234 -
# 2 Extended offline Completed without error 00% 1200 -
```
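To act on the self-test log from a script, it is usually enough to look at the most recent entry. The sketch below assumes the ATA log format, in which smartctl numbers the rows `# 1`, `# 2`, and so on with the newest first; the function name `latest_selftest_status` and its return codes are illustrative.

```bash
#!/bin/bash
# latest_selftest_status: read `smartctl -l selftest` output on stdin and
# report whether the most recent entry (numbered "# 1") finished cleanly.
# Returns 0 if it passed, 1 if it needs review, 2 if no entry was found.
latest_selftest_status() {
    local line
    line=$(grep -m1 -E '^# *1 ' || true)
    if [[ -z $line ]]; then
        echo "no self-test recorded"
        return 2
    elif [[ $line == *"Completed without error"* ]]; then
        echo "latest self-test passed"
        return 0
    else
        echo "latest self-test needs review: $line"
        return 1
    fi
}

# On a live system (hypothetical invocation):
#   sudo smartctl -l selftest /dev/sda | latest_selftest_status

# Demonstration with sample log rows:
latest_selftest_status <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1234         -
# 2  Extended offline    Completed without error       00%      1200         -
EOF
```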
File System Checking with fsck
The `fsck` (file system check) utility examines and repairs file system inconsistencies. It's essential for maintaining file system integrity.
Basic fsck Usage
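Important: Always unmount a file system before repairing it; repairing a mounted file system can corrupt it. A read-only check (`-n`) is safe to run on a mounted file system, but it may report spurious errors because the data can change during the check.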
```bash
# Check file system read-only (mounted system)
sudo fsck -n /dev/sda2
# Check and repair unmounted file system
sudo fsck /dev/sda2
# Force check even if file system appears clean
sudo fsck -f /dev/sda2
```
File System-Specific Checks
ext4 File Systems
```bash
# Check ext4 file system
sudo e2fsck -f /dev/sda2
# Verbose output with progress
sudo e2fsck -v -f /dev/sda2
# Check and attempt automatic repairs
sudo e2fsck -p /dev/sda2
```
XFS File Systems
```bash
# Check XFS file system without modifying it (must be unmounted)
sudo xfs_repair -n /dev/sda2
# Repair XFS file system (unmounted)
sudo xfs_repair /dev/sda2
```
Understanding fsck Output
Common fsck messages and their meanings:
- "clean": File system is healthy
- "errors corrected": Minor issues were fixed
- "UNEXPECTED INCONSISTENCY": Serious problems requiring attention
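In scripts, the exit status of `fsck` is easier to act on than its messages. The status is a sum of flag values documented in the fsck manual page, so several conditions can be reported at once. The sketch below decodes it; the function name `explain_fsck_status` is illustrative.

```bash
#!/bin/bash
# explain_fsck_status: decode the exit status of fsck, which is a sum of flags.
explain_fsck_status() {
    local status=$1
    if (( status == 0 )); then
        echo "no errors"
        return 0
    fi
    (( status & 1 ))   && echo "file system errors corrected"
    (( status & 2 ))   && echo "system should be rebooted"
    (( status & 4 ))   && echo "file system errors left uncorrected"
    (( status & 8 ))   && echo "operational error"
    (( status & 16 ))  && echo "usage or syntax error"
    (( status & 32 ))  && echo "checking canceled by user request"
    (( status & 128 )) && echo "shared-library error"
    return 0
}

# On a live system (hypothetical invocation):
#   sudo fsck -n /dev/sda2; explain_fsck_status $?

explain_fsck_status 5   # 5 = 1 + 4: some errors corrected, others left uncorrected
```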
Bad Block Detection with badblocks
The `badblocks` utility performs low-level testing to identify physically damaged areas on storage devices.
Non-Destructive Read Test
```bash
# Read-only scan for bad blocks
sudo badblocks -v /dev/sda
# Read-only scan with progress
sudo badblocks -s -v /dev/sda
```
Read-Write Test (Destructive)
Warning: This test destroys data on the tested device.
```bash
# Destructive read-write test
sudo badblocks -w -s -v /dev/sda
```
Non-Destructive Read-Write Test
```bash
# Non-destructive read-write test (the device must not be mounted)
sudo badblocks -n -s -v /dev/sda
```
Saving Bad Block Lists
```bash
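# Save bad blocks to file. Scan the partition itself so the block numbers match,
# and use the file system's block size (4096 is an assumption; check with
# `sudo tune2fs -l /dev/sda2 | grep "Block size"`)
sudo badblocks -v -b 4096 -o bad-blocks-sda2.txt /dev/sda2
# Use saved bad blocks with e2fsck (file system must be unmounted)
sudo e2fsck -l bad-blocks-sda2.txt /dev/sda2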
```
System Log Analysis
System logs contain valuable information about disk errors and hardware issues.
Checking dmesg for Disk Errors
```bash
# View recent kernel messages about disks (may require sudo if dmesg is restricted)
dmesg | grep -i "error\|fail\|warn" | grep -i "sd\|ata"
# Monitor real-time disk messages
dmesg -w | grep -i "sd\|ata"
# Check for specific disk
dmesg | grep sda
```
System Log Files
Examine system logs for disk-related issues:
```bash
# Check boot-time fsck results for a partition
sudo journalctl -u systemd-fsck@dev-sda2.service
# Search for disk errors in logs (/var/log/syslog on Debian/Ubuntu; RHEL-family systems use /var/log/messages)
sudo grep -i "i/o error\|disk error\|ata" /var/log/syslog
# Real-time log monitoring
sudo tail -f /var/log/syslog | grep -i disk
```
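A single I/O error may be a transient glitch, while repeated errors on the same device are a stronger signal. The following sketch counts block-layer errors per device. It assumes the kernel logs them with the phrase `I/O error, dev <name>`, which is common but can vary between kernel versions; the function name `count_io_errors` and the sample log lines are illustrative.

```bash
#!/bin/bash
# count_io_errors: read kernel log lines on stdin and count
# "I/O error, dev X" occurrences per device, sorted by device name.
count_io_errors() {
    grep -oE 'I/O error, dev [a-z0-9]+' \
        | awk '{ count[$NF]++ } END { for (d in count) print d, count[d] }' \
        | sort
}

# On a live system (hypothetical invocation):
#   sudo dmesg | count_io_errors

# Demonstration with made-up log lines:
count_io_errors <<'EOF'
[ 1234.567890] blk_update_request: I/O error, dev sda, sector 123456
[ 1240.000001] blk_update_request: I/O error, dev sda, sector 123464
[ 1300.123456] Buffer I/O error on dev sdb1, logical block 10, async page read
[ 1301.000000] blk_update_request: I/O error, dev sdb, sector 2048
EOF
```

The `Buffer I/O error on dev` line is intentionally not counted, because it uses a different phrase and names a partition rather than the disk.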
GUI Tools for Disk Health
GNOME Disks (gnome-disk-utility)
Install and use GNOME Disks for graphical disk management:
```bash
# Install GNOME Disks
sudo apt install gnome-disk-utility
# Launch graphical interface
gnome-disks
```
Features include:
- SMART data visualization
- Disk benchmarking
- Partition management
- Health status overview
GSmartControl
A graphical frontend for smartmontools:
```bash
# Install GSmartControl
sudo apt install gsmartcontrol
# Launch application
gsmartcontrol
```
Creating Health Monitoring Scripts
Automate disk health monitoring with custom scripts:
Basic Health Check Script
```bash
#!/bin/bash
# disk-health-check.sh
LOG_FILE="/var/log/disk-health.log"
EMAIL="admin@example.com"
echo "$(date): Starting disk health check" >> "$LOG_FILE"
for disk in $(lsblk -d -n -o NAME | grep -E '^sd|^nvme'); do
    echo "Checking /dev/$disk..." >> "$LOG_FILE"
    # SMART health check: the captured line ends in PASSED on a healthy drive,
    # so match it as a substring rather than comparing the whole line
    SMART_STATUS=$(sudo smartctl -H "/dev/$disk" | grep "SMART overall-health")
    if [[ $SMART_STATUS == *PASSED* ]]; then
        echo "/dev/$disk: HEALTHY" >> "$LOG_FILE"
    else
        echo "/dev/$disk: WARNING - Check required!" >> "$LOG_FILE"
        echo "Disk /dev/$disk failed health check" | mail -s "Disk Health Alert" "$EMAIL"
    fi
    # Check for reallocated sectors (treat a missing attribute as 0)
    REALLOCATED=$(sudo smartctl -A "/dev/$disk" | grep "Reallocated_Sector_Ct" | awk '{print $10}')
    if [[ ${REALLOCATED:-0} -gt 0 ]]; then
        echo "/dev/$disk: $REALLOCATED reallocated sectors detected" >> "$LOG_FILE"
    fi
done
echo "$(date): Disk health check completed" >> "$LOG_FILE"
```
Advanced Monitoring Script
```bash
#!/bin/bash
# advanced-disk-monitor.sh
THRESHOLD_TEMP=50
THRESHOLD_REALLOCATED=10
ALERT_EMAIL="sysadmin@company.com"
check_disk_health() {
    local disk=$1
    local issues=()
    # Temperature check (a missing attribute is treated as 0)
    TEMP=$(sudo smartctl -A "/dev/$disk" | grep "Temperature_Celsius" | awk '{print $10}')
    if [[ ${TEMP:-0} -gt $THRESHOLD_TEMP ]]; then
        issues+=("High temperature: ${TEMP}°C")
    fi
    # Reallocated sectors
    REALLOCATED=$(sudo smartctl -A "/dev/$disk" | grep "Reallocated_Sector_Ct" | awk '{print $10}')
    if [[ ${REALLOCATED:-0} -gt $THRESHOLD_REALLOCATED ]]; then
        issues+=("Reallocated sectors: $REALLOCATED")
    fi
    # Pending sectors
    PENDING=$(sudo smartctl -A "/dev/$disk" | grep "Current_Pending_Sector" | awk '{print $10}')
    if [[ ${PENDING:-0} -gt 0 ]]; then
        issues+=("Pending sectors: $PENDING")
    fi
    # Report issues
    if [[ ${#issues[@]} -gt 0 ]]; then
        echo "ALERT: /dev/$disk has issues:"
        printf '%s\n' "${issues[@]}"
        printf '%s\n' "${issues[@]}" | mail -s "Disk Alert: /dev/$disk" "$ALERT_EMAIL"
    fi
}
# Check all disks
for disk in $(lsblk -d -n -o NAME | grep -E '^sd|^nvme'); do
    check_disk_health "$disk"
done
```
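The loops above also match `nvme*` devices, but NVMe drives do not report the ATA attribute table, so names such as `Temperature_Celsius` or `Reallocated_Sector_Ct` will not appear for them. For NVMe, `smartctl -A` prints a health log of `Label: value` lines instead. The sketch below extracts one numeric field from that log; the function name `nvme_field` is illustrative, and the labels are assumed to match the wording smartctl commonly uses (for example `Temperature`, `Percentage Used`, `Media and Data Integrity Errors`).

```bash
#!/bin/bash
# nvme_field: read NVMe `smartctl -A` output on stdin and print the first
# number on the line whose label (the text before the colon) matches exactly.
nvme_field() {
    local label=$1
    awk -F: -v label="$label" '
        $1 == label {
            gsub(/,/, "", $2)    # drop thousands separators
            if (match($2, /[0-9]+/)) print substr($2, RSTART, RLENGTH)
            exit
        }'
}

# On a live system (hypothetical invocation):
#   sudo smartctl -A /dev/nvme0 | nvme_field "Percentage Used"

# Demonstration with sample health-log lines:
nvme_field "Percentage Used" <<'EOF'
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Percentage Used:                    3%
Media and Data Integrity Errors:    0
EOF
```

Hexadecimal fields such as `Critical Warning` are not handled by this numeric extraction and would need separate parsing.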
Automated Cron Job Setup
Schedule regular health checks:
```bash
# Edit root's crontab (the scripts need root privileges for smartctl)
sudo crontab -e
# Add daily health check at 2 AM
0 2 * * * /usr/local/bin/disk-health-check.sh
# Add weekly comprehensive check on Sundays at 3 AM
0 3 * * 0 /usr/local/bin/advanced-disk-monitor.sh
```
Troubleshooting Common Issues
Issue: "SMART command failed"
Symptoms: SMART commands return errors or "UNAVAILABLE" status.
Solutions:
```bash
# Check if drive supports SMART
sudo smartctl -i /dev/sda
# Try enabling SMART
sudo smartctl -s on /dev/sda
# For USB/external drives, use different options
sudo smartctl -d sat -H /dev/sdb
```
Issue: "Device or resource busy" during fsck
Symptoms: Cannot run fsck because file system is mounted.
Solutions:
```bash
# Identify what's using the device (-m lists processes using the mounted file system)
sudo lsof /dev/sda2
sudo fuser -vm /dev/sda2
# Unmount the file system
sudo umount /dev/sda2
# For root file system, use single-user mode or live USB
```
Issue: High number of reallocated sectors
Symptoms: SMART attribute 5 (Reallocated_Sector_Ct) shows increasing values.
Actions:
1. Backup data immediately
2. Run extended SMART test
3. Monitor trend over time
4. Plan for drive replacement
```bash
# Monitor reallocated sectors over time
sudo smartctl -A /dev/sda | grep "Reallocated_Sector_Ct"
# Run extended test
sudo smartctl -t long /dev/sda
```
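Monitoring the trend means keeping earlier readings to compare against. One possible approach, sketched below, appends each reading to a small history file and reports whether the count grew since the previous one. The function name `record_reallocated`, the `epoch,value` file format, and the temporary file used in the demonstration are all illustrative choices.

```bash
#!/bin/bash
# record_reallocated: append "epoch,value" to a history file and report
# whether the count grew since the previous reading.
record_reallocated() {
    local history_file=$1 current=$2 previous=""
    if [[ -s $history_file ]]; then
        previous=$(tail -n 1 "$history_file" | cut -d, -f2)
    fi
    printf '%s,%s\n' "$(date +%s)" "$current" >> "$history_file"
    if [[ -n $previous ]] && (( current > previous )); then
        echo "increase: $previous -> $current"
    else
        echo "no increase (current: $current)"
    fi
}

# On a live system (hypothetical invocation, path is an assumption):
#   count=$(sudo smartctl -A /dev/sda | awk '$1 == 5 {print $10}')
#   record_reallocated /var/lib/disk-health/sda-reallocated.csv "$count"

# Demonstration with a temporary file:
hist_file=$(mktemp)
record_reallocated "$hist_file" 8     # first reading
record_reallocated "$hist_file" 8     # unchanged
record_reallocated "$hist_file" 12    # grew since the last reading
rm -f "$hist_file"
```

A growing count matters more than the absolute number, so the comparison is against the last recorded value rather than a fixed threshold.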
Issue: SSD-specific concerns
For SSD health monitoring:
```bash
# Check SSD-specific attributes
sudo smartctl -A /dev/sda | grep -E "Wear_Leveling|Program_Fail|Erase_Fail"
# Check total bytes written (SSD endurance)
sudo smartctl -A /dev/sda | grep "Total_LBAs_Written"
```
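The raw `Total_LBAs_Written` value is a count of units, not bytes, so it has to be converted before it can be compared with the endurance rating on the drive's data sheet. The sketch below assumes one unit equals one 512-byte sector; that assumption does not hold for every vendor, so check the drive's documentation. The function name `lbas_to_tb` is illustrative.

```bash
#!/bin/bash
# lbas_to_tb: convert a Total_LBAs_Written raw value to decimal terabytes.
# ASSUMPTION: one unit is one 512-byte sector; pass a different unit size
# in bytes as the second argument if the vendor counts differently.
lbas_to_tb() {
    local lbas=$1 unit_bytes=${2:-512}
    awk -v n="$lbas" -v s="$unit_bytes" 'BEGIN { printf "%.2f\n", n * s / 1e12 }'
}

# On a live system (hypothetical invocation):
#   lbas_to_tb "$(sudo smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {print $10}')"

lbas_to_tb 1953125000    # 1953125000 * 512 bytes = 1.00 TB
```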
Issue: False positives in health checks
Common causes and solutions:
1. Old drives with expected wear: Adjust monitoring thresholds
2. External USB drives: Use appropriate SMART options (`-d sat`)
3. RAID configurations: Check individual drives and RAID status
Best Practices for Disk Health Monitoring
Regular Monitoring Schedule
- Daily: Quick SMART health checks
- Weekly: File system checks on unmounted partitions
- Monthly: Extended SMART self-tests
- Quarterly: Bad block scans on critical systems
Proactive Maintenance
1. Keep firmware updated on storage devices
2. Maintain adequate free space (>10% for optimal performance)
3. Monitor temperature and ensure proper cooling
4. Regular backups - The most important protection against disk failure
5. Document drive history - Track age, workload, and previous issues
Warning Signs to Watch For
Critical SMART Attributes
Monitor these attributes closely:
```bash
#!/bin/bash
# Monitoring script for critical attributes
CRITICAL_ATTRS=(5 187 188 196 197 198)
for disk in /dev/sd?; do
    for attr in "${CRITICAL_ATTRS[@]}"; do
        VALUE=$(sudo smartctl -A "$disk" | awk -v attr="$attr" '$1==attr {print $10}')
        # A missing attribute is treated as 0
        if [[ ${VALUE:-0} -gt 0 ]]; then
            echo "WARNING: $disk attribute $attr = $VALUE"
        fi
    done
done
```
Performance Degradation Signs
- Increased read/write response times
- Frequent I/O errors in system logs
- Applications hanging during disk operations
- Unusual drive noises (clicking, grinding)
Environment Considerations
Temperature Management
```bash
# Monitor drive temperatures
sudo smartctl -A /dev/sda | grep Temperature
# Set up temperature alerts
TEMP_THRESHOLD=50
CURRENT_TEMP=$(sudo smartctl -A /dev/sda | grep "Temperature_Celsius" | awk '{print $10}')
if [[ ${CURRENT_TEMP:-0} -gt $TEMP_THRESHOLD ]]; then
    echo "Drive temperature critical: ${CURRENT_TEMP}°C" | mail -s "Temperature Alert" admin@domain.com
fi
```
Power Management
For servers and critical systems:
- Use UPS (Uninterruptible Power Supply)
- Monitor power-on hours and power cycle counts
- Consider drive rotation schedules for high-usage systems
Performance Impact Considerations
Minimizing Impact During Health Checks
Scheduling Tests During Low Activity
```bash
# Check system load before running intensive tests
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD < 1.0" | bc -l) )); then
    # Run intensive disk tests
    sudo smartctl -t long /dev/sda
else
    echo "System load too high, skipping test"
fi
```
Using ionice for Disk Operations
Control I/O priority for disk checking operations:
```bash
# Run badblocks with low I/O priority
sudo ionice -c 3 badblocks -s -v /dev/sda
# Run fsck with idle I/O priority
sudo ionice -c 3 nice -n 19 fsck -n /dev/sda2
```
Balancing Thoroughness with Performance
Quick vs. Comprehensive Checks
Daily Quick Checks:
```bash
# Fast health overview
sudo smartctl -H /dev/sda
sudo dmesg | tail -20 | grep -i error
```
Weekly Detailed Checks:
```bash
# More thorough analysis
sudo smartctl -a /dev/sda
sudo fsck -n /dev/sda2
```
Monthly Deep Analysis:
```bash
# Complete assessment
sudo smartctl -t long /dev/sda
sudo badblocks -s -v /dev/sda
```
Enterprise vs Desktop Monitoring
Desktop/Personal Systems
For personal computers and workstations:
Simplified Monitoring Approach
```bash
#!/bin/bash
# desktop-disk-monitor.sh
# Simple script for desktop users
echo "=== Disk Health Summary ==="
for disk in $(lsblk -d -n -o NAME | grep -E '^sd|^nvme'); do
    echo -n "/dev/$disk: "
    HEALTH=$(sudo smartctl -H "/dev/$disk" 2>/dev/null | grep "overall-health" | awk '{print $6}')
    if [[ "$HEALTH" == "PASSED" ]]; then
        echo "✓ HEALTHY"
    else
        echo "⚠ NEEDS ATTENTION"
        # Show critical attributes
        sudo smartctl -A "/dev/$disk" | grep -E "Reallocated_Sector_Ct|Current_Pending_Sector"
    fi
done
```
GUI Integration
For desktop environments, integrate with notification systems:
```bash
#!/bin/bash
# Desktop notification script
HEALTH_ISSUES=0
for disk in /dev/sd?; do
    if ! sudo smartctl -H "$disk" | grep -q "PASSED"; then
        HEALTH_ISSUES=$((HEALTH_ISSUES + 1))
    fi
done
if [[ $HEALTH_ISSUES -gt 0 ]]; then
    notify-send "Disk Health Warning" "$HEALTH_ISSUES drive(s) need attention" -u critical
fi
```
Enterprise/Server Systems
For production servers and enterprise environments:
Comprehensive Monitoring Framework
```bash
#!/bin/bash
# enterprise-disk-monitor.sh
# Comprehensive monitoring for servers
CONFIG_FILE="/etc/disk-monitor.conf"
LOG_FILE="/var/log/disk-health.log"
ALERT_EMAIL="sysadmin@company.com"
SNMP_TRAP="192.168.1.100"
source "$CONFIG_FILE" 2>/dev/null || {
    # Default configuration
    TEMP_THRESHOLD=45
    REALLOCATED_THRESHOLD=5
    PENDING_THRESHOLD=1
}
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" >> "$LOG_FILE"
}
send_alert() {
    local severity=$1
    local message=$2
    # Email alert
    echo "$message" | mail -s "[$severity] Disk Health Alert" "$ALERT_EMAIL"
    # SNMP trap (if configured)
    if [[ -n "$SNMP_TRAP" ]]; then
        snmptrap -v2c -c public "$SNMP_TRAP" '' 1.3.6.1.4.1.1 \
            1.3.6.1.4.1.1.1 s "$severity" \
            1.3.6.1.4.1.1.2 s "$message"
    fi
    # Syslog
    logger -p local0.err "DISK_HEALTH: $severity - $message"
}
check_enterprise_disk() {
    local disk=$1
    local model=$(sudo smartctl -i "$disk" | grep "Device Model" | cut -d: -f2 | xargs)
    local serial=$(sudo smartctl -i "$disk" | grep "Serial Number" | cut -d: -f2 | xargs)
    log_message "Checking $disk ($model, S/N: $serial)"
    # Critical health check
    if ! sudo smartctl -H "$disk" | grep -q "PASSED"; then
        send_alert "CRITICAL" "Drive $disk failed SMART health check"
        return 1
    fi
    # Detailed attribute analysis
    local smartdata=$(sudo smartctl -A "$disk")
    # Temperature check (missing attributes are treated as 0)
    local temp=$(echo "$smartdata" | grep "Temperature_Celsius" | awk '{print $10}')
    if [[ ${temp:-0} -gt $TEMP_THRESHOLD ]]; then
        send_alert "WARNING" "Drive $disk temperature high: ${temp}°C"
    fi
    # Reallocated sectors
    local reallocated=$(echo "$smartdata" | grep "Reallocated_Sector_Ct" | awk '{print $10}')
    if [[ ${reallocated:-0} -gt $REALLOCATED_THRESHOLD ]]; then
        send_alert "WARNING" "Drive $disk has $reallocated reallocated sectors"
    fi
    # Pending sectors
    local pending=$(echo "$smartdata" | grep "Current_Pending_Sector" | awk '{print $10}')
    if [[ ${pending:-0} -gt $PENDING_THRESHOLD ]]; then
        send_alert "CRITICAL" "Drive $disk has $pending pending sectors"
    fi
    # Power on hours tracking
    local power_hours=$(echo "$smartdata" | grep "Power_On_Hours" | awk '{print $10}')
    log_message "$disk power-on hours: $power_hours"
    return 0
}
# Main monitoring loop
for disk in $(lsblk -d -n -o NAME | grep -E '^sd|^nvme' | sed 's/^/\/dev\//'); do
    check_enterprise_disk "$disk"
done
```
Integration with Monitoring Systems
For integration with enterprise monitoring solutions:
```bash
#!/bin/bash
# check_disk_health_nagios.sh
# Nagios/Icinga plugin format: exit 0 = OK, 1 = WARNING, 2 = CRITICAL
CRITICAL=0
WARNING=0
OK=0
MESSAGE=""
for disk in /dev/sd?; do
    HEALTH=$(sudo smartctl -H "$disk" 2>/dev/null | grep "overall-health" | awk '{print $6}')
    if [[ "$HEALTH" != "PASSED" ]]; then
        CRITICAL=1
        MESSAGE="$MESSAGE $disk:CRITICAL"
    else
        # Check for warnings (a missing attribute is treated as 0)
        REALLOCATED=$(sudo smartctl -A "$disk" | grep "Reallocated_Sector_Ct" | awk '{print $10}')
        if [[ ${REALLOCATED:-0} -gt 0 ]]; then
            WARNING=1
            MESSAGE="$MESSAGE $disk:WARNING($REALLOCATED reallocated)"
        else
            OK=1
        fi
    fi
done
if [[ $CRITICAL -eq 1 ]]; then
    echo "CRITICAL - Disk health issues:$MESSAGE"
    exit 2
elif [[ $WARNING -eq 1 ]]; then
    echo "WARNING - Disk health concerns:$MESSAGE"
    exit 1
else
    echo "OK - All disks healthy"
    exit 0
fi
```
Conclusion
Effective disk health monitoring is a critical aspect of Linux system administration that requires a multi-layered approach. By combining SMART monitoring, file system checks, log analysis, and proactive scripting, you can significantly reduce the risk of unexpected data loss and system downtime.
Key Takeaways
1. Use SMART monitoring as your primary tool - It provides the most comprehensive health information
2. Implement automated monitoring - Regular scripted checks catch issues early
3. Don't ignore warnings - Even minor SMART attribute changes can indicate developing problems
4. Maintain regular backups - No monitoring solution replaces good backup practices
5. Document your findings - Keep records of drive health trends and maintenance activities
Recommended Implementation Strategy
1. Start with basic SMART monitoring using daily automated checks
2. Add file system verification for critical partitions
3. Implement log monitoring to catch real-time issues
4. Deploy comprehensive scripts tailored to your environment
5. Establish alert mechanisms for immediate notification of problems
Final Recommendations
- Test your monitoring setup regularly to ensure it's working correctly
- Keep monitoring tools updated to support newer drive technologies
- Train team members on interpreting health data and responding to alerts
- Review and adjust thresholds based on your specific hardware and usage patterns
Remember that disk health monitoring is an ongoing process, not a one-time setup. Regular review and refinement of your monitoring strategy will help ensure the long-term reliability of your Linux systems and protect your valuable data from unexpected hardware failures.
By following the comprehensive approaches outlined in this guide, you'll be well-equipped to maintain optimal disk health across your Linux infrastructure, from single desktop systems to complex enterprise server environments.