How to recover Linux from disaster - Adv. Disaster Recovery & Backup Guide

# How to Recover Linux from Disaster

Linux systems, while robust and reliable, are not immune to disasters. Whether facing hardware failures, corrupted filesystems, accidental deletions, or system crashes, knowing how to effectively recover your Linux environment is crucial for system administrators and users alike. This comprehensive guide will walk you through various disaster recovery scenarios, providing step-by-step instructions, practical examples, and professional insights to help you restore your Linux system to full functionality.

## Table of Contents

1. [Understanding Linux Disaster Scenarios](#understanding-linux-disaster-scenarios)
2. [Prerequisites and Preparation](#prerequisites-and-preparation)
3. [Boot Recovery Methods](#boot-recovery-methods)
4. [Filesystem Recovery Techniques](#filesystem-recovery-techniques)
5. [Data Recovery Strategies](#data-recovery-strategies)
6. [System Configuration Recovery](#system-configuration-recovery)
7. [Network and Service Recovery](#network-and-service-recovery)
8. [Advanced Recovery Techniques](#advanced-recovery-techniques)
9. [Troubleshooting Common Issues](#troubleshooting-common-issues)
10. [Best Practices and Prevention](#best-practices-and-prevention)

## Understanding Linux Disaster Scenarios

Before diving into recovery procedures, it's essential to understand the types of disasters that can affect Linux systems:

### Hardware-Related Disasters

- Hard drive failures: Complete disk failure or bad sectors
- Memory corruption: RAM issues causing system instability
- Power supply problems: Sudden shutdowns leading to filesystem corruption
- Motherboard failures: Complete system hardware breakdown

### Software-Related Disasters

- Bootloader corruption: GRUB or other bootloader issues preventing system startup
- Kernel panics: System crashes due to kernel-level problems
- Filesystem corruption: Damaged file structures preventing normal operation
- Configuration file corruption: Critical system files becoming unreadable

### Human Error Disasters

- Accidental deletions: Removing critical system files or user data
- Incorrect permissions: Changing file permissions that break system functionality
- Misconfigured services: Breaking essential system services through poor configuration

### Security-Related Disasters

- Malware infections: Viruses or rootkits compromising system integrity
- Unauthorized access: System compromise through security breaches
- Data breaches: Sensitive information exposure requiring system recovery

## Prerequisites and Preparation

### Essential Tools and Resources

Before attempting any recovery procedure, ensure you have access to:

#### Recovery Media

```bash
# Create a bootable USB drive with a Linux distribution
sudo dd if=ubuntu-20.04.3-desktop-amd64.iso of=/dev/sdX bs=4M status=progress
sync
```

#### Backup Verification

```bash
# Verify backup integrity before disaster strikes
tar -tzf backup.tar.gz > /dev/null
echo $?  # Should return 0 for successful verification
```

#### Network Access

Ensure you have alternative internet connectivity for downloading tools and accessing documentation during recovery.
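The backup verification step above only confirms that the archive can be listed. A minimal sketch of a stricter check follows, assuming that a checksum file named `<archive>.sha256` was written with `sha256sum` when the backup was created. The function name `verify_backup` and the `.sha256` naming convention are illustrative assumptions, not part of any standard tool.

```bash
#!/usr/bin/env bash
# verify_backup ARCHIVE
# Hypothetical helper (the name and the .sha256 sidecar convention are
# assumptions). Expects ARCHIVE.sha256 to have been created at backup time
# with:  sha256sum ARCHIVE > ARCHIVE.sha256
verify_backup() {
    local archive="$1"
    local sumfile="${archive}.sha256"
    local expected actual

    [ -f "$archive" ] || { echo "missing archive: $archive" >&2; return 1; }
    [ -f "$sumfile" ] || { echo "missing checksum file: $sumfile" >&2; return 1; }

    # Compare the stored hash with the hash of the archive as it is now.
    expected="$(cut -d ' ' -f 1 "$sumfile")"
    actual="$(sha256sum "$archive" | cut -d ' ' -f 1)"
    if [ "$expected" != "$actual" ]; then
        echo "checksum mismatch: $archive" >&2
        return 1
    fi

    # Confirm that the gzip-compressed tar archive can be read to the end.
    tar -tzf "$archive" > /dev/null || { echo "unreadable archive: $archive" >&2; return 1; }

    echo "OK: $archive"
}
```

Calling `verify_backup backup.tar.gz` prints an `OK:` line and returns 0 only when both checks pass, so it can be run from a scheduled job that alerts on a non-zero exit status.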
#### Documentation

Maintain offline copies of:

- System configuration details
- Network settings
- User account information
- Installed package lists
- Custom application configurations

### Pre-Disaster Preparation

#### System Information Gathering

```bash
# Document system information
uname -a > system_info.txt
lscpu >> system_info.txt
lsblk >> system_info.txt
df -h >> system_info.txt
mount >> system_info.txt
```

#### Package List Backup

```bash
# Debian/Ubuntu systems
dpkg --get-selections > package_list.txt

# Red Hat/CentOS/Fedora systems
rpm -qa > package_list.txt

# Arch Linux systems
pacman -Qqe > package_list.txt
```

## Boot Recovery Methods

### GRUB Recovery

When your system won't boot due to GRUB issues, follow these steps:

#### Method 1: GRUB Rescue Mode

```bash
# If you see the grub rescue> prompt
# (use ls to find the partition that holds /boot/grub; (hd0,1) is an example)
grub rescue> ls
grub rescue> set root=(hd0,1)
grub rescue> set prefix=(hd0,1)/boot/grub
grub rescue> insmod normal
grub rescue> normal
```

#### Method 2: Live CD GRUB Reinstallation

```bash
# Boot from live CD/USB
sudo mount /dev/sda1 /mnt
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys
sudo chroot /mnt

# Reinstall GRUB
grub-install /dev/sda
update-grub
exit

# Unmount and reboot
sudo umount /mnt/sys
sudo umount /mnt/proc
sudo umount /mnt/dev
sudo umount /mnt
```

### Single User Mode Recovery

Access single-user mode for system repairs:

```bash
# At the GRUB menu, edit the kernel line and add one of:
#   single
#   init=/bin/bash

# Once in single-user mode:
mount -o remount,rw /

# Perform necessary repairs
# Reboot when complete
```

### SystemD Emergency Mode

For systemd-based systems:

```bash
# Add to kernel parameters:
#   systemd.unit=emergency.target

# Or use rescue mode:
#   systemd.unit=rescue.target
```

## Filesystem Recovery Techniques

### Filesystem Check and Repair

#### ext2/ext3/ext4 Filesystems

```bash
# Unmount the filesystem first
sudo umount /dev/sda1

# Check and repair automatically
sudo fsck -y /dev/sda1

# For more serious corruption:
sudo e2fsck -f -y /dev/sda1

# Force check even if filesystem appears clean:
sudo e2fsck -f /dev/sda1
```

#### XFS Filesystem Recovery

```bash
# XFS repair (filesystem must be unmounted)
sudo umount /dev/sda1
sudo xfs_repair /dev/sda1

# For more aggressive repair:
sudo xfs_repair -L /dev/sda1  # Zeroes log, use as last resort
```

#### Btrfs Filesystem Recovery

```bash
# Check Btrfs filesystem (must be unmounted)
sudo btrfs check /dev/sda1

# Repair Btrfs (dangerous, backup first)
sudo btrfs check --repair /dev/sda1

# Scrub for error detection and repair (runs on a mounted filesystem)
sudo btrfs scrub start /mount/point
sudo btrfs scrub status /mount/point
```

### Advanced Filesystem Recovery

#### Using TestDisk for Partition Recovery

```bash
# Install TestDisk
sudo apt-get install testdisk

# Run TestDisk
sudo testdisk

# Follow interactive prompts to:
# 1. Select disk
# 2. Choose partition table type
# 3. Analyze partition structure
# 4. Search for lost partitions
# 5. Write partition table
```

#### Recovering Deleted Files with PhotoRec

```bash
# PhotoRec comes with TestDisk
sudo photorec

# Select:
# 1. Physical disk
# 2. Partition type
# 3. File types to recover
# 4. Destination directory
# 5. Start recovery process
```

## Data Recovery Strategies

### Using ddrescue for Disk Imaging

Create a bit-by-bit copy of a failing drive:

```bash
# Install ddrescue
sudo apt-get install gddrescue

# Create disk image with error handling (the last argument is the map file
# that lets an interrupted run resume)
sudo ddrescue -d -r3 /dev/sda /path/to/recovery/disk_image.img /path/to/recovery/logfile.log

# Attach the recovered image read-only and scan it for partitions;
# losetup prints the loop device it picked (for example /dev/loop0)
sudo losetup --find --show --read-only --partscan /path/to/recovery/disk_image.img

# Mount a partition from the image (adjust loop0p1 to the device printed above)
sudo mount -o ro /dev/loop0p1 /mnt/recovered
```

### File Recovery with Extundelete

Recover deleted files from ext3/ext4 filesystems:

```bash
# Install extundelete
sudo apt-get install extundelete

# Unmount the filesystem
sudo umount /dev/sda1

# Recover all deleted files
sudo extundelete /dev/sda1 --restore-all

# Recover specific file
sudo extundelete /dev/sda1 --restore-file /path/to/deleted/file.txt

# Recover files deleted after specific date
sudo extundelete /dev/sda1 --restore-all --after $(date -d "2023-01-01" +%s)
```

### Database Recovery

#### MySQL/MariaDB Recovery

```bash
# Start MySQL in recovery mode
sudo mysqld_safe --skip-grant-tables --skip-networking &

# Connect and reset the root password if needed
mysql -u root

# In the mysql client (the PASSWORD() function was removed in MySQL 8.0,
# so use ALTER USER after reloading the grant tables):
FLUSH PRIVILEGES;
ALTER USER 'root'@'localhost' IDENTIFIED BY 'newpassword';

# Repair tables
mysqlcheck --repair --all-databases

# InnoDB recovery
sudo systemctl stop mysql
# Edit /etc/mysql/my.cnf and add:
#   innodb_force_recovery = 1
sudo systemctl start mysql
# Export data and recreate database
```

#### PostgreSQL Recovery

```bash
# Single-user mode recovery
sudo -u postgres postgres --single -D /var/lib/postgresql/data

# VACUUM and REINDEX in single-user mode
VACUUM FULL;
REINDEX DATABASE your_database;

# WAL recovery (last resort: resetting the WAL can lose recent transactions)
sudo -u postgres pg_resetwal /var/lib/postgresql/data
```

## System Configuration Recovery

### Restoring from Configuration Backups

#### Using etckeeper for /etc Recovery

```bash
# If etckeeper was previously set up
cd /etc
sudo git log --oneline      # View configuration history
sudo git checkout HEAD~5    # Restore to 5 commits ago
sudo git checkout master    # Return to current state
```
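Checking out an older commit, as in the etckeeper example above, rewinds every tracked file in `/etc` at once. When only one file is broken, it is usually safer to restore that file alone. The following is a minimal sketch using standard git commands; the function name `restore_tracked_file` is a hypothetical helper, and the file path in the usage note is only an example.

```bash
#!/usr/bin/env bash
# restore_tracked_file REPO REV PATH
# Hypothetical helper (not part of etckeeper or git): restores a single
# tracked file from revision REV of the git repository at REPO.
# PATH is relative to the repository root.
restore_tracked_file() {
    local repo="$1" rev="$2" path="$3"

    # Fail early if the file does not exist in that revision.
    if ! git -C "$repo" cat-file -e "${rev}:${path}" 2>/dev/null; then
        echo "not found: ${rev}:${path}" >&2
        return 1
    fi

    # Show how the working copy differs from the revision, then restore
    # only this one file, leaving the rest of the tree untouched.
    git -C "$repo" --no-pager diff "$rev" -- "$path"
    git -C "$repo" checkout "$rev" -- "$path"
}
```

For an `/etc` repository this would be run as root, for example `restore_tracked_file /etc HEAD~5 ssh/sshd_config`, followed by a restart of the affected service.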
#### Manual Configuration Restoration

```bash
# Restore network configuration
sudo cp /backup/etc/network/interfaces /etc/network/interfaces
sudo systemctl restart networking

# Restore user accounts
sudo cp /backup/etc/passwd /etc/passwd
sudo cp /backup/etc/shadow /etc/shadow
sudo cp /backup/etc/group /etc/group

# Restore SSH configuration
sudo cp /backup/etc/ssh/sshd_config /etc/ssh/sshd_config
sudo systemctl restart sshd
```

### Service Configuration Recovery

#### SystemD Service Recovery

```bash
# Restore service files
sudo cp /backup/etc/systemd/system/* /etc/systemd/system/
sudo systemctl daemon-reload

# Re-enable services
sudo systemctl enable apache2
sudo systemctl enable mysql
sudo systemctl enable ssh

# Check service status
sudo systemctl status --all
```

#### Crontab Recovery

```bash
# Restore system crontab
sudo cp /backup/etc/crontab /etc/crontab

# Restore user crontabs
sudo cp /backup/var/spool/cron/crontabs/* /var/spool/cron/crontabs/
sudo systemctl restart cron
```

## Network and Service Recovery

### Network Configuration Recovery

#### Static IP Configuration

```bash
# Ubuntu/Debian - contents of /etc/network/interfaces
auto eth0
iface eth0 inet static
    address 192.168.1.100
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8 8.8.4.4

# Restart networking
sudo systemctl restart networking
```

#### NetworkManager Recovery

```bash
# Restart NetworkManager
sudo systemctl restart NetworkManager

# Reset network configuration (this deletes all saved connections)
sudo rm /etc/NetworkManager/system-connections/*
sudo systemctl restart NetworkManager

# Reconfigure network through nmtui
sudo nmtui
```

#### DNS Resolution Recovery

```bash
# Restore /etc/resolv.conf
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 8.8.4.4" | sudo tee -a /etc/resolv.conf

# Test DNS resolution
nslookup google.com
dig google.com
```

## Advanced Recovery Techniques

### LVM Recovery

#### Recovering LVM Metadata

```bash
# Scan for LVM volumes
sudo pvscan
sudo vgscan
sudo lvscan

# Activate volume groups
sudo vgchange -ay

# Restore LVM metadata from backup
sudo vgcfgrestore volume_group_name

# Mount recovered logical volumes
sudo mount /dev/volume_group/logical_volume /mnt/recovery
```

### RAID Recovery

#### Software RAID Recovery

```bash
# Check RAID status
cat /proc/mdstat

# Stop failed RAID
sudo mdadm --stop /dev/md0

# Reassemble RAID
sudo mdadm --assemble --scan

# Force assembly with missing drives
sudo mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

# Add replacement drive
sudo mdadm --add /dev/md0 /dev/sdc1
```

### Encrypted Filesystem Recovery

#### LUKS Recovery

```bash
# Check LUKS header
sudo cryptsetup luksDump /dev/sda1

# Repair LUKS header
sudo cryptsetup repair /dev/sda1

# Open encrypted volume
sudo cryptsetup luksOpen /dev/sda1 encrypted_volume

# Mount decrypted filesystem
sudo mount /dev/mapper/encrypted_volume /mnt/encrypted
```

## Troubleshooting Common Issues

### Boot Issues

#### Kernel Panic Resolution

```bash
# Boot with previous kernel version from GRUB menu
# Or add kernel parameters:
#   nomodeset acpi=off

# Check system logs after boot
journalctl -b -1    # Previous boot
dmesg | less        # Current boot messages
```

#### Initramfs Issues

```bash
# Regenerate initramfs
sudo update-initramfs -u -k all

# For specific kernel version
sudo update-initramfs -u -k 5.4.0-74-generic
```

### Filesystem Issues

#### Read-Only Filesystem

```bash
# Remount as read-write
sudo mount -o remount,rw /

# Check for filesystem errors (run this from rescue media or with the
# filesystem unmounted; fsck on a filesystem mounted read-write can damage it)
sudo fsck -f /dev/sda1
```

#### Inode Exhaustion

```bash
# Check inode usage
df -i

# Find directories with many files (the path is passed as an argument so
# that names with spaces or quotes are handled safely)
find /path -xdev -type d -exec sh -c 'printf "%s: %s\n" "$1" "$(ls -1A "$1" | wc -l)"' _ {} \;

# Clean up unnecessary files
sudo find /tmp -type f -atime +7 -delete
sudo find /var/log -name "*.log" -type f -size +100M -delete
```

### Permission Issues

#### Fixing Broken Permissions

```bash
# Reset /etc permissions
sudo chmod 755 /etc
sudo chmod 644 /etc/passwd
sudo chmod 600 /etc/shadow   # Debian/Ubuntu normally use 640 with group shadow

# Reset home directory permissions
sudo chmod 755 /home/username
sudo chmod 700 /home/username/.ssh
sudo chmod 600 /home/username/.ssh/authorized_keys
```

## Best Practices and Prevention

### Backup Strategies

#### Automated System Backups
```bash
#!/bin/bash
# Comprehensive backup script
BACKUP_DIR="/backup/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# System configuration
tar -czf "$BACKUP_DIR/etc_backup.tar.gz" /etc

# User data
tar -czf "$BACKUP_DIR/home_backup.tar.gz" /home

# Database backup
mysqldump --all-databases > "$BACKUP_DIR/mysql_backup.sql"

# Package list
dpkg --get-selections > "$BACKUP_DIR/package_list.txt"

# System information
uname -a > "$BACKUP_DIR/system_info.txt"
lsblk >> "$BACKUP_DIR/system_info.txt"
```

#### Using rsync for Incremental Backups

```bash
# Daily incremental backup
rsync -avz --delete /home/ backup_server:/backups/home/
rsync -avz --delete /etc/ backup_server:/backups/etc/
rsync -avz --delete /var/www/ backup_server:/backups/www/
```

### Monitoring and Early Detection

#### System Health Monitoring

```bash
# Install monitoring tools
sudo apt-get install smartmontools lm-sensors

# Check disk health
sudo smartctl -a /dev/sda

# Monitor system temperatures
sensors

# Check system logs regularly
journalctl -p err -b
```

#### Automated Health Checks

```bash
#!/bin/bash
# Daily health check script
LOG_FILE="/var/log/health_check.log"

echo "$(date): Starting health check" >> "$LOG_FILE"

# Check disk space (skip the header; $5+0 turns "95%" into the number 95)
df -hP | awk 'NR > 1 && $5+0 > 90 {print "WARNING: " $0}' >> "$LOG_FILE"

# Check memory usage
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }' >> "$LOG_FILE"

# Check load average
uptime >> "$LOG_FILE"

# Check for failed services
systemctl --failed >> "$LOG_FILE"

echo "$(date): Health check completed" >> "$LOG_FILE"
```

### Documentation and Change Management

#### Maintain Recovery Documentation

- Keep updated network diagrams
- Document all system changes
- Maintain contact information for vendors
- Create step-by-step recovery procedures
- Test recovery procedures regularly

#### Version Control for Configurations

```bash
# Initialize git repository for /etc
cd /etc
sudo git init
sudo git add .
sudo git commit -m "Initial configuration snapshot"

# Create a daily cron job for automatic commits
echo '#!/bin/bash
cd /etc && git add -A && git commit -m "Auto-commit $(date)"' | sudo tee /etc/cron.daily/etc-backup
sudo chmod +x /etc/cron.daily/etc-backup
```

### Testing Recovery Procedures

#### Regular Recovery Drills

- Schedule monthly recovery tests
- Document test results and improvements
- Update procedures based on lessons learned
- Train team members on recovery procedures

#### Virtualized Testing Environment

```bash
# Create VM snapshots before major changes
virsh snapshot-create-as domain_name snapshot_name "Pre-update snapshot"

# Test recovery procedures in an isolated environment, then roll the VM
# back to the snapshot
virsh snapshot-revert domain_name snapshot_name
```

## Conclusion

Linux disaster recovery requires preparation, knowledge, and the right tools. By understanding common disaster scenarios, maintaining proper backups, and following systematic recovery procedures, you can minimize downtime and data loss when disasters strike.

Key takeaways for effective Linux disaster recovery:

1. Prevention is Better Than Cure: Implement robust backup strategies and monitoring systems
2. Document Everything: Maintain detailed system documentation and recovery procedures
3. Test Regularly: Regularly test backup integrity and recovery procedures
4. Stay Calm and Systematic: Follow established procedures methodically during actual disasters
5. Learn and Improve: Document lessons learned from each incident to improve future recovery efforts

Remember that disaster recovery is an ongoing process, not a one-time setup. Regular maintenance of backup systems, testing of recovery procedures, and staying updated with the latest recovery tools and techniques are essential for maintaining a robust disaster recovery capability. By implementing the strategies and techniques outlined in this guide, you'll be well-prepared to handle various Linux disaster scenarios and restore your systems to full functionality with minimal disruption to your operations.
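As one way to act on the "Test Regularly" takeaway, the sketch below extracts a backup archive into a scratch directory and checks that a few expected files came back, without touching the live system. The function name `restore_drill` is hypothetical, and the archive is assumed to be a gzip-compressed tar file like the ones produced by the backup script in this guide.

```bash
#!/usr/bin/env bash
# restore_drill ARCHIVE FILE...
# Hypothetical helper for a restore test (the name is an assumption).
# Extracts ARCHIVE into a temporary directory and reports whether each
# FILE (a path relative to the archive root) exists and is not empty.
restore_drill() {
    local archive="$1"
    shift
    local scratch f status=0

    scratch="$(mktemp -d)" || return 1

    if ! tar -xzf "$archive" -C "$scratch"; then
        echo "FAIL: cannot extract $archive" >&2
        rm -rf "$scratch"
        return 1
    fi

    for f in "$@"; do
        if [ -s "$scratch/$f" ]; then
            echo "OK: $f"
        else
            echo "MISSING: $f"
            status=1
        fi
    done

    # Remove the scratch copy; the drill never writes outside of it.
    rm -rf "$scratch"
    return "$status"
}
```

Because tar strips the leading `/` when it creates an archive, an `/etc` backup would be checked with relative paths, for example `restore_drill etc_backup.tar.gz etc/fstab etc/passwd`. A non-zero exit status means at least one expected file was missing or empty.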