How to create system health check scripts

How to Create System Health Check Scripts System health check scripts are essential tools for maintaining optimal server performance and preventing critical failures before they impact your infrastructure. These automated scripts monitor key system metrics, identify potential issues, and provide administrators with timely alerts about system status. This comprehensive guide will walk you through creating robust health check scripts for both Linux and Windows environments, covering everything from basic monitoring to advanced alerting mechanisms. Table of Contents 1. [Introduction](#introduction) 2. [Prerequisites](#prerequisites) 3. [Understanding System Health Metrics](#understanding-system-health-metrics) 4. [Creating Basic Health Check Scripts](#creating-basic-health-check-scripts) 5. [Advanced Monitoring Features](#advanced-monitoring-features) 6. [Implementing Alerting Systems](#implementing-alerting-systems) 7. [Cross-Platform Considerations](#cross-platform-considerations) 8. [Automation and Scheduling](#automation-and-scheduling) 9. [Troubleshooting Common Issues](#troubleshooting-common-issues) 10. [Best Practices](#best-practices) 11. [Conclusion](#conclusion) Introduction System health check scripts serve as the first line of defense against system failures by continuously monitoring critical system components. These scripts can detect issues such as high CPU usage, memory leaks, disk space shortages, network connectivity problems, and service failures. By implementing comprehensive health checks, system administrators can proactively address problems before they escalate into critical outages. The benefits of implementing system health check scripts include: - Proactive monitoring: Identify issues before they become critical - Automated alerting: Receive notifications when thresholds are exceeded - Historical tracking: Maintain logs of system performance over time - Cost reduction: Minimize downtime and associated business costs - Improved reliability: Ensure consistent system performance Prerequisites Before creating system health check scripts, ensure you have: Technical Requirements - Basic understanding of shell scripting (Bash for Linux, PowerShell for Windows) - Administrative access to target systems - Text editor or integrated development environment (IDE) - Understanding of system administration concepts - Familiarity with system monitoring tools and commands System Requirements - Linux Systems: Bash shell, standard Unix utilities (ps, df, free, netstat) - Windows Systems: PowerShell 3.0 or later, WMI access - Network Access: For remote monitoring and alerting - Storage Space: Adequate space for log files and historical data Knowledge Prerequisites - Understanding of system performance metrics - Basic networking concepts - Log file management - Cron jobs (Linux) or Task Scheduler (Windows) - Email configuration for alerting Understanding System Health Metrics Core System Metrics Effective health check scripts monitor several key system metrics: CPU Utilization CPU usage indicates system processing load and helps identify performance bottlenecks. Monitor both current usage and load averages over time. Memory Usage Track RAM utilization, including available memory, used memory, and swap usage. Memory leaks and insufficient RAM can severely impact system performance. Disk Space and I/O Monitor disk space usage across all mounted filesystems and track disk I/O performance to prevent storage-related failures. Network Connectivity Verify network interfaces are operational and test connectivity to critical services and external resources. System Services Ensure critical services and processes are running and responding appropriately. System Load Monitor system load averages to understand overall system stress and performance trends. Threshold Definition Establishing appropriate thresholds is crucial for effective monitoring: - Warning Thresholds: 70-80% utilization for most resources - Critical Thresholds: 90-95% utilization requiring immediate attention - Custom Thresholds: Adjust based on specific system requirements and historical performance data Creating Basic Health Check Scripts Linux Health Check Script Here's a comprehensive Linux health check script that monitors essential system metrics: ```bash #!/bin/bash System Health Check Script for Linux Author: System Administrator Version: 1.0 Configuration LOG_FILE="/var/log/system_health.log" EMAIL_ALERT="admin@company.com" CPU_THRESHOLD=80 MEMORY_THRESHOLD=85 DISK_THRESHOLD=90 LOAD_THRESHOLD=2.0 Color codes for output RED='\033[0;31m' YELLOW='\033[1;33m' GREEN='\033[0;32m' NC='\033[0m' # No Color Function to log messages log_message() { echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE" } Function to send email alerts send_alert() { local subject="$1" local message="$2" echo "$message" | mail -s "$subject" "$EMAIL_ALERT" log_message "ALERT: $subject - $message" } Check CPU usage check_cpu() { echo "=== CPU Usage Check ===" cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//') cpu_idle=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | sed 's/%id,//') cpu_used=$(echo "100 - $cpu_idle" | bc) echo "CPU Usage: ${cpu_used}%" if (( $(echo "$cpu_used > $CPU_THRESHOLD" | bc -l) )); then echo -e "${RED}WARNING: CPU usage is high (${cpu_used}%)${NC}" send_alert "High CPU Usage Alert" "CPU usage is ${cpu_used}%, exceeding threshold of ${CPU_THRESHOLD}%" return 1 else echo -e "${GREEN}CPU usage is normal (${cpu_used}%)${NC}" return 0 fi } Check memory usage check_memory() { echo "=== Memory Usage Check ===" memory_info=$(free | grep Mem) total_mem=$(echo $memory_info | awk '{print $2}') used_mem=$(echo $memory_info | awk '{print $3}') memory_percent=$(echo "scale=2; $used_mem * 100 / $total_mem" | bc) echo "Memory Usage: ${memory_percent}%" echo "Total Memory: $(echo "scale=2; $total_mem / 1024 / 1024" | bc) GB" echo "Used Memory: $(echo "scale=2; $used_mem / 1024 / 1024" | bc) GB" if (( $(echo "$memory_percent > $MEMORY_THRESHOLD" | bc -l) )); then echo -e "${RED}WARNING: Memory usage is high (${memory_percent}%)${NC}" send_alert "High Memory Usage Alert" "Memory usage is ${memory_percent}%, exceeding threshold of ${MEMORY_THRESHOLD}%" return 1 else echo -e "${GREEN}Memory usage is normal (${memory_percent}%)${NC}" return 0 fi } Check disk usage check_disk() { echo "=== Disk Usage Check ===" disk_alert=0 while read filesystem size used available percent mountpoint; do if [[ $filesystem != "Filesystem" ]] && [[ $percent != "Use%" ]]; then usage_num=$(echo $percent | sed 's/%//') echo "Disk: $mountpoint - Usage: $percent" if [ "$usage_num" -gt "$DISK_THRESHOLD" ]; then echo -e "${RED}WARNING: Disk usage high on $mountpoint ($percent)${NC}" send_alert "High Disk Usage Alert" "Disk usage on $mountpoint is $percent, exceeding threshold of ${DISK_THRESHOLD}%" disk_alert=1 else echo -e "${GREEN}Disk usage normal on $mountpoint ($percent)${NC}" fi fi done < <(df -h) return $disk_alert } Check system load check_load() { echo "=== System Load Check ===" load_1min=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}' | sed 's/ //g') load_5min=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $2}' | sed 's/ //g') load_15min=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $3}' | sed 's/ //g') echo "Load Average: 1min=$load_1min, 5min=$load_5min, 15min=$load_15min" if (( $(echo "$load_1min > $LOAD_THRESHOLD" | bc -l) )); then echo -e "${RED}WARNING: System load is high (${load_1min})${NC}" send_alert "High System Load Alert" "System load is ${load_1min}, exceeding threshold of ${LOAD_THRESHOLD}" return 1 else echo -e "${GREEN}System load is normal (${load_1min})${NC}" return 0 fi } Check critical services check_services() { echo "=== Critical Services Check ===" services=("ssh" "nginx" "mysql" "apache2") service_alert=0 for service in "${services[@]}"; do if systemctl is-active --quiet "$service"; then echo -e "${GREEN}Service $service is running${NC}" else echo -e "${RED}WARNING: Service $service is not running${NC}" send_alert "Service Down Alert" "Critical service $service is not running" service_alert=1 fi done return $service_alert } Check network connectivity check_network() { echo "=== Network Connectivity Check ===" hosts=("8.8.8.8" "google.com") network_alert=0 for host in "${hosts[@]}"; do if ping -c 1 "$host" > /dev/null 2>&1; then echo -e "${GREEN}Network connectivity to $host: OK${NC}" else echo -e "${RED}WARNING: Cannot reach $host${NC}" send_alert "Network Connectivity Alert" "Cannot reach $host - network connectivity issue" network_alert=1 fi done return $network_alert } Main execution main() { echo "==========================================" echo "System Health Check - $(date)" echo "==========================================" log_message "Starting system health check" # Initialize alert counter total_alerts=0 # Run all checks check_cpu || ((total_alerts++)) echo check_memory || ((total_alerts++)) echo check_disk || ((total_alerts++)) echo check_load || ((total_alerts++)) echo check_services || ((total_alerts++)) echo check_network || ((total_alerts++)) echo "==========================================" echo "Health Check Summary" echo "==========================================" if [ $total_alerts -eq 0 ]; then echo -e "${GREEN}All systems are healthy!${NC}" log_message "Health check completed - All systems healthy" else echo -e "${YELLOW}Total alerts: $total_alerts${NC}" log_message "Health check completed - $total_alerts alerts generated" fi echo "Detailed logs available at: $LOG_FILE" } Execute main function main ``` Windows Health Check Script Here's a comprehensive PowerShell script for Windows system health monitoring: ```powershell System Health Check Script for Windows Author: System Administrator Version: 1.0 param( [string]$LogPath = "C:\Logs\SystemHealth.log", [string]$EmailTo = "admin@company.com", [int]$CPUThreshold = 80, [int]$MemoryThreshold = 85, [int]$DiskThreshold = 90 ) Function to write log entries function Write-Log { param([string]$Message, [string]$Level = "INFO") $timestamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss" $logEntry = "$timestamp - [$Level] $Message" Write-Host $logEntry Add-Content -Path $LogPath -Value $logEntry } Function to send email alerts function Send-Alert { param([string]$Subject, [string]$Body) try { $smtpServer = "smtp.company.com" $smtpFrom = "healthcheck@company.com" $mailMessage = New-Object System.Net.Mail.MailMessage $mailMessage.From = $smtpFrom $mailMessage.To.Add($EmailTo) $mailMessage.Subject = $Subject $mailMessage.Body = $Body $smtpClient = New-Object System.Net.Mail.SmtpClient($smtpServer) $smtpClient.Send($mailMessage) Write-Log "Alert sent: $Subject" "ALERT" } catch { Write-Log "Failed to send alert: $($_.Exception.Message)" "ERROR" } } Check CPU usage function Test-CPUUsage { Write-Host "=== CPU Usage Check ===" -ForegroundColor Cyan try { $cpuUsage = Get-WmiObject -Class Win32_Processor | Measure-Object -Property LoadPercentage -Average | Select-Object -ExpandProperty Average Write-Host "CPU Usage: $cpuUsage%" if ($cpuUsage -gt $CPUThreshold) { Write-Host "WARNING: CPU usage is high ($cpuUsage%)" -ForegroundColor Red Send-Alert "High CPU Usage Alert" "CPU usage is $cpuUsage%, exceeding threshold of $CPUThreshold%" Write-Log "High CPU usage detected: $cpuUsage%" "WARNING" return $false } else { Write-Host "CPU usage is normal ($cpuUsage%)" -ForegroundColor Green return $true } } catch { Write-Log "Error checking CPU usage: $($_.Exception.Message)" "ERROR" return $false } } Main execution function function Start-HealthCheck { Write-Host "==========================================" -ForegroundColor Yellow Write-Host "System Health Check - $(Get-Date)" -ForegroundColor Yellow Write-Host "==========================================" -ForegroundColor Yellow Write-Log "Starting system health check" # Execute all checks and generate summary # Implementation details continue... } Execute the health check Start-HealthCheck ``` Advanced Monitoring Features Database Health Checks For systems running databases, include specific database health monitoring: ```bash Database health check function check_database() { echo "=== Database Health Check ===" # MySQL health check if command -v mysql &> /dev/null; then mysql_status=$(mysqladmin ping 2>/dev/null) if [[ $mysql_status == "alive" ]]; then echo -e "${GREEN}MySQL is running${NC}" # Check for slow queries slow_queries=$(mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';" | awk 'NR==2{print $2}') echo "Slow queries: $slow_queries" # Check connections connections=$(mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" | awk 'NR==2{print $2}') max_connections=$(mysql -e "SHOW VARIABLES LIKE 'max_connections';" | awk 'NR==2{print $2}') connection_percent=$(echo "scale=2; $connections * 100 / $max_connections" | bc) echo "Database connections: $connections/$max_connections (${connection_percent}%)" else echo -e "${RED}WARNING: MySQL is not responding${NC}" send_alert "Database Alert" "MySQL database is not responding" fi fi } ``` Performance Metrics Collection Collect detailed performance metrics for trending analysis: ```bash Performance metrics collection collect_performance_metrics() { metrics_file="/var/log/performance_metrics.csv" timestamp=$(date '+%Y-%m-%d %H:%M:%S') # CPU metrics cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//') load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}' | sed 's/ //g') # Memory metrics memory_info=$(free | grep Mem) total_mem=$(echo $memory_info | awk '{print $2}') used_mem=$(echo $memory_info | awk '{print $3}') memory_percent=$(echo "scale=2; $used_mem * 100 / $total_mem" | bc) # Write metrics to CSV if [[ ! -f "$metrics_file" ]]; then echo "timestamp,cpu_usage,load_avg,memory_percent" > "$metrics_file" fi echo "$timestamp,$cpu_usage,$load_avg,$memory_percent" >> "$metrics_file" } ``` Implementing Alerting Systems Email Alerting Configure robust email alerting with HTML formatting: ```bash Enhanced email alerting function send_enhanced_alert() { local priority="$1" local subject="$2" local message="$3" local hostname=$(hostname) local timestamp=$(date '+%Y-%m-%d %H:%M:%S') # Set priority-based styling case $priority in "CRITICAL") color="#FF0000" urgency="High" ;; "WARNING") color="#FFA500" urgency="Medium" ;; "INFO") color="#0000FF" urgency="Low" ;; esac # Create HTML email content html_content="

[$priority] $subject

Server: $hostname

Time: $timestamp

Priority: $urgency


Details:

$message

This is an automated alert from the system health monitoring script.

" # Send email with HTML content echo "$html_content" | mail -s "[$priority] $subject - $hostname" \ -a "Content-Type: text/html" \ -a "X-Priority: 1" \ "$EMAIL_ALERT" log_message "ALERT SENT: [$priority] $subject" } ``` SMS Alerting For critical alerts, implement SMS notifications using Twilio API: ```bash SMS alert function using Twilio API send_sms_alert() { local message="$1" local phone_number="YOUR_PHONE_NUMBER" local account_sid="YOUR_TWILIO_ACCOUNT_SID" local auth_token="YOUR_TWILIO_AUTH_TOKEN" local from_number="YOUR_TWILIO_PHONE_NUMBER" curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$account_sid/Messages.json" \ --data-urlencode "To=$phone_number" \ --data-urlencode "From=$from_number" \ --data-urlencode "Body=$message" \ -u "$account_sid:$auth_token" } ``` Slack Integration Integrate with Slack for real-time team notifications: ```bash Slack notification function send_slack_alert() { local priority="$1" local subject="$2" local message="$3" local webhook_url="YOUR_SLACK_WEBHOOK_URL" local hostname=$(hostname) # Set emoji based on priority case $priority in "CRITICAL") emoji=":red_circle:" ;; "WARNING") emoji=":warning:" ;; "INFO") emoji=":information_source:" ;; esac # Create JSON payload payload=$(cat </5 * /opt/scripts/health_check.sh Every hour - comprehensive check with metrics collection 0 /opt/scripts/health_check.sh --comprehensive Daily at 2 AM - full system report 0 2 * /opt/scripts/health_check.sh --full-report Weekly on Sunday at 3 AM - maintenance check 0 3 0 /opt/scripts/health_check.sh --maintenance ``` Windows Task Scheduler Create scheduled tasks for Windows systems: ```powershell PowerShell script to create scheduled task $TaskName = "SystemHealthCheck" $ScriptPath = "C:\Scripts\health_check.ps1" Create basic daily task $Action = New-ScheduledTaskAction -Execute "PowerShell.exe" -Argument "-ExecutionPolicy Bypass -File $ScriptPath" $Trigger = New-ScheduledTaskTrigger -Daily -At "02:00" $Settings = New-ScheduledTaskSettingsSet -AllowStartIfOnBatteries -DontStopIfGoingOnBatteries Register-ScheduledTask -TaskName $TaskName -Action $Action -Trigger $Trigger -Settings $Settings Create frequent monitoring task (every 15 minutes) $FrequentTaskName = "SystemHealthCheckFrequent" $FrequentTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15) -RepetitionDuration (New-TimeSpan -Hours 23 -Minutes 59) Register-ScheduledTask -TaskName $FrequentTaskName -Action $Action -Trigger $FrequentTrigger -Settings $Settings ``` Systemd Service (Linux) Create a systemd service for continuous monitoring: ```ini /etc/systemd/system/health-monitor.service [Unit] Description=System Health Monitor After=network.target [Service] Type=simple ExecStart=/opt/scripts/health_monitor_daemon.sh Restart=always RestartSec=10 User=monitor Group=monitor StandardOutput=journal StandardError=journal [Install] WantedBy=multi-user.target ``` ```bash Enable and start the service sudo systemctl enable health-monitor.service sudo systemctl start health-monitor.service ``` Troubleshooting Common Issues Permission Problems Common permission issues and solutions: ```bash Fix log file permissions sudo chown monitor:monitor /var/log/system_health.log sudo chmod 644 /var/log/system_health.log Fix script execution permissions chmod +x /opt/scripts/health_check.sh Add user to required groups sudo usermod -a -G adm,systemd-journal monitor ``` Mail Configuration Issues Troubleshoot email delivery problems: ```bash Test mail configuration echo "Test message" | mail -s "Test Subject" admin@company.com Check mail logs tail -f /var/log/mail.log Configure postfix for relay sudo dpkg-reconfigure postfix ``` Network Connectivity Issues Debug network-related monitoring problems: ```bash Test connectivity with detailed output ping -c 4 -v 8.8.8.8 Check DNS resolution nslookup google.com Test specific ports telnet smtp.company.com 25 ``` Resource Monitoring Accuracy Ensure accurate resource monitoring: ```bash Calibrate CPU monitoring Use multiple samples for accuracy cpu_samples=() for i in {1..5}; do cpu_sample=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//') cpu_samples+=($cpu_sample) sleep 1 done Calculate average cpu_avg=$(echo "${cpu_samples[@]}" | awk '{sum=0; for(i=1; i<=NF; i++) sum+=$i; print sum/NF}') ``` Script Debugging Add comprehensive debugging capabilities: ```bash Debug mode function debug_mode() { if [[ "$DEBUG" == "true" ]]; then echo "DEBUG: $1" >&2 log_message "DEBUG: $1" fi } Verbose logging set -x # Enable command tracing trap 'echo "Error on line $LINENO"' ERR # Error trapping ``` Best Practices Security Considerations Implement security best practices: ```bash Secure credential storage Use environment variables or secure vaults export SMTP_PASSWORD=$(cat /etc/secrets/smtp_password) Restrict file permissions chmod 600 /etc/secrets/smtp_password chown root:root /etc/secrets/smtp_password Use dedicated monitoring user useradd -r -s /bin/false -d /var/lib/monitor monitor ``` Performance Optimization Optimize script performance: ```bash Parallel execution for independent checks check_cpu & cpu_pid=$! check_memory & memory_pid=$! check_disk & disk_pid=$! Wait for all checks to complete wait $cpu_pid $memory_pid $disk_pid ``` Error Handling Implement robust error handling: ```bash Error handling function handle_error() { local line_number=$1 local error_code=$2 local command="$3" log_message "ERROR: Command '$command' failed with exit code $error_code at line $line_number" send_alert "Script Error" "Health check script encountered an error at line $line_number: $command" } Set error trap trap 'handle_error ${LINENO} $? "$BASH_COMMAND"' ERR ``` Log Management Implement proper log rotation: ```bash Logrotate configuration for health check logs /etc/logrotate.d/health-check /var/log/system_health.log { daily rotate 30 compress delaycompress missingok notifempty create 644 monitor monitor } ``` Configuration Management Use configuration files for flexibility: ```bash Load configuration from file load_config() { local config_file="/etc/health-check/config.conf" if [[ -f "$config_file" ]]; then source "$config_file" else log_message "WARNING: Config file not found, using defaults" fi } Example config file /etc/health-check/config.conf CPU_THRESHOLD=85 MEMORY_THRESHOLD=90 DISK_THRESHOLD=95 EMAIL_ALERT="ops-team@company.com" ENABLE_SMS_ALERTS=true DEBUG_MODE=false ``` Monitoring Script Health Monitor the health check script itself: ```bash Script health check check_script_health() { local script_pid_file="/var/run/health-check.pid" local max_runtime=300 # 5 minutes if [[ -f "$script_pid_file" ]]; then local script_pid=$(cat "$script_pid_file") local script_start_time=$(stat -c %Y "$script_pid_file") local current_time=$(date +%s) local runtime=$((current_time - script_start_time)) if [[ $runtime -gt $max_runtime ]]; then log_message "WARNING: Health check script running too long ($runtime seconds)" kill -9 "$script_pid" 2>/dev/null || true rm -f "$script_pid_file" fi fi echo $$ > "$script_pid_file" } ``` Documentation and Maintenance Maintain comprehensive documentation: ```bash Script header documentation #!/bin/bash System Health Check Script Purpose: Monitor system resources and alert on threshold violations Author: System Administrator Team Version: 2.1 Last Updated: $(date) Dependencies: - bc (for floating point calculations) - mail (for email notifications) - curl (for webhook notifications) Configuration Files: - /etc/health-check/config.conf (main configuration) - /etc/health-check/thresholds.conf (monitoring thresholds) Log Files: - /var/log/system_health.log (main log) - /var/log/performance_metrics.csv (metrics data) Exit Codes: 0 - Success, all checks passed 1 - Warning conditions detected 2 - Critical conditions detected 3 - Script configuration error 4 - System error (permissions, dependencies, etc.) ``` Conclusion System health check scripts are indispensable tools for maintaining robust, reliable server infrastructure. By implementing comprehensive monitoring solutions that cover CPU usage, memory utilization, disk space, network connectivity, and service status, administrators can proactively identify and address potential issues before they impact system availability. The scripts and techniques presented in this guide provide a solid foundation for building effective monitoring solutions. Key takeaways include: Essential Components: - Comprehensive resource monitoring across all critical system metrics - Configurable thresholds with warning and critical levels - Multi-channel alerting systems including email, SMS, and collaboration tools - Detailed logging and historical data collection - Cross-platform compatibility considerations Implementation Best Practices: - Use proper error handling and debugging mechanisms - Implement security measures for credential management - Optimize performance through parallel execution where appropriate - Maintain comprehensive documentation and configuration management - Regular testing and validation of monitoring accuracy Operational Excellence: - Automate script execution through scheduling systems - Implement log rotation and maintenance procedures - Monitor the monitoring scripts themselves for reliability - Establish clear escalation procedures for different alert types - Regularly review and adjust thresholds based on system behavior Advanced Features: - Integration with external monitoring platforms - Custom metric collection for specific applications - Trend analysis and predictive alerting capabilities - Integration with configuration management systems - Support for containerized and cloud-native environments By following these guidelines and continuously refining your monitoring approach based on operational experience, you can build a robust system health monitoring solution that significantly reduces downtime risk and improves overall system reliability. Remember that monitoring is an ongoing process that requires regular review, updates, and optimization to remain effective as your infrastructure evolves. The investment in comprehensive system health monitoring pays dividends through improved system uptime, faster issue resolution, and better capacity planning capabilities. Start with the basic scripts provided in this guide, then gradually enhance them with additional features and integrations as your monitoring requirements mature.