# How to Monitor Kubernetes Pods on Linux

Kubernetes pod monitoring is a critical skill for any DevOps engineer, system administrator, or developer working with containerized applications. Effective monitoring keeps your applications running smoothly, helps identify performance bottlenecks, and enables quick troubleshooting when issues arise. This guide walks through methods and tools for monitoring Kubernetes pods on Linux systems, from basic kubectl commands to advanced monitoring solutions.

## Table of Contents

- [Prerequisites and Requirements](#prerequisites-and-requirements)
- [Understanding Kubernetes Pod Monitoring](#understanding-kubernetes-pod-monitoring)
- [Basic Pod Monitoring with kubectl](#basic-pod-monitoring-with-kubectl)
- [Advanced Monitoring Techniques](#advanced-monitoring-techniques)
- [Log Management and Analysis](#log-management-and-analysis)
- [Resource Monitoring and Metrics](#resource-monitoring-and-metrics)
- [Setting Up Monitoring Tools](#setting-up-monitoring-tools)
- [Automated Monitoring and Alerting](#automated-monitoring-and-alerting)
- [Troubleshooting Common Issues](#troubleshooting-common-issues)
- [Best Practices and Tips](#best-practices-and-tips)
- [Conclusion](#conclusion)

## Prerequisites and Requirements

Before diving into Kubernetes pod monitoring, ensure you have the following prerequisites in place:

### System Requirements

- Linux distribution (Ubuntu 18.04+, CentOS 7+, or equivalent)
- Minimum 4GB RAM and 2 CPU cores
- At least 20GB available disk space
- Network connectivity to your Kubernetes cluster

### Software Requirements

- Kubernetes cluster (version 1.20 or higher recommended)
- kubectl command-line tool installed and configured
- Docker or containerd runtime
- Basic understanding of Linux command line
- Text editor (vim, nano, or your preferred editor)

### Access Requirements

- Appropriate RBAC permissions for pod monitoring
- Cluster administrator access (for some advanced features)
- SSH access to cluster nodes (if monitoring node-level metrics)

To verify your kubectl installation and cluster connectivity, run:

```bash
kubectl version --client
kubectl cluster-info
kubectl get nodes
```

## Understanding Kubernetes Pod Monitoring

Kubernetes pod monitoring involves tracking pod lifecycle, performance, and health. Understanding these components is crucial for effective monitoring.

### Key Monitoring Areas

**Pod Lifecycle States:** Pods transition through different phases including Pending, Running, Succeeded, Failed, and Unknown. Monitoring these states helps identify deployment issues and application problems.

**Resource Utilization:** CPU, memory, storage, and network usage metrics provide insights into application performance and resource constraints.

**Application Logs:** Container logs contain valuable information about application behavior, errors, and performance indicators.

**Health Checks:** Kubernetes provides liveness, readiness, and startup probes to monitor application health automatically.

### Monitoring Layers

Effective Kubernetes monitoring operates at multiple layers:

1. Infrastructure Layer: Node health, network connectivity, storage availability
2. Platform Layer: Kubernetes API server, etcd, scheduler, controller manager
3. Application Layer: Pod status, container health, application metrics
4. Business Layer: Application-specific KPIs and business metrics

## Basic Pod Monitoring with kubectl

The kubectl command-line tool provides fundamental monitoring capabilities that every Kubernetes administrator should master.
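
The pod phases listed under Key Monitoring Areas lend themselves to a simple triage rule. The sketch below is one possible mapping, not a Kubernetes convention: the function names `phase_severity` and `triage_pods` and the `ok`/`warn`/`crit` labels are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Map a pod phase to a triage level.
# The ok/warn/crit labels are illustrative assumptions, not Kubernetes terms.
phase_severity() {
  case "$1" in
    Running|Succeeded) echo "ok" ;;
    Pending)           echo "warn" ;;
    Failed|Unknown)    echo "crit" ;;
    *)                 echo "unrecognized" ;;
  esac
}

# Read "<pod-name> <phase>" pairs on stdin and print every pod
# whose phase is not "ok", prefixed with its triage level.
triage_pods() {
  local name phase level
  while read -r name phase; do
    [ -z "$name" ] && continue
    level=$(phase_severity "$phase")
    [ "$level" != "ok" ] && echo "$level $name $phase"
  done
  return 0
}

# Possible usage against a live cluster:
#   kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' | triage_pods
```

Treating Pending as a warning rather than a failure reflects that pods pass through Pending during normal scheduling; whether that suits a given cluster is a judgment call.
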
### Viewing Pod Status

The most basic monitoring command displays current pod status:

```bash
# List all pods in the current namespace
kubectl get pods

# List pods in all namespaces
kubectl get pods --all-namespaces

# List pods with additional information (node, pod IP)
kubectl get pods -o wide

# Watch pod status changes in real time
kubectl get pods --watch
```

For more detailed pod information, use the describe command:

```bash
# Get detailed information about a specific pod
kubectl describe pod <pod-name>

# Get detailed information about all pods in a namespace
kubectl describe pods --namespace=<namespace>
```

### Monitoring Pod Events

Kubernetes events provide valuable insights into pod lifecycle changes and potential issues:

```bash
# View events for all resources
kubectl get events

# View events sorted by timestamp
kubectl get events --sort-by='.lastTimestamp'

# View events for a specific pod
kubectl describe pod <pod-name> | grep -A 10 Events

# Monitor events in real time
kubectl get events --watch
```

### Checking Pod Resource Usage

Monitor current resource consumption using the top command. It relies on the Metrics Server, whose installation is covered under Setting Up Monitoring Tools:

```bash
# View CPU and memory usage for pods
kubectl top pods

# View resource usage for all namespaces
kubectl top pods --all-namespaces

# View resource usage for a specific namespace
kubectl top pods --namespace=<namespace>

# Sort pods by CPU usage
kubectl top pods --sort-by=cpu

# Sort pods by memory usage
kubectl top pods --sort-by=memory
```

### Pod Status Filtering

Filter pods by phase or label to quickly identify problematic pods:

```bash
# Show only running pods
kubectl get pods --field-selector=status.phase=Running

# Show only pending pods
kubectl get pods --field-selector=status.phase=Pending

# Show only failed pods
kubectl get pods --field-selector=status.phase=Failed

# Show pods with specific labels
kubectl get pods -l app=nginx

# Show pods that are not in the Running phase
# (a Running pod can still fail its readiness probe; check the READY column for that)
kubectl get pods --field-selector=status.phase!=Running
```

## Advanced Monitoring Techniques

Beyond basic kubectl commands, several advanced techniques provide deeper insights into pod behavior and performance.
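
One gap in phase-based filtering is readiness: a pod can be in the Running phase while one of its containers is not ready. A minimal sketch of a filter on the READY column follows. The function name `not_ready_pods` is hypothetical, and the code assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Print pods whose READY column (ready/total containers) shows fewer
# ready containers than the pod defines. Completed pods are skipped
# because a finished Job pod legitimately reports 0/1.
not_ready_pods() {
  awk '{
    split($2, r, "/")
    if ($3 != "Completed" && r[1] + 0 < r[2] + 0) print $1, $2, $3
  }'
}

# Possible usage against a live cluster:
#   kubectl get pods --no-headers | not_ready_pods
```

Because the function only reads stdin, it can be tried on saved `kubectl get pods` output before being pointed at a cluster.
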
### Custom Resource Queries

Use JSONPath queries to extract specific information from pod resources:

```bash
# Get pod names and their node assignments
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'

# Get pod restart counts (first container of each pod only)
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Get pod IP addresses
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'

# Get pod creation timestamps
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
```

### Monitoring Pod Networking

Network connectivity is crucial for pod functionality. Monitor network-related aspects:

```bash
# Check pod network policies
kubectl get networkpolicies

# Verify service endpoints
kubectl get endpoints

# List services
kubectl get services

# Test network connectivity from within a pod
# (the container image must include nslookup and wget)
kubectl exec -it <pod-name> -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it <pod-name> -- wget -qO- http://kubernetes.default.svc.cluster.local
```

### Storage and Volume Monitoring

Monitor persistent volumes and storage usage:

```bash
# Check persistent volumes
kubectl get pv

# Check persistent volume claims
kubectl get pvc

# View storage classes
kubectl get storageclass

# Check volume mounts for a specific pod
kubectl describe pod <pod-name> | grep -A 5 Mounts
```

## Log Management and Analysis

Application logs provide critical insights into pod behavior and are essential for troubleshooting and monitoring.
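
Before reading logs line by line, a quick severity tally often shows whether a pod deserves a closer look. The following is a rough sketch under assumptions: the function name `count_log_levels` is hypothetical, and the keywords `FATAL`, `ERROR` and `WARN` assume the application writes those level names in its log lines.

```shell
#!/usr/bin/env bash
# Count log lines per severity keyword read from stdin.
# Each line is counted once, under the most severe keyword it contains.
# The keyword list is an assumption; adjust it to the application's log format.
count_log_levels() {
  awk '
    /FATAL/ { fatal++; next }
    /ERROR/ { error++; next }
    /WARN/  { warn++;  next }
    END { printf "FATAL=%d ERROR=%d WARN=%d\n", fatal, error, warn }
  '
}

# Possible usage against a live cluster:
#   kubectl logs <pod-name> --since=1h | count_log_levels
```
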
### Basic Log Viewing

Access pod logs using the kubectl logs command:

```bash
# View logs for a single-container pod
kubectl logs <pod-name>

# View logs for a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Follow logs in real time
kubectl logs -f <pod-name>

# View logs from the last hour
kubectl logs <pod-name> --since=1h

# View last 100 lines of logs
kubectl logs <pod-name> --tail=100
```

### Advanced Log Analysis

For deeper log analysis, combine further kubectl options with standard text tools:

```bash
# View logs from previous container instance (after restart)
kubectl logs <pod-name> --previous

# View logs with timestamps
kubectl logs <pod-name> --timestamps

# View logs from all containers in a pod
kubectl logs <pod-name> --all-containers

# Save logs to a file for analysis
kubectl logs <pod-name> > pod-logs.txt

# Search for specific patterns in logs
kubectl logs <pod-name> | grep ERROR
kubectl logs <pod-name> | grep -i "exception\|error\|fail"
```

### Log Aggregation Strategies

Production environments usually call for a dedicated log aggregation stack. As a lightweight starting point, the script below snapshots the logs of every pod in a namespace into a timestamped directory:

```bash
# Create a simple log collection script
cat << 'EOF' > collect-pod-logs.sh
#!/bin/bash
NAMESPACE=${1:-default}
OUTPUT_DIR="logs-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTPUT_DIR"

for pod in $(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}'); do
  echo "Collecting logs for pod: $pod"
  kubectl logs "$pod" -n "$NAMESPACE" > "$OUTPUT_DIR/$pod.log" 2>&1
done

echo "Logs collected in $OUTPUT_DIR"
EOF

chmod +x collect-pod-logs.sh
./collect-pod-logs.sh production
```

## Resource Monitoring and Metrics

Comprehensive resource monitoring helps optimize performance and prevent resource-related issues.
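
Resource quantities appear in several notations: `kubectl top` prints values such as `250m` and `128Mi`, while requests and limits in pod specs may also be written as `1`, `0.5` or `2Gi`. Comparing them numerically needs a common unit. The helpers below are a minimal sketch with hypothetical names (`cpu_to_millicores`, `mem_to_mib`); they handle only the `m` CPU suffix, the binary memory suffixes `Ki`, `Mi`, `Gi`, and plain byte counts. Decimal suffixes such as `M` or `G` are not covered.

```shell
#!/usr/bin/env bash
# Convert a CPU quantity to millicores: "250m" -> 250, "2" -> 2000, "0.5" -> 500.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Convert a memory quantity to whole MiB, rounding down:
# "128Mi" -> 128, "2Gi" -> 2048, "2048Ki" -> 2, "1048576" (bytes) -> 1.
mem_to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo $(( $1 / 1048576 )) ;;   # no suffix: treat as bytes
  esac
}
```

With both sides normalized, a check such as `[ "$(mem_to_mib "$usage")" -gt "$(mem_to_mib "$limit")" ]` becomes a plain integer comparison.
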
### CPU and Memory Monitoring

Monitor resource utilization patterns:

```bash
# Continuous monitoring script
cat << 'EOF' > monitor-resources.sh
#!/bin/bash
while true; do
  echo "=== $(date) ==="
  kubectl top pods --sort-by=memory | head -10
  echo ""
  kubectl top pods --sort-by=cpu | head -10
  echo ""
  sleep 30
done
EOF

chmod +x monitor-resources.sh
./monitor-resources.sh
```

### Resource Limit Monitoring

Check if pods are hitting resource limits:

```bash
# Check resource requests and limits
kubectl describe pods | grep -A 5 -B 5 "Limits\|Requests"

# Identify pods without resource limits
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits}{"\n"}{end}' | grep -v "cpu\|memory"

# Show per-container usage (pod, CPU, memory) to compare against the limits above
kubectl top pods --containers | awk 'NR>1 {print $1, $3, $4}'
```

### Disk and Storage Monitoring

Monitor storage capacity and availability:

```bash
# List persistent volumes with capacity and status
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase

# List persistent volume claims with status and capacity
kubectl get pvc -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,CAPACITY:.status.capacity.storage

# Check for storage-related events
kubectl get events --field-selector reason=FailedMount
```

## Setting Up Monitoring Tools

While kubectl provides basic monitoring capabilities, dedicated monitoring tools offer comprehensive solutions for production environments.

### Prometheus and Grafana Setup

Prometheus is the de facto standard for Kubernetes monitoring.
Here's a basic setup. Pod discovery only works if the Prometheus pod's ServiceAccount is allowed to list and watch pods; the RBAC objects are omitted here for brevity:

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
```

Deploy Prometheus:

```bash
kubectl apply -f prometheus-config.yaml
kubectl expose deployment prometheus --type=NodePort --port=9090
```

### Metrics Server Installation

The Metrics Server provides the resource usage metrics behind `kubectl top`:

```bash
# Install metrics server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics server installation
kubectl get deployment metrics-server -n kube-system

# Test metrics collection
kubectl top nodes
kubectl top pods
```

### Custom Monitoring Scripts

Create custom monitoring scripts for specific needs:

```bash
# pod-health-monitor.sh
cat << 'EOF' > pod-health-monitor.sh
#!/bin/bash
NAMESPACE=${1:-default}
THRESHOLD_CPU=80      # millicores
THRESHOLD_MEMORY=80   # MiB

echo "Pod Health Monitor - $(date)"
echo "================================"

# Check pod status
echo "Pod Status Summary:"
kubectl get pods -n "$NAMESPACE" --no-headers | awk '{print $3}' | sort | uniq -c

# Check resource usage
echo -e "\nHigh Resource Usage Pods:"
kubectl top pods -n "$NAMESPACE" --no-headers | while read -r pod cpu memory; do
  cpu=${cpu%m}          # strip the millicore suffix
  memory=${memory%Mi}   # strip the MiB suffix
  if [[ $cpu -gt $THRESHOLD_CPU ]] || [[ $memory -gt $THRESHOLD_MEMORY ]]; then
    echo "WARNING: $pod - CPU: ${cpu}m, Memory: ${memory}Mi"
  fi
done

# Check recent events
echo -e "\nRecent Events:"
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -5
EOF

chmod +x pod-health-monitor.sh
```

## Automated Monitoring and Alerting

Automated monitoring reduces manual oversight and ensures rapid response to issues.

### Health Check Automation

Implement automated health checks:

```bash
# automated-health-check.sh
cat << 'EOF' > automated-health-check.sh
#!/bin/bash
WEBHOOK_URL="your-slack-webhook-url"
NAMESPACE="production"

check_pod_health() {
  local failed_pods pending_pods
  failed_pods=$(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Failed --no-headers | wc -l)
  pending_pods=$(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Pending --no-headers | wc -l)

  if [[ $failed_pods -gt 0 ]] || [[ $pending_pods -gt 3 ]]; then
    send_alert "Pod Health Alert: $failed_pods failed, $pending_pods pending pods in $NAMESPACE"
  fi
}

check_resource_usage() {
  kubectl top pods -n "$NAMESPACE" --no-headers | while read -r pod cpu memory; do
    cpu=${cpu%m}
    memory=${memory%Mi}
    if [[ $cpu -gt 1000 ]] || [[ $memory -gt 1000 ]]; then
      send_alert "High Resource Usage: $pod - CPU: ${cpu}m, Memory: ${memory}Mi"
    fi
  done
}

send_alert() {
  local message=$1
  echo "$(date): $message" >> monitoring.log
  # Uncomment to send to Slack
  # curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$message\"}" "$WEBHOOK_URL"
}

check_pod_health
check_resource_usage
EOF

chmod +x automated-health-check.sh

# Run every 5 minutes via cron, keeping any existing crontab entries
(crontab -l 2>/dev/null; echo "*/5 * * * * /path/to/automated-health-check.sh") | crontab -
```

### Log-based Alerting

Monitor logs for specific patterns:

```bash
# log-monitor.sh
cat << 'EOF' > log-monitor.sh
#!/bin/bash
NAMESPACE=${1:-default}
ERROR_PATTERNS="ERROR|FATAL|Exception|OutOfMemory"

for pod in $(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}'); do
  error_count=$(kubectl logs "$pod" -n "$NAMESPACE" --since=5m | grep -cE "$ERROR_PATTERNS")
  if [[ $error_count -gt 10 ]]; then
    echo "ALERT: Pod $pod has $error_count errors in the last 5 minutes"
    # Send alert or take action
  fi
done
EOF

chmod +x log-monitor.sh
```

## Troubleshooting Common Issues

Understanding common pod monitoring issues and their solutions is crucial for effective Kubernetes management.

### Pod Status Issues

**Issue: Pods stuck in Pending state**

```bash
# Diagnose pending pods
kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>

# Common causes and solutions:
# 1. Insufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# 2. Node selector issues
kubectl describe pod <pod-name> | grep -A 5 "Node-Selectors"

# 3. Storage issues
kubectl get pvc
kubectl describe pvc <pvc-name>
```

**Issue: Pods in CrashLoopBackOff state**

```bash
# Check logs from the crashed container instance
kubectl logs <pod-name> --previous

# Check resource limits
kubectl describe pod <pod-name> | grep -A 10 "Limits"

# Check health probes
kubectl describe pod <pod-name> | grep -A 5 "Liveness\|Readiness"
```

### Resource Monitoring Issues

**Issue: Metrics Server not working**

```bash
# Check metrics server status
kubectl get pods -n kube-system | grep metrics-server

# Check metrics server logs
kubectl logs -n kube-system deployment/metrics-server

# Common fix for certificate issues (test clusters only:
# this flag disables verification of kubelet certificates)
kubectl patch deployment metrics-server -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```

**Issue: High resource usage alerts**

```bash
# Identify resource-hungry pods
kubectl top pods --sort-by=memory | head -10
kubectl top pods --sort-by=cpu | head -10

# Check for memory leaks (the image must include top and free)
kubectl exec -it <pod-name> -- top
kubectl exec -it <pod-name> -- free -h

# Compare usage against the configured requests and limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"
```

### Log Collection Issues

**Issue: Logs not available or truncated**

```bash
# Check container state and last termination reason
kubectl describe pod <pod-name> | grep -A 10 "State:"

# Check the node's container runtime and kubelet versions
# (log rotation itself is set by the kubelet's containerLogMaxSize
# and containerLogMaxFiles options)
kubectl describe node <node-name> | grep -A 10 "System Info"

# Access logs directly from the node (if needed);
# on containerd nodes use "crictl logs" instead of "docker logs"
ssh <node-name> docker logs <container-id>
```

### Network Monitoring Issues

**Issue: Pod connectivity problems**

```bash
# Test DNS resolution
kubectl exec -it <pod-name> -- nslookup kubernetes.default

# Check network policies
kubectl get networkpolicies
kubectl describe networkpolicy <policy-name>

# Test service connectivity
kubectl exec -it <pod-name> -- wget -qO- http://<service-name>:<port>

# Check endpoints
kubectl get endpoints <service-name>
```

## Best Practices and Tips

Implementing monitoring best practices ensures reliable and efficient Kubernetes pod monitoring.

### Monitoring Strategy Best Practices

**Establish Monitoring Baselines:** Before implementing alerts, establish normal operating baselines for your applications:

```bash
# Collect baseline metrics
cat << 'EOF' > baseline-collector.sh
#!/bin/bash
NAMESPACE=${1:-default}
HOURS=${2:-24}

echo "Collecting hourly baseline metrics for $HOURS hours"
for ((i = 0; i < HOURS; i++)); do
  kubectl top pods -n "$NAMESPACE" > "baseline-$(date +%Y%m%d-%H%M).txt"
  sleep 3600  # one sample per hour
done
EOF
```

**Implement Layered Monitoring:** Monitor at multiple levels: infrastructure, platform, and application. Labels and annotations make pods discoverable by the monitoring stack:

```yaml
# monitoring-labels.yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  labels:
    app: example-app
    tier: frontend
    monitoring: enabled
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: app
      image: nginx
      ports:
        - containerPort: 8080
```

### Resource Management Tips

**Set Appropriate Resource Requests and Limits:**

```yaml
# resource-limits-example.yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-managed-pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
```

**Use Horizontal Pod Autoscaling:**

```bash
# Enable HPA based on CPU usage
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=1 --max=10

# Check HPA status
kubectl get hpa
kubectl describe hpa nginx-deployment
```

### Log Management Best Practices

**Implement Log Rotation and Retention:** Container logs on a node are rotated and eventually lost, so ship them to a central store. The following Fluentd configuration tails container logs and forwards them to Elasticsearch:

```yaml
# logging-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name kubernetes
      type_name _doc
    </match>
```

### Security Monitoring

**Monitor Security-Related Events:**

```bash
# security-monitor.sh
cat << 'EOF' > security-monitor.sh
#!/bin/bash

# Monitor failed authentication attempts
kubectl get events --all-namespaces | grep -i "forbidden\|unauthorized"

# Check for privileged containers (the flag is set per container)
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].securityContext.privileged}{"\n"}{end}' | grep true

# Monitor service account usage
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceAccountName}{"\n"}{end}'
EOF
```

### Performance Optimization

**Optimize Monitoring Overhead:**

```bash
# Efficient resource monitoring (columns: NAME, CPU, MEMORY)
kubectl top pods --no-headers | awk '$2+0 > 100 || $3+0 > 100 {print $1, "High Usage:", $2, $3}'

# Use label selectors for targeted monitoring
kubectl get pods -l tier=frontend --watch

# Batch operations for efficiency
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' | grep -v Running
```

### Alerting Best Practices

**Implement Smart Alerting:**

```bash
# intelligent-alerting.sh
cat << 'EOF' > intelligent-alerting.sh
#!/bin/bash
ALERT_THRESHOLD=5
ALERT_WINDOW=300  # 5 minutes

# Count "Failed" events for a pod within the alert window
count_failures() {
  local pod=$1
  local failures=0
  local start_time=$(( $(date +%s) - ALERT_WINDOW ))

  # Process substitution keeps the loop in the current shell,
  # so the counter is still set after the loop ends
  while read -r timestamp; do
    event_time=$(date -d "$timestamp" +%s)
    if [[ $event_time -gt $start_time ]]; then
      ((failures++))
    fi
  done < <(kubectl get events --field-selector involvedObject.name="$pod" --output json |
    jq -r '.items[] | select(.reason == "Failed") | .firstTimestamp')

  echo "$failures"
}

# Check all pods and alert on persistent failures
for pod in $(kubectl get pods --field-selector=status.phase=Failed -o jsonpath='{.items[*].metadata.name}'); do
  failures=$(count_failures "$pod")
  if [[ $failures -gt $ALERT_THRESHOLD ]]; then
    echo "CRITICAL: Pod $pod has failed $failures times in the last 5 minutes"
  fi
done
EOF
```

## Conclusion

Effective Kubernetes pod monitoring is essential for maintaining healthy, performant applications in containerized environments. This guide has covered fundamental techniques and advanced strategies for monitoring pods on Linux systems, from basic kubectl commands to dedicated monitoring solutions.

Key takeaways from this guide include:

- **Master the Basics**: Understanding kubectl commands for pod status, logs, and resource usage forms the foundation of effective monitoring
- **Implement Layered Monitoring**: Monitor at infrastructure, platform, and application levels for comprehensive visibility
- **Automate When Possible**: Use scripts and tools to automate routine monitoring tasks and alerting
- **Follow Best Practices**: Set appropriate resource limits, implement proper logging, and use intelligent alerting strategies
- **Plan for Scale**: As your Kubernetes deployment grows, invest in dedicated monitoring solutions like Prometheus and Grafana

### Next Steps

To further enhance your Kubernetes monitoring capabilities:

1. Implement a comprehensive monitoring stack with Prometheus, Grafana, and Alertmanager
2. Set up centralized logging with the ELK stack (Elasticsearch, Logstash, Kibana) or similar solutions
3. Develop custom monitoring dashboards tailored to your specific applications and business requirements
4. Create runbooks for common monitoring scenarios and incident response procedures
5. Establish monitoring governance with clear responsibilities and escalation procedures

Monitoring is an ongoing process that needs refinement as your applications and infrastructure evolve. Review monitoring strategies, alert thresholds, and dashboards regularly so that they keep supporting your operational objectives.

With the techniques and best practices outlined in this guide, you can maintain visibility into Kubernetes pod health, performance, and behavior, enabling proactive management and rapid incident resolution in your containerized environment.