# How to Troubleshoot Kubernetes Pods on Linux

Kubernetes pod troubleshooting is an essential skill for DevOps engineers, system administrators, and developers working with containerized applications. When pods fail to start, crash unexpectedly, or exhibit performance issues, knowing how to diagnose and resolve these problems quickly can mean the difference between minimal downtime and an extended service outage. This guide walks through systematic approaches to identifying, diagnosing, and resolving common Kubernetes pod issues on Linux systems.

## Table of Contents

1. [Prerequisites and Requirements](#prerequisites-and-requirements)
2. [Understanding Pod Lifecycle and States](#understanding-pod-lifecycle-and-states)
3. [Essential Troubleshooting Commands](#essential-troubleshooting-commands)
4. [Step-by-Step Troubleshooting Process](#step-by-step-troubleshooting-process)
5. [Common Pod Issues and Solutions](#common-pod-issues-and-solutions)
6. [Advanced Debugging Techniques](#advanced-debugging-techniques)
7. [Monitoring and Logging Best Practices](#monitoring-and-logging-best-practices)
8. [Performance Troubleshooting](#performance-troubleshooting)
9. [Network-Related Issues](#network-related-issues)
10. [Best Practices and Prevention](#best-practices-and-prevention)
11. [Conclusion](#conclusion)
## Prerequisites and Requirements

Before diving into Kubernetes pod troubleshooting, make sure the following prerequisites are in place:

### System Requirements

- Linux-based system (Ubuntu 18.04+, CentOS 7+, or similar)
- Kubernetes cluster (version 1.18 or later recommended)
- `kubectl` command-line tool installed and configured
- Appropriate cluster access permissions (RBAC configured)
- Basic understanding of containerization concepts

### Required Tools

```bash
# Install kubectl if not already available
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Verify installation
kubectl version --client

# Additional useful tools
sudo apt-get update
sudo apt-get install -y jq curl wget net-tools
```

### Access Verification

```bash
# Verify cluster connectivity
kubectl cluster-info

# Check your current context
kubectl config current-context

# List available namespaces
kubectl get namespaces
```

## Understanding Pod Lifecycle and States

Understanding the Kubernetes pod lifecycle is crucial for effective troubleshooting. Pods progress through several phases:

### Pod Phases

1. **Pending**: Pod accepted by the cluster but not yet scheduled, or containers not yet created
2. **Running**: Pod bound to a node, all containers created, at least one running
3. **Succeeded**: All containers terminated successfully
4. **Failed**: All containers terminated, at least one of them in failure
5. **Unknown**: Pod state cannot be determined
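The phases above can also be checked in bulk across a namespace. As a minimal sketch (assuming `jq` is installed, as in the prerequisites), the pipeline below counts pods per phase; the two-pod JSON document is a canned stand-in for live `kubectl get pods -o json` output, and the pod names are hypothetical:

```bash
# Count pods per phase from `kubectl get pods -o json`-style output.
# The JSON below is a canned stand-in for live cluster output;
# the pod names are hypothetical.
cat <<'EOF' | jq -r '.items[].status.phase' | sort | uniq -c
{
  "items": [
    { "metadata": { "name": "web-1" }, "status": { "phase": "Running" } },
    { "metadata": { "name": "web-2" }, "status": { "phase": "Pending" } }
  ]
}
EOF

# Against a real cluster, the same pipeline would be:
#   kubectl get pods -o json | jq -r '.items[].status.phase' | sort | uniq -c
```

This prints one count per phase, which makes a cluster-wide anomaly (say, a sudden cluster of `Pending` pods) easy to spot at a glance.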
### Container States

- **Waiting**: Container not running (e.g. pulling an image, applying secrets)
- **Running**: Container executing without issues
- **Terminated**: Container finished execution or failed

### Checking Pod Status

```bash
# Get pod status in the default namespace
kubectl get pods

# Get detailed pod information
kubectl get pods -o wide

# Check pods in all namespaces
kubectl get pods --all-namespaces

# Get pod status with additional columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,IP:.status.podIP
```

## Essential Troubleshooting Commands

Master these fundamental kubectl commands for effective pod troubleshooting:

### Basic Information Commands

```bash
# Describe pod for detailed information
kubectl describe pod <pod-name> -n <namespace>

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Get logs from the previous container instance
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs in real time
kubectl logs -f <pod-name> -n <namespace>

# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
```

### Advanced Information Gathering

```bash
# Get pod YAML configuration
kubectl get pod <pod-name> -n <namespace> -o yaml

# Get pod JSON output for programmatic analysis
kubectl get pod <pod-name> -n <namespace> -o json

# Check pod events
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

# Sort events by timestamp
kubectl get events --sort-by='.lastTimestamp' -n <namespace>
```

### Resource Investigation

```bash
# Check resource usage
kubectl top pod <pod-name> -n <namespace>

# Check node resources
kubectl top nodes

# Get pod resource requests and limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 -B 5 -i "requests\|limits"
```

## Step-by-Step Troubleshooting Process

Follow this systematic approach when troubleshooting Kubernetes pods:

### Step 1: Initial Assessment

```bash
# Start with basic pod status
kubectl get pods -n <namespace>

# If the pod exists, get detailed status
kubectl describe pod <pod-name> -n <namespace>

# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

### Step 2: Analyze the Pod Description

The `kubectl describe` command provides comprehensive information.
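Much of what `describe` reports is also available as structured JSON for scripting. As a minimal sketch (again assuming `jq` from the prerequisites), this extracts the condition type/status pairs; the canned single-pod document stands in for real `kubectl get pod <pod-name> -o json` output:

```bash
# Print each pod condition as "type=status".
# The JSON below is a canned stand-in for `kubectl get pod <pod-name> -o json`.
cat <<'EOF' | jq -r '.status.conditions[] | "\(.type)=\(.status)"'
{
  "status": {
    "phase": "Running",
    "conditions": [
      { "type": "PodScheduled", "status": "True" },
      { "type": "Ready", "status": "False" }
    ]
  }
}
EOF
```

A `Ready=False` line from a pod whose phase is `Running` is the cue to dig into events and logs next.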
Focus on these sections:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

Key sections to examine:

- **Status**: Current pod phase
- **Conditions**: Detailed status conditions
- **Events**: Chronological list of pod events
- **Containers**: Individual container status
- **Volumes**: Storage mount information

### Step 3: Examine Logs

```bash
# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous instance logs (for crashed containers)
kubectl logs <pod-name> -n <namespace> --previous

# Logs with timestamps
kubectl logs <pod-name> -n <namespace> --timestamps=true

# Last N lines of logs
kubectl logs <pod-name> -n <namespace> --tail=50
```

### Step 4: Check Resource Constraints

```bash
# Node resource utilization
kubectl describe node <node-name>

# Pod resource usage
kubectl top pod <pod-name> -n <namespace>

# Check whether the pod is being evicted due to resource pressure
kubectl get events -n <namespace> | grep -i evict
```

### Step 5: Network Connectivity Testing

```bash
# Execute commands inside the pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# Test network connectivity from within the pod
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default.svc.cluster.local

# Check service connectivity
kubectl exec <pod-name> -n <namespace> -- curl -I http://<service-name>.<namespace>.svc.cluster.local
```

## Common Pod Issues and Solutions

### Issue 1: ImagePullBackOff Error

**Symptoms**: Pod stuck in `ImagePullBackOff` or `ErrImagePull` state

**Diagnosis**:

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" in the events
```

**Common causes and solutions**:

**1. Incorrect image name or tag:**

```yaml
# Fix the deployment YAML
spec:
  containers:
  - name: myapp
    image: nginx:1.21  # Ensure correct tag
```

**2. Private registry authentication:**

```bash
# Create an image pull secret
kubectl create secret docker-registry myregistrykey \
  --docker-server=<registry-server> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
```

```yaml
# Reference it in the pod spec
spec:
  imagePullSecrets:
  - name: myregistrykey
```

**3. Network connectivity issues:**

```bash
# Test from the node
docker pull <image-name>

# Check DNS resolution
nslookup <registry-host>
```

### Issue 2: CrashLoopBackOff Error

**Symptoms**: Pod continuously restarting

**Diagnosis**:

```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

**Solutions**:
**1. Application configuration issues:**

```bash
# Check environment variables
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Environment:"

# Verify ConfigMap/Secret references
kubectl get configmap <configmap-name> -n <namespace> -o yaml
```

**2. Resource constraints:**

```yaml
# Adjust resource limits
spec:
  containers:
  - name: myapp
    resources:
      limits:
        memory: "512Mi"
        cpu: "500m"
      requests:
        memory: "256Mi"
        cpu: "250m"
```

**3. Health check failures:**

```yaml
# Configure appropriate health checks
spec:
  containers:
  - name: myapp
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```

### Issue 3: Pending State

**Symptoms**: Pod stuck in `Pending` state

**Diagnosis**:

```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
```

**Common causes**:

**1. Insufficient resources:**

```bash
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check resource requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Requests"
```

**2. Node selector constraints:**

```bash
# Check node labels
kubectl get nodes --show-labels

# Verify the pod's node selector
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 nodeSelector
```
**3. Taints and tolerations:**

```bash
# Check node taints
kubectl describe nodes | grep -i taint
```

```yaml
# Add tolerations if needed
spec:
  tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

## Advanced Debugging Techniques

### Interactive Debugging

**Execute a shell in a running pod:**

```bash
# Access the pod shell
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# For a specific container in a multi-container pod
kubectl exec -it <pod-name> -c <container-name> -n <namespace> -- /bin/bash

# Run single commands
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- df -h
kubectl exec <pod-name> -n <namespace> -- netstat -tulpn
```

**Debug with ephemeral containers (Kubernetes 1.18+):**

```bash
# Create an ephemeral debug container
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

# Debug node issues
kubectl debug node/<node-name> -it --image=busybox
```

### Log Aggregation and Analysis

**Centralized logging setup:**

```bash
# Check whether logging is configured
kubectl get pods -n kube-system | grep -E "(fluentd|logstash|filebeat)"

# View aggregated logs (if an ELK stack is deployed)
kubectl port-forward -n kube-system svc/kibana 5601:5601
```

**Log analysis commands:**

```bash
# Search for specific errors
kubectl logs <pod-name> -n <namespace> | grep -i error

# Count error occurrences
kubectl logs <pod-name> -n <namespace> | grep -c "ERROR"

# Show logs starting from a given timestamp
kubectl logs <pod-name> -n <namespace> --since-time=2023-01-01T10:00:00Z
```

### Performance Profiling

**Resource monitoring:**

```bash
# Continuous resource monitoring
watch kubectl top pod <pod-name> -n <namespace>

# Point-in-time resource usage via the metrics API (if metrics-server is available)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/" | jq
```

**Application profiling:**

```bash
# Port-forward to the profiling endpoint
kubectl port-forward <pod-name> 6060:6060 -n <namespace>

# Access profiling data (for Go applications)
curl http://localhost:6060/debug/pprof/goroutine?debug=1
```

## Monitoring and Logging Best Practices

### Setting Up Monitoring

**Deploy metrics collection:**

```bash
# Install metrics-server if not present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics availability
kubectl top nodes
kubectl top pods --all-namespaces
```
**Configure alerts:**

```yaml
# Example PrometheusRule for pod monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-monitoring
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
```

### Logging Configuration

**Structured logging example:**

```yaml
# Application deployment with proper logging
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        env:
        - name: LOG_LEVEL
          value: "INFO"
        - name: LOG_FORMAT
          value: "json"
```

**Log retention and rotation:**

```bash
# Check node log configuration
sudo ls -la /var/log/containers/
sudo cat /etc/docker/daemon.json | grep -A 5 "log-opts"
```

## Performance Troubleshooting

### CPU and Memory Issues

**Identifying resource bottlenecks:**

```bash
# Check current resource usage
kubectl top pod <pod-name> -n <namespace>

# Get detailed resource information
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 -B 10 "Limits\|Requests"

# Check for OOMKilled containers
kubectl describe pod <pod-name> -n <namespace> | grep -i "oomkilled"
```

**Resource optimization:**

```yaml
# Proper resource configuration
spec:
  containers:
  - name: myapp
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```

### Storage Performance

**Persistent Volume troubleshooting:**

```bash
# Check PV and PVC status
kubectl get pv,pvc -n <namespace>

# Describe storage issues
kubectl describe pvc <pvc-name> -n <namespace>

# Check the storage class
kubectl get storageclass
```

**Storage performance testing:**

```bash
# Test disk I/O from within the pod
kubectl exec <pod-name> -n <namespace> -- dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct

# Check mount points
kubectl exec <pod-name> -n <namespace> -- df -h
kubectl exec <pod-name> -n <namespace> -- mount | grep -v tmpfs
```

## Network-Related Issues

### Service Discovery Problems

**DNS troubleshooting:**

```bash
# Test DNS resolution from the pod
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default.svc.cluster.local
```
```bash
# Check DNS configuration
kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf

# Test service connectivity
kubectl exec <pod-name> -n <namespace> -- curl -v http://<service-name>.<namespace>.svc.cluster.local
```

**Service configuration verification:**

```bash
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Describe the service for detailed information
kubectl describe service <service-name> -n <namespace>

# Test the service from a different namespace
kubectl run test-pod --image=busybox -it --rm --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
```

### Network Policy Issues

**Network policy troubleshooting:**

```bash
# List network policies
kubectl get networkpolicy -n <namespace>

# Describe a network policy
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity between pods
kubectl exec <pod-name> -n <namespace> -- nc -zv <target-ip> <target-port>
```

### Ingress and Load Balancer Issues

**Ingress troubleshooting:**

```bash
# Check ingress status
kubectl get ingress -n <namespace>

# Describe the ingress for backend information
kubectl describe ingress <ingress-name> -n <namespace>

# Check ingress controller logs
kubectl logs -n ingress-nginx deployment/nginx-ingress-controller
```

## Best Practices and Prevention

### Deployment Best Practices

**Health checks configuration:**

```yaml
spec:
  containers:
  - name: myapp
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
```

**Resource management:**

```yaml
spec:
  containers:
  - name: myapp
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
```

### Monitoring and Alerting

**Essential metrics to monitor:**

- Pod restart count
- Resource utilization (CPU, memory)
- Application-specific metrics
- Error rates and response times

**Automated troubleshooting scripts:**

```bash
#!/bin/bash
# pod-health-check.sh
NAMESPACE=${1:-default}
POD_NAME=${2}

echo "Checking pod health for $POD_NAME in namespace $NAMESPACE"

# Basic status
kubectl get pod $POD_NAME -n $NAMESPACE

# Detailed information
kubectl describe pod $POD_NAME -n $NAMESPACE

# Recent logs
echo "Recent logs:"
kubectl logs $POD_NAME -n $NAMESPACE --tail=20

# Events
echo "Recent events:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD_NAME --sort-by='.lastTimestamp'
```
### Documentation and Knowledge Sharing

**Maintain troubleshooting runbooks:**

- Document common issues and solutions
- Create step-by-step procedures
- Include relevant commands and configurations
- Update runbooks regularly as new issues surface

**Team knowledge sharing:**

- Regular troubleshooting sessions
- Post-mortem analysis of incidents
- Shared troubleshooting tools and scripts
- Cross-training on different components

## Conclusion

Effective Kubernetes pod troubleshooting requires a systematic approach, combining a solid understanding of the pod lifecycle with practical debugging skills. The key to successful troubleshooting lies in following a structured methodology: start with basic status checks, analyze pod descriptions and logs, investigate resource constraints, and test network connectivity.

Remember these essential principles:

1. **Start with the basics**: Always begin with `kubectl get pods` and `kubectl describe pod`
2. **Follow the logs**: Application logs often contain the most valuable debugging information
3. **Check resources**: Many issues stem from resource constraints or misconfigurations
4. **Test connectivity**: Network issues are common in distributed systems
5. **Use the right tools**: Familiarize yourself with kubectl commands and debugging utilities
6. **Document solutions**: Keep track of common issues and their resolutions

As you gain experience with Kubernetes troubleshooting, you will develop intuition for quickly identifying and resolving issues. The commands and techniques outlined in this guide provide a solid foundation for handling most pod-related problems you will encounter in production environments.
Continue expanding your troubleshooting skills by staying up to date with Kubernetes releases, participating in community discussions, and practicing different failure scenarios in test environments. The investment in mastering these skills pays dividends in maintaining reliable, performant Kubernetes deployments.

For further learning, consider exploring advanced topics such as custom resource troubleshooting, operator debugging, and cluster-level issue resolution. The Kubernetes ecosystem is vast and continuously evolving, making ongoing learning essential for effective cluster management and troubleshooting.