How to Troubleshoot Kubernetes Pods on Linux
Kubernetes pod troubleshooting is an essential skill for DevOps engineers, system administrators, and developers working with containerized applications. When pods fail to start, crash unexpectedly, or exhibit performance issues, knowing how to diagnose and resolve these problems quickly can mean the difference between minimal downtime and extended service outages. This comprehensive guide will walk you through systematic approaches to identify, diagnose, and resolve common Kubernetes pod issues on Linux systems.
Table of Contents
1. [Prerequisites and Requirements](#prerequisites-and-requirements)
2. [Understanding Pod Lifecycle and States](#understanding-pod-lifecycle-and-states)
3. [Essential Troubleshooting Commands](#essential-troubleshooting-commands)
4. [Step-by-Step Troubleshooting Process](#step-by-step-troubleshooting-process)
5. [Common Pod Issues and Solutions](#common-pod-issues-and-solutions)
6. [Advanced Debugging Techniques](#advanced-debugging-techniques)
7. [Monitoring and Logging Best Practices](#monitoring-and-logging-best-practices)
8. [Performance Troubleshooting](#performance-troubleshooting)
9. [Network-Related Issues](#network-related-issues)
10. [Best Practices and Prevention](#best-practices-and-prevention)
11. [Conclusion](#conclusion)
Prerequisites and Requirements
Before diving into Kubernetes pod troubleshooting, ensure you have the following prerequisites in place:
System Requirements
- Linux-based system (Ubuntu 18.04+, CentOS 7+, or similar)
- Kubernetes cluster (version 1.18 or later recommended)
- kubectl command-line tool installed and configured
- Appropriate cluster access permissions (RBAC configured)
- Basic understanding of containerization concepts
Required Tools
```bash
# Install kubectl if not already available
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Verify installation
kubectl version --client

# Additional useful tools
sudo apt-get update
sudo apt-get install -y jq curl wget net-tools
```
Access Verification
```bash
# Verify cluster connectivity
kubectl cluster-info

# Check your current context
kubectl config current-context

# List available namespaces
kubectl get namespaces
```
Understanding Pod Lifecycle and States
Understanding Kubernetes pod lifecycle is crucial for effective troubleshooting. Pods progress through several phases during their lifecycle:
Pod Phases
1. Pending: Pod accepted by cluster but not yet scheduled or containers not created
2. Running: Pod bound to node, all containers created, at least one running
3. Succeeded: All containers terminated successfully
4. Failed: All containers terminated, at least one failed
5. Unknown: Pod state cannot be determined
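As a quick reference, the phase alone often suggests the first diagnostic command to run. The mapping below is an illustrative sketch — the command hints are common starting points, not an official decision tree:

```shell
#!/bin/sh
# Map a pod phase to a sensible first diagnostic step.
# Illustrative only: these hints are suggestions, not an exhaustive procedure.
suggest_next_step() {
  case "$1" in
    Pending)   echo "kubectl describe pod  # check Events for scheduling failures" ;;
    Running)   echo "kubectl logs  # confirm the application itself is healthy" ;;
    Succeeded) echo "no action needed: all containers exited successfully" ;;
    Failed)    echo "kubectl logs --previous  # inspect the failed container" ;;
    Unknown)   echo "kubectl get nodes  # the node may be unreachable" ;;
    *)         echo "unrecognized phase: $1" ;;
  esac
}

suggest_next_step Pending
suggest_next_step Failed
```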
Container States
- Waiting: Container not running (pulling image, applying secrets)
- Running: Container executing without issues
- Terminated: Container finished execution or failed
Checking Pod Status
```bash
# Get pod status in default namespace
kubectl get pods

# Get detailed pod information
kubectl get pods -o wide

# Check pods in all namespaces
kubectl get pods --all-namespaces

# Get pod status with additional columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,IP:.status.podIP
```
Essential Troubleshooting Commands
Master these fundamental kubectl commands for effective pod troubleshooting:
Basic Information Commands
```bash
# Describe pod for detailed information
kubectl describe pod <pod-name> -n <namespace>

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Get logs from previous container instance
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# Get logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
```
Advanced Information Gathering
```bash
# Get pod YAML configuration
kubectl get pod <pod-name> -n <namespace> -o yaml

# Get pod JSON output for programmatic analysis
kubectl get pod <pod-name> -n <namespace> -o json

# Check pod events
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

# Sort events by timestamp
kubectl get events --sort-by='.lastTimestamp' -n <namespace>
```
Resource Investigation
```bash
# Check resource usage
kubectl top pod <pod-name> -n <namespace>

# Check node resources
kubectl top nodes

# Get pod resource requests and limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 -B 5 -i "requests\|limits"
```
Step-by-Step Troubleshooting Process
Follow this systematic approach when troubleshooting Kubernetes pods:
Step 1: Initial Assessment
```bash
# Start with basic pod status
kubectl get pods -n <namespace>

# If pod exists, get detailed status
kubectl describe pod <pod-name> -n <namespace>

# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
Step 2: Analyze Pod Description
The `kubectl describe` command provides comprehensive information. Focus on these sections:
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Key sections to examine:
- Status: Current pod phase
- Conditions: Detailed status conditions
- Events: Chronological list of pod events
- Containers: Individual container status
- Volumes: Storage mount information
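On long `describe` outputs it helps to isolate one section at a time. The snippet below extracts everything from the `Events:` header onward with awk; it runs against a canned sample here so it works without a cluster, but in practice you would pipe `kubectl describe pod <pod-name>` into the same filter:

```shell
#!/bin/sh
# Print only the Events section of a `kubectl describe pod` dump.
# The sample text below stands in for real describe output.
describe_output='Name: myapp-7d4b9
Status: Running
Conditions:
  Ready True
Events:
  Normal   Scheduled  assigned default/myapp-7d4b9 to node-1
  Warning  BackOff    restarting failed container'

# Once awk sees the "Events:" header it prints every remaining line.
printf '%s\n' "$describe_output" | awk '/^Events:/ {found=1} found'
```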
Step 3: Examine Logs
```bash
# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous instance logs (for crashed containers)
kubectl logs <pod-name> -n <namespace> --previous

# Logs with timestamps
kubectl logs <pod-name> -n <namespace> --timestamps=true

# Last N lines of logs
kubectl logs <pod-name> -n <namespace> --tail=50
```
Step 4: Check Resource Constraints
```bash
# Node resource utilization
kubectl describe node <node-name>

# Pod resource usage
kubectl top pod <pod-name> -n <namespace>

# Check if pod is being evicted due to resource pressure
kubectl get events -n <namespace> | grep -i evict
```
Step 5: Network Connectivity Testing
```bash
# Execute commands inside the pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# Test network connectivity from within pod
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default.svc.cluster.local

# Check service connectivity
kubectl exec <pod-name> -n <namespace> -- curl -I http://<service-name>.<namespace>.svc.cluster.local
```
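The service URLs above follow the in-cluster DNS convention `<service>.<namespace>.svc.<cluster-domain>`, where the default cluster domain is `cluster.local`. A small helper function (hypothetical, purely for illustration) makes that construction explicit:

```shell
#!/bin/sh
# Build an in-cluster service FQDN: <service>.<namespace>.svc.<cluster-domain>.
# Defaults assume a standard cluster; pass a custom cluster domain as the
# third argument if yours differs.
service_fqdn() {
  svc=$1
  ns=${2:-default}
  domain=${3:-cluster.local}
  echo "${svc}.${ns}.svc.${domain}"
}

service_fqdn myapp production   # myapp.production.svc.cluster.local
service_fqdn web                # web.default.svc.cluster.local
```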
Common Pod Issues and Solutions
Issue 1: ImagePullBackOff Error
Symptoms: Pod stuck in `ImagePullBackOff` or `ErrImagePull` state
Diagnosis:
```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" in events
```
Common Causes and Solutions:
1. Incorrect image name or tag:
```yaml
# Fix deployment YAML
spec:
  containers:
  - name: myapp
    image: nginx:1.21  # Ensure correct tag
```
2. Private registry authentication:
```bash
# Create image pull secret
kubectl create secret docker-registry myregistrykey \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>

# Reference in pod spec (YAML)
spec:
  imagePullSecrets:
  - name: myregistrykey
```
3. Network connectivity issues:
```bash
# Test the pull directly from the node
docker pull <image-name>

# Check DNS resolution of the registry
nslookup <registry-hostname>
```
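The event messages usually tell you which of the three causes above applies. As a rough triage aid, a pattern match over the message text can point in the right direction — the patterns below are heuristics based on typical container-runtime error fragments, not exact kubelet output:

```shell
#!/bin/sh
# Rough triage of image-pull failure messages. The patterns are heuristics,
# not an exhaustive or authoritative list.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)           echo "bad image name or tag" ;;
    *"unauthorized"*|*"authentication required"*) echo "registry auth: add an imagePullSecret" ;;
    *"i/o timeout"*|*"no such host"*)             echo "network/DNS problem reaching the registry" ;;
    *)                                            echo "unclassified: read the full event message" ;;
  esac
}

classify_pull_error "rpc error: code = NotFound desc = manifest unknown"
classify_pull_error "pull access denied: authentication required"
classify_pull_error "dial tcp: lookup registry.example.com: no such host"
```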
Issue 2: CrashLoopBackOff Error
Symptoms: Pod continuously restarting
Diagnosis:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
Solutions:
1. Application configuration issues:
```bash
# Check environment variables
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Environment:"

# Verify ConfigMap/Secret references
kubectl get configmap <configmap-name> -n <namespace> -o yaml
```
2. Resource constraints:
```yaml
# Adjust resource limits
spec:
  containers:
  - name: myapp
    resources:
      limits:
        memory: "512Mi"
        cpu: "500m"
      requests:
        memory: "256Mi"
        cpu: "250m"
```
3. Health check failures:
```yaml
# Configure appropriate health checks
spec:
  containers:
  - name: myapp
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```
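When diagnosing CrashLoopBackOff, keep the restart cadence in mind: the kubelet delays each restart with exponential back-off, starting around 10 seconds and doubling up to a 5-minute cap (the delay resets after the container runs cleanly for a while). A quick illustration of that schedule:

```shell
#!/bin/sh
# Illustrate the kubelet's restart back-off: ~10s, doubling to a 300s cap.
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: wait ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```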
Issue 3: Pending State
Symptoms: Pod stuck in `Pending` state
Diagnosis:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
```
Common Causes:
1. Insufficient resources:
```bash
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check resource requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Requests"
```
2. Node selector constraints:
```bash
# Check node labels
kubectl get nodes --show-labels

# Verify pod node selector
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 nodeSelector
```
3. Taints and tolerations:
```bash
# Check node taints
kubectl describe nodes | grep -i taint

# Add tolerations if needed (pod spec YAML)
spec:
  tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```
Advanced Debugging Techniques
Interactive Debugging
Execute shell in running pod:
```bash
# Access pod shell
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# For specific container in multi-container pod
kubectl exec -it <pod-name> -c <container-name> -n <namespace> -- /bin/bash

# Run single commands
kubectl exec <pod-name> -n <namespace> -- ps aux
kubectl exec <pod-name> -n <namespace> -- df -h
kubectl exec <pod-name> -n <namespace> -- netstat -tulpn
```
Debug with ephemeral containers via `kubectl debug` (Kubernetes 1.18+):
```bash
# Create ephemeral debug container
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

# Debug node issues
kubectl debug node/<node-name> -it --image=busybox
```
Log Aggregation and Analysis
Centralized logging setup:
```bash
# Check if logging is configured
kubectl get pods -n kube-system | grep -E "(fluentd|logstash|filebeat)"

# View aggregated logs (if ELK stack is deployed)
kubectl port-forward -n kube-system svc/kibana 5601:5601
```
Log analysis commands:
```bash
# Search for specific errors
kubectl logs <pod-name> -n <namespace> | grep -i error

# Count error occurrences
kubectl logs <pod-name> -n <namespace> | grep -c "ERROR"

# Show logs since a given timestamp (kubectl logs has no end-time flag)
kubectl logs <pod-name> -n <namespace> --since-time=2023-01-01T10:00:00Z
```
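Before chasing individual messages, it is often worth summarizing how many lines each log level contributes. The awk one-liner below does that over a canned sample, assuming the level is the second whitespace-separated field (which depends on your log format); with a real pod you would pipe `kubectl logs <pod-name>` into it instead:

```shell
#!/bin/sh
# Count log lines per level. Assumes "<timestamp> <LEVEL> <message>" lines;
# adjust the awk field number if your format differs.
logs='2023-01-01T10:00:01Z INFO starting server
2023-01-01T10:00:02Z ERROR db connection refused
2023-01-01T10:00:03Z WARN retrying in 5s
2023-01-01T10:00:04Z ERROR db connection refused'

printf '%s\n' "$logs" | awk '{count[$2]++} END {for (l in count) print l, count[l]}' | sort
```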
Performance Profiling
Resource monitoring:
```bash
# Continuous resource monitoring
watch kubectl top pod -n <namespace>

# Point-in-time resource usage from the metrics API (if metrics-server is available)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq
```
Application profiling:
```bash
# Port forward for profiling endpoints
kubectl port-forward <pod-name> 6060:6060 -n <namespace>

# Access profiling data (for Go applications exposing net/http/pprof)
curl http://localhost:6060/debug/pprof/goroutine?debug=1
```
Monitoring and Logging Best Practices
Setting Up Monitoring
Deploy metrics collection:
```bash
# Install metrics-server if not present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics availability
kubectl top nodes
kubectl top pods --all-namespaces
```
Configure alerts:
```yaml
# Example PrometheusRule for pod monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-monitoring
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
```
Logging Configuration
Structured logging example:
```yaml
# Application deployment with proper logging (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        env:
        - name: LOG_LEVEL
          value: "INFO"
        - name: LOG_FORMAT
          value: "json"
```
Log retention and rotation:
```bash
# Check node log configuration
sudo ls -la /var/log/containers/

# Docker log rotation settings (if using the Docker runtime)
sudo cat /etc/docker/daemon.json | grep -A 5 "log-opts"
```
Performance Troubleshooting
CPU and Memory Issues
Identifying resource bottlenecks:
```bash
# Check current resource usage
kubectl top pod <pod-name> -n <namespace>

# Get detailed resource information
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 -B 10 "Limits\|Requests"

# Check for OOMKilled containers
kubectl describe pod <pod-name> -n <namespace> | grep -i "oomkilled"
```
Resource optimization:
```yaml
# Proper resource configuration
spec:
  containers:
  - name: myapp
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```
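Comparing these limits with `kubectl top` output is easier once the units are normalized: `250m` CPU means 250 millicores (a quarter of a core), and `128Mi` is 128 × 1024 × 1024 bytes. Two minimal converters, covering only these common suffixes:

```shell
#!/bin/sh
# Normalize common Kubernetes resource quantities.
# Only the "m" CPU suffix and "Mi" memory suffix are handled here.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;             # already in millicores
    *)  echo "$(( $1 * 1000 ))" ;;   # whole cores -> millicores
  esac
}
mem_mi_to_bytes() {
  echo "$(( ${1%Mi} * 1024 * 1024 ))"
}

cpu_to_millicores 500m    # 500
cpu_to_millicores 2       # 2000
mem_mi_to_bytes 128Mi     # 134217728
```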
Storage Performance
Persistent Volume troubleshooting:
```bash
# Check PV and PVC status
kubectl get pv,pvc -n <namespace>

# Describe storage issues
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage class
kubectl get storageclass
```
Storage performance testing:
```bash
# Test disk I/O from within pod
kubectl exec <pod-name> -n <namespace> -- dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct

# Check mount points
kubectl exec <pod-name> -n <namespace> -- df -h
kubectl exec <pod-name> -n <namespace> -- mount | grep -v tmpfs
```
Network-Related Issues
Service Discovery Problems
DNS troubleshooting:
```bash
# Test DNS resolution from pod
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default.svc.cluster.local

# Check DNS configuration
kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf

# Test service connectivity
kubectl exec <pod-name> -n <namespace> -- curl -v http://<service-name>.<namespace>.svc.cluster.local
```
Service configuration verification:
```bash
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Describe service for detailed information
kubectl describe service <service-name> -n <namespace>

# Test service from different namespace
kubectl run test-pod --image=busybox -it --rm --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local
```
Network Policy Issues
Network policy troubleshooting:
```bash
# List network policies
kubectl get networkpolicy -n <namespace>

# Describe network policy
kubectl describe networkpolicy <policy-name> -n <namespace>

# Test connectivity between pods
kubectl exec <pod-name> -n <namespace> -- nc -zv <target-host> <port>
```
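Connectivity tests can fail transiently (DNS caches, endpoints still propagating), so it helps to retry before concluding a policy is blocking traffic. Below is a small retry wrapper, shown around a placeholder command; in practice the wrapped command would be something like `kubectl exec <pod-name> -- nc -zv <host> <port>`:

```shell
#!/bin/sh
# Retry a command up to N times with a short pause between attempts.
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i failed, retrying..."
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Placeholder command: `true` always succeeds, standing in for the real check.
retry 3 true && echo "reachable"
```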
Ingress and Load Balancer Issues
Ingress troubleshooting:
```bash
# Check ingress status
kubectl get ingress -n <namespace>

# Describe ingress for backend information
kubectl describe ingress <ingress-name> -n <namespace>

# Check ingress controller logs
kubectl logs -n ingress-nginx deployment/nginx-ingress-controller
```
Best Practices and Prevention
Deployment Best Practices
Health checks configuration:
```yaml
spec:
  containers:
  - name: myapp
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
```
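When tuning these numbers, compute the worst-case delay before Kubernetes acts: a container that fails its liveness probe from the start is restarted after roughly `initialDelaySeconds + failureThreshold × periodSeconds`. For the values above:

```shell
#!/bin/sh
# Worst-case time before a persistently failing liveness probe triggers a
# restart, using the probe settings from the example above.
initial_delay=30
period=10
failure_threshold=3

echo "restart after ~$((initial_delay + failure_threshold * period))s of failing probes"
```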
Resource management:
```yaml
spec:
  containers:
  - name: myapp
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
```
Monitoring and Alerting
Essential metrics to monitor:
- Pod restart count
- Resource utilization (CPU, memory)
- Application-specific metrics
- Error rates and response times
Automated troubleshooting scripts:
```bash
#!/bin/bash
# pod-health-check.sh
NAMESPACE=${1:-default}
POD_NAME=${2:?usage: pod-health-check.sh <namespace> <pod-name>}

echo "Checking pod health for $POD_NAME in namespace $NAMESPACE"

# Basic status
kubectl get pod "$POD_NAME" -n "$NAMESPACE"

# Detailed information
kubectl describe pod "$POD_NAME" -n "$NAMESPACE"

# Recent logs
echo "Recent logs:"
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=20

# Events
echo "Recent events:"
kubectl get events -n "$NAMESPACE" --field-selector involvedObject.name="$POD_NAME" --sort-by='.lastTimestamp'
```
Documentation and Knowledge Sharing
Maintain troubleshooting runbooks:
- Document common issues and solutions
- Create step-by-step procedures
- Include relevant commands and configurations
- Regular updates based on new issues
Team knowledge sharing:
- Regular troubleshooting sessions
- Post-mortem analysis of incidents
- Shared troubleshooting tools and scripts
- Cross-training on different components
Conclusion
Effective Kubernetes pod troubleshooting requires a systematic approach, combining fundamental understanding of pod lifecycle with practical debugging skills. The key to successful troubleshooting lies in following a structured methodology: starting with basic status checks, analyzing pod descriptions and logs, investigating resource constraints, and testing network connectivity.
Remember these essential principles:
1. Start with the basics: Always begin with `kubectl get pods` and `kubectl describe pod`
2. Follow the logs: Application logs often contain the most valuable debugging information
3. Check resources: Many issues stem from resource constraints or misconfigurations
4. Test connectivity: Network issues are common in distributed systems
5. Use the right tools: Familiarize yourself with kubectl commands and debugging utilities
6. Document solutions: Keep track of common issues and their resolutions
As you gain experience with Kubernetes troubleshooting, you'll develop intuition for quickly identifying and resolving issues. The commands and techniques outlined in this guide provide a solid foundation for handling most pod-related problems you'll encounter in production environments.
Continue expanding your troubleshooting skills by staying updated with Kubernetes releases, participating in community discussions, and practicing with different scenarios in test environments. The investment in mastering these skills will pay dividends in maintaining reliable, performant Kubernetes deployments.
For further learning, consider exploring advanced topics such as custom resource troubleshooting, operator debugging, and cluster-level issue resolution. The Kubernetes ecosystem is vast and continuously evolving, making ongoing learning essential for effective cluster management and troubleshooting.