How to Monitor Kubernetes with Prometheus on Linux
Monitoring Kubernetes clusters is essential for maintaining application performance, ensuring system reliability, and troubleshooting issues before they impact users. Prometheus, an open-source monitoring and alerting toolkit, has become the de facto standard for Kubernetes monitoring due to its powerful metrics collection capabilities and seamless integration with cloud-native technologies.
This comprehensive guide will walk you through setting up Prometheus to monitor your Kubernetes cluster on Linux, from initial installation to advanced configuration and troubleshooting. Whether you're a DevOps engineer, system administrator, or developer working with containerized applications, this tutorial provides the knowledge needed to implement robust monitoring solutions.
Prerequisites and Requirements
Before diving into the setup process, ensure you have the following components ready:
System Requirements
- Linux Distribution: Ubuntu 18.04+, CentOS 7+, or RHEL 7+
- CPU: Minimum 2 cores (4+ recommended for production)
- Memory: 4GB RAM minimum (8GB+ for production environments)
- Storage: 50GB available disk space (more for metric retention)
- Network: Stable internet connection for downloading components
Required Software
- Kubernetes Cluster: Version 1.18 or later (can be minikube, kubeadm, or managed service)
- kubectl: Configured to communicate with your cluster
- Helm: Version 3.0+ (recommended for easier deployment)
- Docker: For container runtime (if not using containerd)
Access Requirements
- Cluster administrator privileges
- Ability to create namespaces, deployments, and services
- Network access to cluster nodes and pods
Verification Commands
Before proceeding, verify your environment:
```bash
# Check Kubernetes cluster status
kubectl cluster-info

# Verify node readiness
kubectl get nodes

# Check available resources (requires metrics-server)
kubectl top nodes

# Confirm Helm installation
helm version
```
Understanding Prometheus Architecture in Kubernetes
Prometheus operates on a pull-based model, periodically scraping metrics from configured targets. In a Kubernetes environment, the architecture typically includes:
Core Components
1. Prometheus Server: Central component that scrapes and stores metrics
2. Node Exporter: Collects hardware and OS metrics from cluster nodes
3. kube-state-metrics: Exposes cluster-level metrics about Kubernetes objects
4. cAdvisor: Built into kubelet, provides container resource usage metrics
5. Alertmanager: Handles alerts sent by Prometheus server
Service Discovery
Kubernetes service discovery allows Prometheus to automatically discover and monitor:
- Pods with specific annotations
- Services exposing metrics endpoints
- Nodes in the cluster
- API server metrics
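For example, a pod can opt in to annotation-based scraping like this (a minimal sketch with a hypothetical pod name and port; it assumes a scrape job keyed off the `prometheus.io/*` annotations, such as the `kubernetes-pods` job shown later in this guide):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                   # hypothetical pod used for illustration
  annotations:
    prometheus.io/scrape: "true"     # opt this pod in to scraping
    prometheus.io/path: "/metrics"   # path of the metrics endpoint
    prometheus.io/port: "8080"       # port serving the metrics endpoint
spec:
  containers:
  - name: app
    image: nginx:1.25                # placeholder image
    ports:
    - containerPort: 8080
```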
Step-by-Step Installation Guide
Method 1: Using Helm Charts (Recommended)
Helm provides the most straightforward way to deploy Prometheus with sensible defaults and easy customization options.
Step 1: Add Prometheus Helm Repository
```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Update Helm repositories
helm repo update

# Verify repository addition
helm search repo prometheus
```
Step 2: Create Monitoring Namespace
```bash
# Create a dedicated namespace for monitoring components
kubectl create namespace monitoring

# Verify namespace creation
kubectl get namespaces
```
Step 3: Install Prometheus Stack
```bash
# Install kube-prometheus-stack (includes Prometheus, Grafana, and Alertmanager).
# Adjust storageClassName to a StorageClass that exists in your cluster.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=default \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Monitor installation progress
kubectl get pods -n monitoring -w
```
Step 4: Verify Installation
```bash
# Check all monitoring components
kubectl get all -n monitoring

# Verify the Prometheus server is running
kubectl get pods -n monitoring | grep prometheus-prometheus

# Check services
kubectl get svc -n monitoring
```
Method 2: Manual YAML Deployment
For more control over the configuration, you can deploy Prometheus using custom YAML manifests.
Step 1: Create Service Account and RBAC
```yaml
# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```
Step 2: Create ConfigMap for Prometheus Configuration
```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
```
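Note that the `kubernetes-pods` job above only rewrites the metrics path. Most setups also honor a `prometheus.io/port` annotation and copy pod metadata into labels; a sketch of the commonly used extra relabel rules (based on the standard Prometheus Kubernetes example configuration) looks like this:
```yaml
# Extra relabel rules for the kubernetes-pods job; append them under its
# relabel_configs, matching the indentation of the existing entries.
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__          # scrape the port given in prometheus.io/port
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: kubernetes_pod_name
```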
Step 3: Deploy Prometheus Server
```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
  labels:
    app: prometheus-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:latest  # pin a specific version for production
        args:
          - '--storage.tsdb.retention.time=12h'
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus/'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
        resources:
          requests:
            cpu: 500m
            memory: 500Mi
          limits:
            cpu: 1
            memory: 1Gi
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-config
      - name: prometheus-storage-volume
        emptyDir: {}  # data is lost on pod restart; use a PersistentVolumeClaim for durable storage
```
Step 4: Apply Configurations
```bash
# Apply all configurations
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-config.yaml
kubectl apply -f prometheus-deployment.yaml

# Create a service for Prometheus
kubectl expose deployment prometheus-deployment --port=9090 --target-port=9090 --name=prometheus-service --namespace=monitoring
```
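To confirm the manual deployment came up, port-forward the service you just created and open the targets page (a quick sanity check; `prometheus-service` is the name used in the expose command above):
```bash
# Wait for the Prometheus pod to become ready
kubectl rollout status deployment/prometheus-deployment -n monitoring

# Forward the service, then browse http://localhost:9090/targets
kubectl port-forward -n monitoring svc/prometheus-service 9090:9090
```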
Configuring Node Exporter
Node Exporter provides detailed metrics about the underlying infrastructure, including CPU, memory, disk, and network statistics.
Deploy Node Exporter as DaemonSet
```yaml
# node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest  # pin a specific version for production
        args:
          - '--path.procfs=/host/proc'
          - '--path.sysfs=/host/sys'
          - '--path.rootfs=/rootfs'
          - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)'
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        - name: root
          mountPath: /rootfs
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /
      tolerations:
      - effect: NoSchedule
        operator: Exists
```
Apply the Node Exporter configuration:
```bash
kubectl apply -f node-exporter.yaml
# Verify Node Exporter pods are running on all nodes
kubectl get pods -n monitoring -o wide | grep node-exporter
```
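If you used the manual deployment, the sample Prometheus configuration above does not yet scrape Node Exporter. One common approach (a sketch that assumes node service discovery advertises the kubelet port 10250 and that Node Exporter listens on host port 9100, as in the DaemonSet above) is an additional scrape job:
```yaml
# Addition to the scrape_configs section of prometheus.yml
- job_name: 'node-exporter'
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Node discovery advertises the kubelet port; rewrite it to Node Exporter's host port
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:9100'
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
```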
Setting Up kube-state-metrics
kube-state-metrics provides insights into the state of Kubernetes objects like deployments, pods, and services.
```yaml
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.6.0
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
```
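The manifest above does not include a Service, so Prometheus has no stable endpoint to scrape. A minimal sketch of a companion Service (the manifest file name is arbitrary):
```yaml
# kube-state-metrics-service.yaml (hypothetical companion manifest)
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  selector:
    app: kube-state-metrics
  ports:
  - name: http-metrics
    port: 8080
    targetPort: 8080
  - name: telemetry
    port: 8081
    targetPort: 8081
```
Apply both manifests with `kubectl apply -f`, then either add `prometheus.io/scrape: "true"` and `prometheus.io/port: "8080"` annotations to the pod template or define a dedicated scrape job pointing at `kube-state-metrics.monitoring.svc:8080`.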
Accessing Prometheus Dashboard
Port Forwarding Method
The quickest way to access Prometheus is through port forwarding:
```bash
# Forward the Prometheus port to your local machine
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Access Prometheus at http://localhost:9090
```
LoadBalancer Service Method
For persistent access, create a LoadBalancer service:
```yaml
# prometheus-loadbalancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-loadbalancer
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app.kubernetes.io/name: prometheus
```
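After applying the manifest, your cloud provider assigns an external address, which you can watch for:
```bash
# Apply the service and wait for an external IP or hostname to appear
kubectl apply -f prometheus-loadbalancer.yaml
kubectl get svc prometheus-loadbalancer -n monitoring -w
```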
Ingress Method
For production environments, use an Ingress controller:
```yaml
# prometheus-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: prometheus.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-kube-prometheus-prometheus
            port:
              number: 9090
```
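Apply the Ingress and point DNS for `prometheus.yourdomain.com` at your ingress controller's address:
```bash
kubectl apply -f prometheus-ingress.yaml
kubectl get ingress prometheus-ingress -n monitoring
```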
Essential Queries and Metrics
Once Prometheus is running, you can start exploring metrics using PromQL (Prometheus Query Language).
Basic System Metrics
```promql
# CPU usage per node
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory usage per node
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage per node
100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) * 100)

# Network I/O per node
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
Kubernetes-Specific Metrics
```promql
# Pod CPU usage
rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])

# Pod memory usage
container_memory_working_set_bytes{container!="POD",container!=""}

# Number of pods per namespace
count(kube_pod_info) by (namespace)

# Pod restart count over the last hour
increase(kube_pod_container_status_restarts_total[1h])

# Fraction of node CPU reserved for system overhead (capacity minus allocatable)
(kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) / kube_node_status_capacity{resource="cpu"}
```
Application Metrics
```promql
# HTTP request rate
rate(http_requests_total[5m])

# HTTP error rate (fraction of requests returning 5xx)
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
Configuring Alerting Rules
Create alerting rules to proactively monitor your cluster. The PrometheusRule resource below relies on the Prometheus Operator CRDs installed by kube-prometheus-stack; if you used the manual deployment, place equivalent rules in a file referenced by rule_files instead:
```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: prometheus  # lets kube-prometheus-stack's rule selector pick up this rule; match your Helm release name
spec:
  groups:
  - name: kubernetes.rules
    rules:
    - alert: KubernetesPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[10m]) * 60 * 10 > 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Pod is crash looping
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
    - alert: KubernetesNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes node not ready
        description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
    - alert: HighCPUUsage
      expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: High CPU usage detected
        description: "CPU usage is above 80% on {{ $labels.instance }}"
    - alert: HighMemoryUsage
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: High memory usage detected
        description: "Memory usage is above 85% on {{ $labels.instance }}"
```
Apply the alerting rules:
```bash
kubectl apply -f prometheus-rules.yaml
```
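You can confirm the rules were loaded (a quick check, assuming the Helm release name `prometheus` used earlier and an active port-forward to Prometheus on localhost:9090):
```bash
# List PrometheusRule objects in the monitoring namespace
kubectl get prometheusrules -n monitoring

# Query the Prometheus rules API and list loaded group/rule names
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"[^"]*"' | sort -u
```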
Common Troubleshooting Issues
Issue 1: Prometheus Not Scraping Targets
Symptoms: Targets showing as "DOWN" in Prometheus UI
Solutions:
```bash
# Check service discovery
kubectl get endpoints -n monitoring

# Verify network policies
kubectl get networkpolicies -A

# Check pod logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0

# Test connectivity from inside the Prometheus pod
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- wget -qO- http://target-service:port/metrics
```
Issue 2: High Memory Usage
Symptoms: Prometheus pod getting OOMKilled
Solutions:
```bash
# Increase memory limits
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type='merge' -p='{"spec":{"resources":{"limits":{"memory":"4Gi"}}}}'

# Reduce retention time
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type='merge' -p='{"spec":{"retention":"7d"}}'

# Also consider optimizing queries and reducing label cardinality
```
Issue 3: Storage Issues
Symptoms: "no space left on device" errors
Solutions:
```bash
# Check disk usage
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- df -h

# Increase storage size (only works if the StorageClass supports volume expansion)
kubectl patch pvc prometheus-prometheus-kube-prometheus-prometheus-db-prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring -p='{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# Clean up old data as a last resort (deleting TSDB blocks by hand loses that data
# permanently; prefer lowering retention and letting Prometheus expire blocks itself)
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- rm -rf /prometheus/01*
```
Issue 4: Service Discovery Problems
Symptoms: Missing metrics from certain services
Solutions:
```bash
# Check RBAC permissions
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:prometheus

# Verify annotations on pods
kubectl get pods -o yaml | grep -A5 -B5 prometheus.io

# Check the generated Prometheus configuration
kubectl get prometheus prometheus-kube-prometheus-prometheus -n monitoring -o yaml
```
Performance Optimization and Best Practices
Resource Management
1. CPU and Memory Sizing:
```yaml
resources:
  requests:
    cpu: 1000m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi
```
2. Storage Configuration:
```yaml
storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```
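With the Helm installation, both snippets above belong under `prometheus.prometheusSpec` in a values file applied via `helm upgrade`; a sketch (the file name `values-monitoring.yaml` is arbitrary):
```yaml
# values-monitoring.yaml (hypothetical values file for kube-prometheus-stack)
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
```
Apply it with `helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values-monitoring.yaml`.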
Query Optimization
1. Use Recording Rules for frequently used complex queries:
```yaml
- record: node:cpu_utilization:rate5m
  expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
```
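With kube-prometheus-stack, recording rules are delivered the same way as alerting rules, through a PrometheusRule object; a minimal sketch (the `release: prometheus` label assumes the Helm release name used earlier):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-recording-rules
  namespace: monitoring
  labels:
    release: prometheus   # adjust to match your Prometheus rule selector
spec:
  groups:
  - name: node.rules
    rules:
    - record: node:cpu_utilization:rate5m
      expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
```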
2. Limit Query Range and avoid high cardinality metrics:
```promql
# Good: rate over a bounded window
rate(http_requests_total[5m])

# Avoid: selecting the raw, unbounded series
http_requests_total
```
Security Best Practices
1. Enable RBAC with minimal required permissions
2. Use Network Policies to restrict access
3. Implement Authentication for Prometheus UI
4. Encrypt Communication between components
```yaml
# Network policy example: only allow in-cluster access to the Prometheus UI/API.
# Only ingress is restricted; listing Egress without egress rules would block all scrapes.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-netpol
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9090
```
Monitoring Strategy
1. Implement the Four Golden Signals:
- Latency: Response time metrics
- Traffic: Request rate metrics
- Errors: Error rate metrics
- Saturation: Resource utilization metrics
2. Set Up Proper Alerting:
- Create meaningful alert rules
- Avoid alert fatigue
- Implement escalation policies
3. Regular Maintenance:
- Monitor Prometheus itself
- Regular backups of configuration
- Update components regularly
Integration with Grafana
While Prometheus excels at data collection and alerting, Grafana provides superior visualization capabilities.
Install Grafana
```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana (change the admin password for anything beyond a test setup)
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set adminPassword=admin123
```
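To reach the Grafana UI locally (a quick sketch; the service name `grafana` and service port 80 are the chart defaults and will differ if you change the release name or values):
```bash
# Forward the Grafana service to your local machine, then browse http://localhost:3000
kubectl port-forward -n monitoring svc/grafana 3000:80

# If you did not set adminPassword, read the generated password from the secret
kubectl get secret -n monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode; echo
```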
Configure Prometheus Data Source
1. Access Grafana dashboard
2. Navigate to Configuration → Data Sources
3. Add Prometheus data source with URL: `http://prometheus-kube-prometheus-prometheus:9090`
4. Import popular dashboards (IDs: 315, 1860, 6417)
Conclusion and Next Steps
Implementing Prometheus monitoring for Kubernetes provides essential visibility into your cluster's health and performance. This comprehensive setup enables proactive monitoring, efficient troubleshooting, and informed capacity planning decisions.
Key Achievements
By following this guide, you have:
- Successfully deployed Prometheus in your Kubernetes cluster
- Configured comprehensive metric collection from nodes, pods, and applications
- Set up alerting rules for proactive monitoring
- Implemented best practices for security and performance
- Gained practical troubleshooting skills
Recommended Next Steps
1. Expand Monitoring Coverage:
- Add custom application metrics
- Implement distributed tracing with Jaeger
- Monitor external services and dependencies
2. Enhance Alerting:
- Configure Alertmanager for notifications
- Implement alert routing and silencing
- Set up integration with incident management tools
3. Improve Visualization:
- Create custom Grafana dashboards
- Implement SLI/SLO monitoring
- Set up automated reporting
4. Scale and Optimize:
- Implement Prometheus federation for large clusters
- Consider Thanos for long-term storage
- Optimize query performance and resource usage
5. Security Hardening:
- Implement authentication and authorization
- Set up TLS encryption
- Regular security audits and updates
The monitoring foundation you've established forms the cornerstone of reliable Kubernetes operations. Continue building upon this setup to create a comprehensive observability platform that supports your organization's growing containerized infrastructure needs.
Remember that effective monitoring is an iterative process. Regularly review and refine your monitoring strategy based on operational experience, changing requirements, and evolving best practices in the Kubernetes ecosystem.