# How to Monitor Kubernetes with Prometheus on Linux

Monitoring Kubernetes clusters is essential for maintaining application performance, ensuring system reliability, and troubleshooting issues before they impact users. Prometheus, an open-source monitoring and alerting toolkit, has become the de facto standard for Kubernetes monitoring due to its powerful metrics collection capabilities and seamless integration with cloud-native technologies.

This comprehensive guide will walk you through setting up Prometheus to monitor your Kubernetes cluster on Linux, from initial installation to advanced configuration and troubleshooting. Whether you're a DevOps engineer, system administrator, or developer working with containerized applications, this tutorial provides the knowledge needed to implement a robust monitoring solution.

## Prerequisites and Requirements

Before diving into the setup process, ensure you have the following components ready.

### System Requirements

- Linux Distribution: Ubuntu 18.04+, CentOS 7+, or RHEL 7+
- CPU: Minimum 2 cores (4+ recommended for production)
- Memory: 4GB RAM minimum (8GB+ for production environments)
- Storage: 50GB available disk space (more for metric retention)
- Network: Stable internet connection for downloading components

### Required Software

- Kubernetes Cluster: Version 1.18 or later (can be minikube, kubeadm, or a managed service)
- kubectl: Configured to communicate with your cluster
- Helm: Version 3.0+ (recommended for easier deployment)
- Docker: For the container runtime (if not using containerd)

### Access Requirements

- Cluster administrator privileges
- Ability to create namespaces, deployments, and services
- Network access to cluster nodes and pods

### Verification Commands

Before proceeding, verify your environment:

```bash
# Check Kubernetes cluster status
kubectl cluster-info

# Verify node readiness
kubectl get nodes

# Check available resources
kubectl top nodes

# Confirm Helm installation
helm version
```

## Understanding Prometheus Architecture in Kubernetes

Prometheus operates on a pull-based model, periodically scraping metrics from configured targets. In a Kubernetes environment, the architecture typically includes:

### Core Components

1. Prometheus Server: Central component that scrapes and stores metrics
2. Node Exporter: Collects hardware and OS metrics from cluster nodes
3. kube-state-metrics: Exposes cluster-level metrics about Kubernetes objects
4. cAdvisor: Built into the kubelet, provides container resource usage metrics
5. Alertmanager: Handles alerts sent by the Prometheus server

### Service Discovery

Kubernetes service discovery allows Prometheus to automatically discover and monitor:

- Pods with specific annotations
- Services exposing metrics endpoints
- Nodes in the cluster
- API server metrics

## Step-by-Step Installation Guide

### Method 1: Using Helm Charts (Recommended)

Helm provides the most straightforward way to deploy Prometheus with sensible defaults and easy customization options.
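Before installing, it can help to review which chart settings you can override; once the repository from Step 1 below is in place, you can dump the chart's default values and skim them. A minimal sketch (the output filename is just an example):

```bash
# Save the chart's default values for review before installing
helm show values prometheus-community/kube-prometheus-stack > kube-prometheus-stack-values.yaml

# Skim the settings this guide overrides in Step 3: storage class, access mode, and volume size
grep -n -E "storageClassName|accessModes|retention" kube-prometheus-stack-values.yaml | head -n 20
```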
#### Step 1: Add Prometheus Helm Repository

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Update Helm repositories
helm repo update

# Verify repository addition
helm search repo prometheus
```

#### Step 2: Create Monitoring Namespace

```bash
# Create dedicated namespace for monitoring components
kubectl create namespace monitoring

# Verify namespace creation
kubectl get namespaces
```

#### Step 3: Install Prometheus Stack

```bash
# Install kube-prometheus-stack (includes Prometheus, Grafana, and Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=default \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Monitor installation progress
kubectl get pods -n monitoring -w
```

#### Step 4: Verify Installation

```bash
# Check all monitoring components
kubectl get all -n monitoring

# Verify Prometheus server is running
kubectl get pods -n monitoring | grep prometheus-prometheus

# Check services
kubectl get svc -n monitoring
```

### Method 2: Manual YAML Deployment

For more control over the configuration, you can deploy Prometheus using custom YAML manifests.

#### Step 1: Create Service Account and RBAC

```yaml
# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

#### Step 2: Create ConfigMap for Prometheus Configuration

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
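        # The 'pod' role discovers every pod in the cluster; the relabel rules
        # below keep only pods annotated with prometheus.io/scrape: "true" and
        # honor a custom metrics path set via the prometheus.io/path annotation.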
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
```

#### Step 3: Deploy Prometheus Server

```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
  labels:
    app: prometheus-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--storage.tsdb.retention.time=12h'
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: 500m
              memory: 500M
            limits:
              cpu: 1
              memory: 1Gi
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-config
        - name: prometheus-storage-volume
          emptyDir: {}
```

#### Step 4: Apply Configurations

```bash
# Apply all configurations
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-config.yaml
kubectl apply -f prometheus-deployment.yaml

# Create service for Prometheus
kubectl expose deployment prometheus-deployment --port=9090 --target-port=9090 --name=prometheus-service --namespace=monitoring
```

## Configuring Node Exporter

Node Exporter provides detailed metrics about the underlying infrastructure, including CPU, memory, disk, and network statistics.

### Deploy Node Exporter as DaemonSet

```yaml
# node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)'
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
          resources:
            requests:
              memory: 30Mi
              cpu: 100m
            limits:
              memory: 50Mi
              cpu: 200m
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /rootfs
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
      tolerations:
        - effect: NoSchedule
          operator: Exists
```

Apply the Node Exporter configuration:

```bash
kubectl apply -f node-exporter.yaml

# Verify Node Exporter pods are running on all nodes
kubectl get pods -n monitoring -o wide | grep node-exporter
```

## Setting Up kube-state-metrics

kube-state-metrics provides insights into the state of Kubernetes objects like deployments, pods, and services.
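The manifest below deploys kube-state-metrics manually, together with the service account and RBAC rules it requires. As an alternative, the prometheus-community repository added earlier also publishes a chart for it (and kube-prometheus-stack from Method 1 already bundles the component); a sketch, assuming that chart name is current:

```bash
# Alternative to the manifest below: install kube-state-metrics from the community Helm chart
helm install kube-state-metrics prometheus-community/kube-state-metrics --namespace monitoring
```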
```yaml
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.6.0
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 5
            timeoutSeconds: 5
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitoring
```

## Accessing Prometheus Dashboard

### Port Forwarding Method

The quickest way to access Prometheus is through port forwarding:

```bash
# Forward Prometheus port to local machine
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Access Prometheus at http://localhost:9090
```

### LoadBalancer Service Method

For persistent access, create a LoadBalancer service:

```yaml
# prometheus-loadbalancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-loadbalancer
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app.kubernetes.io/name: prometheus
```

### Ingress Method

For production environments, use an Ingress controller:

```yaml
# prometheus-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: prometheus.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-kube-prometheus-prometheus
                port:
                  number: 9090
```

## Essential Queries and Metrics

Once Prometheus is running, you can start exploring metrics using PromQL (Prometheus Query Language).
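The queries below can be run from the Graph tab of the Prometheus UI. They can also be issued against Prometheus's HTTP API, which is handy for scripting and quick checks; a minimal sketch, assuming the port-forward from the previous section is still serving localhost:9090:

```bash
# Instant query: 'up' is 1 for every target whose last scrape succeeded
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'

# A larger expression; -G keeps this a GET request with the query URL-encoded
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'
```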
### Basic System Metrics

```promql
# CPU usage per node
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory usage per node
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage per node
100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) * 100)

# Network I/O per node
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

### Kubernetes-Specific Metrics

```promql
# Pod CPU usage
rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])

# Pod memory usage
container_memory_working_set_bytes{container!="POD",container!=""}

# Number of pods per namespace
count(kube_pod_info) by (namespace)

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

# Node resource allocation
(kube_node_status_allocatable{resource="cpu"} - kube_node_status_capacity{resource="cpu"}) / kube_node_status_capacity{resource="cpu"}
```

### Application Metrics

```promql
# HTTP request rate
rate(http_requests_total[5m])

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

## Configuring Alerting Rules

Create alerting rules to proactively monitor your cluster:

```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[10m]) * 60 * 10 > 0
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Pod is crash looping
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes node not ready
            description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
        - alert: HighCPUUsage
          expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High CPU usage detected
            description: "CPU usage is above 80% on {{ $labels.instance }}"
        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High memory usage detected
            description: "Memory usage is above 85% on {{ $labels.instance }}"
```

Apply the alerting rules:

```bash
kubectl apply -f prometheus-rules.yaml
```

## Common Troubleshooting Issues

### Issue 1: Prometheus Not Scraping Targets

Symptoms: Targets showing as "DOWN" in the Prometheus UI

Solutions:

```bash
# Check service discovery
kubectl get endpoints -n monitoring

# Verify network policies
kubectl get networkpolicies -A

# Check pod logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0

# Test connectivity
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- wget -qO- http://target-service:port/metrics
```

### Issue 2: High Memory Usage

Symptoms: Prometheus pod getting OOMKilled

Solutions:

```bash
# Increase memory limits
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type='merge' -p='{"spec":{"resources":{"limits":{"memory":"4Gi"}}}}'

# Reduce retention time
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type='merge' \
-p='{"spec":{"retention":"7d"}}' Optimize queries and reduce cardinality ``` Issue 3: Storage Issues Symptoms: "no space left on device" errors Solutions: ```bash Check disk usage kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- df -h Increase storage size kubectl patch pvc prometheus-prometheus-kube-prometheus-prometheus-db-prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring -p='{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}' Clean up old data kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- rm -rf /prometheus/01* ``` Issue 4: Service Discovery Problems Symptoms: Missing metrics from certain services Solutions: ```bash Check RBAC permissions kubectl auth can-i get pods --as=system:serviceaccount:monitoring:prometheus Verify annotations on pods kubectl get pods -o yaml | grep -A5 -B5 prometheus.io Check Prometheus configuration kubectl get prometheus prometheus-kube-prometheus-prometheus -n monitoring -o yaml ``` Performance Optimization and Best Practices Resource Management 1. CPU and Memory Sizing: ```yaml resources: requests: cpu: 1000m memory: 2Gi limits: cpu: 2000m memory: 4Gi ``` 2. Storage Configuration: ```yaml storageSpec: volumeClaimTemplate: spec: storageClassName: fast-ssd resources: requests: storage: 100Gi ``` Query Optimization 1. Use Recording Rules for frequently used complex queries: ```yaml - record: node:cpu_utilization:rate5m expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) ``` 2. Limit Query Range and avoid high cardinality metrics: ```promql Good: Limited time range rate(http_requests_total[5m]) Avoid: Unbounded queries http_requests_total ``` Security Best Practices 1. Enable RBAC with minimal required permissions 2. Use Network Policies to restrict access 3. Implement Authentication for Prometheus UI 4. Encrypt Communication between components ```yaml Network policy example apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: prometheus-netpol namespace: monitoring spec: podSelector: matchLabels: app.kubernetes.io/name: prometheus policyTypes: - Ingress - Egress ingress: - from: - namespaceSelector: matchLabels: name: monitoring ports: - protocol: TCP port: 9090 ``` Monitoring Strategy 1. Implement the Four Golden Signals: - Latency: Response time metrics - Traffic: Request rate metrics - Errors: Error rate metrics - Saturation: Resource utilization metrics 2. Set Up Proper Alerting: - Create meaningful alert rules - Avoid alert fatigue - Implement escalation policies 3. Regular Maintenance: - Monitor Prometheus itself - Regular backups of configuration - Update components regularly Integration with Grafana While Prometheus excels at data collection and alerting, Grafana provides superior visualization capabilities. Install Grafana ```bash Add Grafana Helm repository helm repo add grafana https://grafana.github.io/helm-charts Install Grafana helm install grafana grafana/grafana \ --namespace monitoring \ --set persistence.enabled=true \ --set persistence.size=10Gi \ --set adminPassword=admin123 ``` Configure Prometheus Data Source 1. Access Grafana dashboard 2. Navigate to Configuration → Data Sources 3. Add Prometheus data source with URL: `http://prometheus-kube-prometheus-prometheus:9090` 4. Import popular dashboards (IDs: 315, 1860, 6417) Conclusion and Next Steps Implementing Prometheus monitoring for Kubernetes provides essential visibility into your cluster's health and performance. 
This comprehensive setup enables proactive monitoring, efficient troubleshooting, and informed capacity planning decisions.

### Key Achievements

By following this guide, you have:

- Successfully deployed Prometheus in your Kubernetes cluster
- Configured comprehensive metric collection from nodes, pods, and applications
- Set up alerting rules for proactive monitoring
- Implemented best practices for security and performance
- Gained practical troubleshooting skills

### Recommended Next Steps

1. Expand Monitoring Coverage:
   - Add custom application metrics
   - Implement distributed tracing with Jaeger
   - Monitor external services and dependencies
2. Enhance Alerting:
   - Configure Alertmanager for notifications
   - Implement alert routing and silencing
   - Set up integration with incident management tools
3. Improve Visualization:
   - Create custom Grafana dashboards
   - Implement SLI/SLO monitoring
   - Set up automated reporting
4. Scale and Optimize:
   - Implement Prometheus federation for large clusters
   - Consider Thanos for long-term storage
   - Optimize query performance and resource usage
5. Security Hardening:
   - Implement authentication and authorization
   - Set up TLS encryption
   - Perform regular security audits and updates

The monitoring foundation you've established forms the cornerstone of reliable Kubernetes operations. Continue building upon this setup to create a comprehensive observability platform that supports your organization's growing containerized infrastructure needs.

Remember that effective monitoring is an iterative process. Regularly review and refine your monitoring strategy based on operational experience, changing requirements, and evolving best practices in the Kubernetes ecosystem.