How to Monitor Kubernetes Clusters with Prometheus

When your Kubernetes cluster starts serving production traffic, you face a monitoring challenge that traditional tools handle poorly: thousands of ephemeral containers that appear and disappear every minute, each emitting its own metrics, distributed across nodes with no central state. Without proper monitoring, your first indication of a problem is often a user complaint or a complete service outage.

This guide shows you how to implement Prometheus monitoring for Kubernetes clusters from initial setup through production-grade alerting. You will learn the architecture decisions that matter, how to avoid the most common configuration mistakes that lead to metric loss, and how to structure your monitoring to scale with your cluster. The approach covers both managed Kubernetes services and self-hosted clusters.

We will walk through the complete setup process, starting with understanding how Prometheus integrates with Kubernetes' native service discovery, then implementing the monitoring stack, and finally configuring the alerts that catch real problems before they cascade.

Why Prometheus Is Built for Kubernetes Monitoring

Prometheus solves the fundamental problem of monitoring dynamic infrastructure: traditional monitoring systems expect static hosts with predictable IP addresses, but Kubernetes destroys and recreates containers constantly. A deployment rollout can replace every pod of an application within minutes, each with a new IP address and hostname.

Prometheus handles this through service discovery that integrates directly with the Kubernetes API server. Instead of configuring static targets, you define discovery rules that automatically find pods, services, and nodes as they appear. When a new pod starts, Prometheus detects it within seconds and begins scraping metrics. When that pod terminates, Prometheus stops scraping without manual intervention.

The pull-based model matters more than most implementation guides acknowledge. Push-based systems require every pod to know where to send metrics, which creates a configuration distribution problem. With Prometheus, pods simply expose a metrics endpoint, and the centralized Prometheus server discovers and scrapes them. This architectural choice eliminates an entire class of failure modes where monitoring stops working because a configuration update did not reach all containers.

Kubernetes components already expose Prometheus metrics by default. The kubelet, API server, controller manager, and scheduler all provide detailed internal state through Prometheus-formatted endpoints. This native integration means you get deep cluster visibility without installing agents or modifying system components.

Prometheus Architecture in Kubernetes

A production Prometheus setup in Kubernetes consists of several components, and understanding their relationships prevents the common mistake of treating Prometheus as a single binary you deploy and forget.

The Prometheus server is the core component that discovers targets, scrapes metrics, evaluates alerting rules, and stores time-series data. In Kubernetes, you run this as a StatefulSet with persistent storage, not a Deployment, because each Prometheus instance maintains its own time-series database that cannot be shared or replicated.

Service discovery happens through Kubernetes API integration. Prometheus uses the API server to watch for changes to pods, services, endpoints, and nodes. This watch mechanism is efficient because the API server streams change events to Prometheus, so Prometheus does not have to poll for the full object list. The API server becomes a dependency, which means RBAC permissions must be configured correctly: if they are not, discovery errors appear only in the Prometheus logs and the affected targets never show up.

The metrics endpoint pattern requires each application to expose metrics at a standard HTTP path, typically /metrics. The endpoint returns text in Prometheus exposition format: metric names with labels and values. This design tolerates failures well: a missed scrape leaves a gap of one interval, and because counters are cumulative, the next successful scrape still captures the running total. Push-based systems, by contrast, can lose a measurement permanently when a delivery fails.
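To make the exposition format concrete, here is a minimal sketch that renders one counter family as text. The helper name renderCounter and its argument shape are hypothetical, chosen only for illustration, and the sketch skips the escaping of quotes, backslashes, and newlines that a real client library performs on label values.

```javascript
// Sketch: render one counter family in Prometheus text exposition format.
// renderCounter and its argument shape are hypothetical names for illustration.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    // Labels are rendered as key="value" pairs inside braces.
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const body = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', status_code: '200' }, value: 1027 },
  { labels: { method: 'POST', status_code: '500' }, value: 3 },
]);
console.log(body);
```

In practice a client library produces this text for you; the point is only that each line is one time-series identified by its metric name and label set.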

Persistent storage determines how long you can query historical metrics. Prometheus stores data in a local time-series database optimized for append-heavy time-series workloads. For production clusters, you need to provision persistent volumes that survive pod restarts and can handle Prometheus' write-heavy workload. The storage volume size depends on your retention period and metric cardinality; as a rough reference point, a mid-sized production cluster can generate 5-10 GB per day.
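The relationship between retention, cardinality, and disk size can be sketched with the rough capacity formula from the Prometheus storage documentation: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (about 1-2 bytes after compression). The function name and the input numbers below are assumptions for illustration, not measurements from a real cluster.

```javascript
// Sketch: rough disk estimate for Prometheus local storage.
// estimateDiskBytes and the example inputs are hypothetical.
function estimateDiskBytes({ activeSeries, scrapeIntervalSeconds, retentionDays, bytesPerSample }) {
  // Every active series contributes one sample per scrape interval.
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

const bytes = estimateDiskBytes({
  activeSeries: 600000,      // assumed series count
  scrapeIntervalSeconds: 30,
  retentionDays: 15,
  bytesPerSample: 2,         // upper end of the 1-2 byte range
});
console.log(`${(bytes / 1e9).toFixed(2)} GB`);
```

Treat the result as a starting point and compare it against actual disk usage once the cluster has been running for a few days.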

Key Insight: Prometheus is not designed for long-term metric storage beyond a few weeks. The local time-series database trades storage efficiency for query performance. For long-term retention, you need a separate system like Thanos or Cortex that can downsample and archive data. Most monitoring problems occur in the recent past, so the typical pattern is 15 days of full-resolution data in Prometheus, with downsampled data in long-term storage.

Installing Prometheus with the Operator Pattern

The Prometheus Operator is the standard way to deploy Prometheus in Kubernetes because it handles the complexity of dynamic configuration that breaks traditional deployment approaches. Without the operator, you would need to edit the Prometheus configuration file and trigger a reload whenever a new service needs a scrape job that the existing discovery rules do not cover, which becomes error-prone when many teams deploy independently.

The operator introduces custom resource definitions that let you configure monitoring declaratively. Instead of editing Prometheus config files, you create ServiceMonitor resources that define what to scrape. The operator watches these resources and automatically updates the running Prometheus configuration without restarts.

Install the operator using the kube-prometheus-stack Helm chart, which includes Prometheus, the operator, Grafana, and pre-configured alerts for common Kubernetes issues:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

This installation creates several critical resources. The Prometheus StatefulSet runs the server with persistent storage. ServiceMonitor resources define scrape targets for Kubernetes components. PrometheusRule resources contain alerting rules. The operator continuously reconciles these resources to maintain the desired state.

The namespace isolation matters for production deployments. Running Prometheus in a dedicated monitoring namespace separates it from application workloads and allows you to apply different resource limits and network policies. This isolation makes it less likely that an incident in an application namespace also takes down the monitoring system that needs to observe it.

Understanding Operator Custom Resources

The Prometheus resource defines a Prometheus server instance. You can run multiple Prometheus instances with different configurations, which is common in large clusters where you separate application metrics from infrastructure metrics:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 2
  retention: 15d
  resources:
    requests:
      memory: 4Gi
      cpu: 2
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
  serviceMonitorSelector:
    matchLabels:
      prometheus: main

The ServiceMonitor resource tells Prometheus which services to scrape. It uses label selectors to find services and defines how to extract metrics from them. This indirection is powerful: when you deploy a new application with the matching label, Prometheus automatically starts monitoring it:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    prometheus: main
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

This configuration scrapes any service with the label app: my-application every 30 seconds on the port named "metrics". The service definition must include a port with name: metrics for this to work, which is a common source of configuration errors.
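The two conditions described above (labels must match exactly, and the port must be found by name) can be sketched as plain functions. This is an illustration of the matching logic only, not the operator's code; the function names and the simplified object shapes are assumptions.

```javascript
// Sketch: which Service ports would a ServiceMonitor select?
// matchesLabels, selectedPorts, and the object shapes are hypothetical.
function matchesLabels(matchLabels, labels) {
  // Every selector label must be present with exactly the same value.
  return Object.entries(matchLabels).every(([key, value]) => labels[key] === value);
}

function selectedPorts(serviceMonitor, service) {
  if (!matchesLabels(serviceMonitor.selector.matchLabels, service.labels)) {
    return [];
  }
  // Endpoints refer to Service ports by name, not by number.
  const wanted = new Set(serviceMonitor.endpoints.map((endpoint) => endpoint.port));
  return service.ports.filter((port) => wanted.has(port.name)).map((port) => port.port);
}

const monitor = {
  selector: { matchLabels: { app: 'my-application' } },
  endpoints: [{ port: 'metrics', interval: '30s', path: '/metrics' }],
};
const service = {
  labels: { app: 'my-application' },
  ports: [{ name: 'http', port: 8080 }, { name: 'metrics', port: 9090 }],
};
console.log(selectedPorts(monitor, service));
```

If the Service port were unnamed or named differently, the function would return an empty list, which mirrors the "target never appears" symptom.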

Configuring Service Discovery for Different Target Types

Kubernetes exposes several different resource types as potential monitoring targets, and each requires a different discovery approach. The kube-prometheus-stack includes ServiceMonitors for Kubernetes components, but you need to understand how they work to monitor your own applications.

Monitoring Pods Directly

The PodMonitor resource discovers and scrapes pods based on labels, regardless of whether they are behind a service. This is useful for DaemonSets, batch workloads, and other pods that have no Service in front of them:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  podMetricsEndpoints:
  - port: metrics
    interval: 30s

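Both PodMonitor and ServiceMonitor produce one scrape target per pod: a ServiceMonitor resolves the Service to its endpoints and scrapes each backing pod directly rather than going through the Service's load-balanced address. Every pod replacement therefore changes the target list and creates new time-series, which is fine for stable workloads but adds series churn in rapidly scaling deployments. Prefer a ServiceMonitor when a Service already exists, and a PodMonitor when it does not.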

Monitoring Kubernetes Components

The control plane components expose metrics but require different discovery methods depending on how your cluster is deployed. In managed Kubernetes services like GKE or EKS, you often cannot access control plane metrics because the provider runs those components outside your cluster.

For self-hosted clusters, the kubelet metrics require special handling. The kubelet exposes metrics on port 10250 with authentication, and the kube-prometheus-stack includes a ServiceMonitor that handles the TLS and authentication setup:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  selector:
    matchLabels:
      k8s-app: kubelet

The cAdvisor metrics that provide container-level CPU and memory usage come from the kubelet's /metrics/cadvisor endpoint. This is a separate endpoint from the kubelet's own metrics, and you need a distinct ServiceMonitor path configuration to collect both.

Warning: The insecureSkipVerify setting bypasses TLS certificate validation. It is commonly needed because kubelet serving certificates are often self-signed rather than issued by the cluster CA, and when it is set the caFile is effectively not used for verification. This is generally tolerated for cluster-internal scraping, but you should never expose these metrics endpoints outside the cluster, and you should remove the setting if your cluster issues CA-signed kubelet certificates. The bearerTokenFile provides authentication through the service account that Prometheus runs under.

Instrumenting Applications for Prometheus

Exposing metrics from your application requires a client library and an understanding of metric types. Prometheus defines four metric types, each suited for different measurement scenarios.

Counters only increase and are used for cumulative values like request counts or error counts. The Prometheus query language includes functions like rate() that calculate per-second rates from counter values, handling resets automatically. A common mistake is using a counter for a value that can decrease, which produces nonsensical rate calculations.

Gauges represent values that can go up or down, like current memory usage or queue depth. You set a gauge to the current value each time you observe it. Unlike counters, gauges have no special reset handling, so if a pod restarts and the gauge returns to zero, that zero value is meaningful.

Histograms track distributions of values, typically request durations or response sizes. They pre-calculate buckets so Prometheus can approximate percentiles without storing every individual measurement. The bucket boundaries matter significantly: too few buckets lose precision, too many buckets increase cardinality and storage costs.

Summaries are similar to histograms but calculate percentiles on the client side. This shifts CPU cost from query time to collection time and means you cannot aggregate summaries across multiple instances. Histograms are almost always the better choice for server-side applications.
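The reset handling that makes counters safe across pod restarts can be sketched in a few lines. This is a simplified illustration, not Prometheus' implementation: the real rate() also extrapolates to the edges of the query window, which is omitted here. The function names and sample values are hypothetical.

```javascript
// Sketch: counter increase with reset handling (no window extrapolation).
// counterIncrease and simpleRate are hypothetical helper names.
function counterIncrease(values) {
  let increase = 0;
  for (let i = 1; i < values.length; i++) {
    const previous = values[i - 1];
    const current = values[i];
    // A drop means the counter restarted from zero (for example after a pod
    // restart), so the whole current value counts as new growth.
    increase += current >= previous ? current - previous : current;
  }
  return increase;
}

function simpleRate(samples) {
  // samples: [{ t: secondsTimestamp, v: counterValue }, ...] in time order
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return counterIncrease(samples.map((sample) => sample.v)) / seconds;
}

const samples = [
  { t: 0, v: 100 },
  { t: 30, v: 110 },
  { t: 60, v: 5 },   // counter reset between these two scrapes
  { t: 90, v: 15 },
];
console.log(counterIncrease(samples.map((sample) => sample.v)), simpleRate(samples));
```

Applying the same logic to a gauge would misread every legitimate decrease as a reset, which is why the metric type matters.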

Here is a minimal Node.js application with Prometheus metrics using the prom-client library:

const express = require('express');
const client = require('prom-client');

const app = express();

// Create a Registry to register metrics
const register = new client.Registry();

// Add default metrics (process CPU, memory, etc.)
client.collectDefaultMetrics({ register });

// Create custom metrics and attach them to the custom registry
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.001, 0.005, 0.015, 0.05, 0.1, 0.5, 1, 5],
  registers: [register]
});

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label with the matched route pattern, never the raw path: raw paths
    // contain IDs and would create a new time-series for every distinct URL.
    const route = req.route?.path || 'unmatched';
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestsTotal.labels(req.method, route, res.statusCode).inc();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

The histogram buckets are tuned for a typical web application where most requests complete in under 50ms but some take several seconds. If your application has different performance characteristics, adjust the buckets accordingly. The default buckets in most Prometheus client libraries span roughly 5 ms to 10 s, so they offer little resolution for APIs that respond in a few milliseconds.

Label Cardinality and Its Cost

Labels allow you to slice metrics by dimensions like HTTP method, status code, or user ID. Each unique combination of label values creates a separate time-series in Prometheus' database. This is where most production Prometheus deployments encounter problems.

Including a user ID as a label creates a time-series for every user. A million users means a million time-series per metric. Prometheus can handle high cardinality, but query performance degrades and storage costs increase dramatically. The guideline is to keep unique label combinations under 10,000 per metric for good performance.

The safe labels are those with bounded cardinality: HTTP method has about 10 values, status code has dozens, service name is bounded by your number of services. Dangerous labels are unbounded: user IDs, request IDs, email addresses, IP addresses. If you need to track per-user metrics, aggregate them in your application and expose summary statistics, not individual user metrics.
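The multiplication behind cardinality is easy to underestimate, so here is a small sketch that computes the worst-case series count for one metric from the number of distinct values per label. The function name and the value counts are hypothetical; the real count is the number of combinations actually observed, which is at most this product.

```javascript
// Sketch: upper bound on time-series for one metric.
// worstCaseSeries and the example counts are hypothetical.
function worstCaseSeries(distinctValuesPerLabel) {
  // Each label multiplies the number of possible label combinations.
  return Object.values(distinctValuesPerLabel).reduce((product, count) => product * count, 1);
}

const bounded = worstCaseSeries({ method: 5, route: 20, status_code: 10 });
const withUserId = worstCaseSeries({ method: 5, route: 20, status_code: 10, user_id: 1000000 });
console.log(bounded, withUserId);
```

For a histogram, multiply again by the number of buckets plus the sum and count series, since every bucket is its own time-series.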

Pro Tip: The promtool command-line utility can catch problems before they hit production. Pipe the output of your /metrics endpoint into promtool check metrics to lint metric names and format, and run promtool tsdb analyze against a Prometheus data directory to list the metrics and labels with the highest cardinality. In production, monitor the prometheus_tsdb_head_series metric, which counts active time-series. A sudden increase usually indicates that a new high-cardinality metric or label was introduced.

Setting Up Critical Alerts

Prometheus alerting separates detection from notification. Prometheus evaluates alerting rules and determines when alerts should fire, but it does not send notifications. The Alertmanager component handles routing alerts to notification channels and managing silences and inhibitions.

PrometheusRule resources define alerting rules in PromQL, Prometheus' query language. The operator loads these rules into Prometheus automatically:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
spec:
  groups:
  - name: application
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
        / sum(rate(http_requests_total[5m])) by (service)
        > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.service }}"
        description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate"

This rule fires when any service has more than 5% of requests returning 5xx errors for five consecutive minutes. The for clause prevents alerting on transient spikes. The five-minute window balances between catching real problems quickly and ignoring brief issues that resolve themselves.

The error rate calculation divides 5xx requests by total requests over the same time window. This produces a ratio between 0 and 1. The rate() function is essential: it converts the counter metric into a per-second rate and handles counter resets that happen during pod restarts.
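The effect of the for clause can be sketched as a small state machine: the alert is pending while the expression is true, fires once it has been true continuously for the configured duration, and resets as soon as one evaluation is false. This is a simplified illustration with hypothetical names, not Prometheus' rule evaluator.

```javascript
// Sketch: alert state per evaluation given a `for` duration.
// alertStates is a hypothetical helper; conditionResults holds the boolean
// outcome of the alert expression at each evaluation.
function alertStates(conditionResults, evalIntervalSeconds, forSeconds) {
  let activeSince = null;
  return conditionResults.map((isTrue, index) => {
    const now = index * evalIntervalSeconds;
    if (!isTrue) {
      activeSince = null; // one false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) {
      activeSince = now;
    }
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Evaluated every 60s with a 120s hold: the blip at index 2 restarts the clock.
const states = alertStates([true, true, false, true, true, true], 60, 120);
console.log(states);
```

The same mechanism is what keeps a brief error spike during a rollout from paging anyone.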

Pod and Node Level Alerts

Kubernetes-specific alerts need to account for the expected behavior of the platform. Pods restart during deployments, which is normal, but pods crash-looping indicates a problem:

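- alert: PodCrashLooping
  # increase() counts restarts over the window; rate() would be per second
  # and a threshold of 0.1 per second would almost never be reached.
  expr: |
    increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
    description: "Container has restarted {{ $value | humanize }} times in the last 15 minutes"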

Node resource exhaustion requires alerts before the node runs out of resources completely. By the time a node is at 100% memory, it is already evicting pods:

- alert: NodeMemoryPressure
  expr: |
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} is under memory pressure"
    description: "Node has {{ $value | humanizePercentage }} memory usage"

The threshold of 85% provides time to investigate before the node reaches the kernel's out-of-memory killer threshold. Different workloads require different thresholds: batch processing might tolerate 95% memory usage, while latency-sensitive applications might need alerting at 70%.

Alertmanager Configuration

The Alertmanager receives alerts from Prometheus and routes them to notification channels. The routing configuration supports grouping related alerts together and suppressing redundant notifications:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 4h

      routes:
      - matchers:
        - severity="critical"
        receiver: 'pagerduty'
      - matchers:
        - severity="warning"
        receiver: 'slack'

    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

    - name: 'slack'
      slack_configs:
      - channel: '#warnings'

The group_by setting bunches alerts with the same name and cluster into a single notification. Without grouping, a cluster-wide issue might trigger hundreds of individual alerts as each pod fails. The group_wait delays the first notification to collect related alerts, while group_interval controls how often to send updates about ongoing alert groups.
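What group_by does to a burst of alerts can be sketched as a simple bucketing step. The function name and alert objects below are hypothetical and only illustrate the grouping key; Alertmanager's timing behavior (group_wait, group_interval) is not modeled.

```javascript
// Sketch: bucket alerts by the values of the group_by labels.
// groupAlerts and the alert objects are hypothetical.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    // Alerts sharing the same values for every group_by label share a key.
    const key = groupBy.map((label) => `${label}=${alert.labels[label] ?? ''}`).join(',');
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(alert);
  }
  return groups;
}

const groups = groupAlerts(
  [
    { labels: { alertname: 'PodCrashLooping', cluster: 'prod', pod: 'api-1' } },
    { labels: { alertname: 'PodCrashLooping', cluster: 'prod', pod: 'api-2' } },
    { labels: { alertname: 'HighErrorRate', cluster: 'prod', service: 'api' } },
  ],
  ['alertname', 'cluster']
);
console.log([...groups.keys()]);
```

Each map entry corresponds to one notification, so the two crash-looping pods arrive together instead of as separate messages.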

Best Practice: Start with a small set of high-signal alerts and expand based on actual incidents. Most monitoring setups suffer from too many alerts, not too few. Every alert should be actionable: if the correct response is "wait and see," it should not page anyone. Use warning severity for alerts that need investigation during business hours and critical severity only for issues that require immediate response.

Query Patterns for Kubernetes Metrics

PromQL queries combine metrics with functions to answer operational questions. Understanding the common patterns helps you both query existing data and design new metrics.

Aggregating Across Pods

Most Kubernetes deployments run multiple replicas of each service. To see total request rate across all pods, you aggregate by service:

sum(rate(http_requests_total[5m])) by (service)

This sums the per-second request rate across all pods for each service. The by (service) clause groups results by service name, producing one time-series per service. Without the grouping, you would get a single total across all services, which is rarely useful.

To see per-pod rates, use by (pod, service) instead. This shows which pods handle more traffic, which is useful for investigating load balancing issues or identifying hot pods that are resource-constrained.

Resource Usage Patterns

Container memory usage requires joining metrics from multiple sources. The actual usage comes from cAdvisor, but the limit comes from the pod spec:

container_memory_working_set_bytes / on(namespace, pod, container)
group_left kube_pod_container_resource_limits{resource="memory"}

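This divides current memory usage by the configured limit to get a utilization ratio. The on() clause specifies which labels to join on, and group_left allows many-to-one matching, so several usage series can match a single limit series. Containers without a memory limit have no series on the right-hand side and drop out of the result, so this query reports only containers that have limits set.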
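The join can be pictured as a lookup keyed on the on() labels. The sketch below uses hypothetical function names and sample data; it shows that a usage series is divided by the limit series with the same namespace, pod, and container, and that a usage series with no matching limit produces no output, which is how PromQL vector matching behaves for unmatched elements.

```javascript
// Sketch: usage / limit joined on (namespace, pod, container).
// memoryUtilization and the sample data are hypothetical.
function memoryUtilization(usageSamples, limitSamples) {
  const keyOf = ({ namespace, pod, container }) => `${namespace}/${pod}/${container}`;
  const limits = new Map(limitSamples.map((sample) => [keyOf(sample.labels), sample.value]));
  const result = [];
  for (const sample of usageSamples) {
    const limit = limits.get(keyOf(sample.labels));
    if (limit === undefined) {
      continue; // no limit series to match: the element is dropped
    }
    result.push({ labels: sample.labels, value: sample.value / limit });
  }
  return result;
}

const utilization = memoryUtilization(
  [
    { labels: { namespace: 'shop', pod: 'api-1', container: 'app' }, value: 256 },
    { labels: { namespace: 'shop', pod: 'api-1', container: 'sidecar' }, value: 64 },
  ],
  [{ labels: { namespace: 'shop', pod: 'api-1', container: 'app' }, value: 512 }]
);
console.log(utilization);
```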

CPU usage needs rate() because the metric is a counter tracking cumulative CPU seconds:

sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

The result is in CPU cores, where 1.0 means one full core. A value of 0.5 means the pod is using half a core on average over the last five minutes.

Percentile Calculations from Histograms

The histogram_quantile() function approximates percentiles from histogram metrics. For request duration, the p95 latency is:

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

This groups by the le label, which represents histogram bucket boundaries, and by service. The function interpolates between buckets to estimate the 95th percentile. The accuracy depends on how well your bucket boundaries align with actual latencies: if most requests fall in a single wide bucket, the percentile estimate is imprecise.
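The interpolation step can be sketched as follows. This is a simplified version of the idea, with a hypothetical function name and made-up bucket counts: it assumes cumulative buckets sorted by upper bound, a final +Inf bucket, a quantile strictly between 0 and 1, and a lower bound of 0 for the first bucket.

```javascript
// Sketch: estimate a quantile from cumulative histogram buckets by linear
// interpolation. histogramQuantile here is a hypothetical, simplified helper.
function histogramQuantile(q, buckets) {
  // buckets: [{ le: upperBound, count: cumulativeCount }, ...], last le = Infinity
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const index = buckets.findIndex((bucket) => bucket.count >= rank);
  if (buckets[index].le === Infinity) {
    // The quantile lies beyond the last finite bound; report that bound.
    return buckets[buckets.length - 2].le;
  }
  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const countBelow = index === 0 ? 0 : buckets[index - 1].count;
  const countInBucket = buckets[index].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[index].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets));
```

Here the 95th percentile lands in the 0.5-1 second bucket, and the estimate depends entirely on the assumption that the ten requests in that bucket are evenly spread, which is why narrower buckets around your latency target give better estimates.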

Scaling Prometheus for Large Clusters

A single Prometheus instance handles thousands of targets and millions of time-series, but large Kubernetes clusters eventually exceed what one instance can scrape. The scaling problems appear as gaps in metrics, slow queries, and Prometheus falling behind on scraping.

Horizontal Sharding

Running multiple Prometheus instances with different responsibilities is the standard scaling approach. You can shard by namespace, by metric type, or by target type. A common pattern is one Prometheus for infrastructure metrics and another for application metrics:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: apps
  namespace: monitoring
spec:
  serviceMonitorNamespaceSelector:
    matchExpressions:
    - key: monitoring
      operator: NotIn
      values:
      - infrastructure
  serviceMonitorSelector: {}

This Prometheus instance selects ServiceMonitors only from namespaces that do not carry the label monitoring: infrastructure. A second Prometheus instance uses the opposite selector (operator: In) to monitor only infrastructure.

The tradeoff is that queries cannot span both instances. If you need to correlate application and infrastructure metrics, you need a global query layer like Thanos or Cortex that can federate queries across multiple Prometheus instances.

Remote Write for Long-term Storage

Prometheus' local storage is efficient for recent data but was not designed for years of retention. The remote write protocol sends metrics to a separate storage system optimized for long-term retention:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
spec:
  remoteWrite:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queueConfig:
      maxSamplesPerSend: 1000
      maxShards: 200

Thanos Receive ingests the metrics and stores them in object storage like S3. You can then query years of data through Thanos Query, which combines recent data from Prometheus with historical data from object storage.

The queueConfig tuning matters for large-scale deployments. Prometheus reads samples from its write-ahead log and buffers them per shard before sending. If the remote endpoint cannot accept data as fast as it is produced, remote write falls behind, and samples that age out of the write-ahead log before they are sent are lost. The maxShards setting controls parallelism: more shards increase throughput but also increase load on the remote endpoint.

Critical: Remote write is eventually consistent. There is a delay between when Prometheus scrapes a metric and when it appears in remote storage, and if Prometheus crashes, some metrics might never make it to remote storage. For critical alerts, always evaluate rules against local Prometheus data, not remote storage. Use remote storage for dashboards and historical analysis, not real-time alerting.

Troubleshooting Common Issues

Most Prometheus problems fall into a few categories that produce similar symptoms but require different fixes.

Missing Metrics from Specific Targets

When metrics from a particular service are not appearing, the first check is whether Prometheus discovered the target. The Prometheus UI at /targets shows all discovered targets and their scrape status. If a target is missing entirely, the problem is service discovery configuration. If the target appears but scraping fails, the problem is connectivity or authentication.

Common service discovery failures include label selector mismatches between the ServiceMonitor and the Service, and the service port name not matching the ServiceMonitor port name. Kubernetes service discovery is case-sensitive and exact-match only: a service with app: myapp will not match a ServiceMonitor selecting app: my-app.

Scrape failures show the error in the UI. Certificate errors mean TLS configuration is wrong. Timeout errors mean the target did not respond within the scrape timeout, which defaults to 10 seconds and cannot exceed the scrape interval. Connection refused means the port is wrong or the metrics endpoint is not listening.

Metrics Exist But Queries Return No Data

When the target is scraped successfully but queries return empty results, the issue is usually the query itself. Check the metric name first: a typo or a missing suffix such as _total or _bucket matches nothing. A related mistake that produces confusing rather than empty graphs is querying a counter without rate(): a counter shows cumulative values that constantly increase, and graphing it directly produces a climbing line that is rarely useful. Wrap it in rate() or increase() to see the rate of change.

Label mismatches also cause empty queries. If you query for job="myapp" but the actual label value is job="myapp-metrics", you get no results. Use the expression browser to explore available label values: type the metric name and look at the label values that appear.

Prometheus Running Out of Memory

Prometheus' memory usage correlates with the number of active time-series. The prometheus_tsdb_head_series metric shows current series count. As a rough guideline, Prometheus uses about 1-3 KB per series in memory, so 1 million series requires 1-3 GB of RAM plus overhead.
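The rule of thumb above translates into a quick range estimate. The function name is hypothetical and the per-series figures are the rough 1-3 KB guideline from the text, not measured values; overhead for queries and the write-ahead log comes on top.

```javascript
// Sketch: memory range for the in-memory head block from the series count.
// estimateHeadMemoryGB is a hypothetical helper; 1 KB is taken as 1000 bytes.
function estimateHeadMemoryGB(activeSeries, lowKBPerSeries = 1, highKBPerSeries = 3) {
  const toGB = (kilobytesPerSeries) => (activeSeries * kilobytesPerSeries * 1000) / 1e9;
  return { low: toGB(lowKBPerSeries), high: toGB(highKBPerSeries) };
}

const estimate = estimateHeadMemoryGB(1000000);
console.log(`${estimate.low}-${estimate.high} GB before overhead`);
```

Feeding in the current value of prometheus_tsdb_head_series gives a sanity check against the memory limit set on the Prometheus pod.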

Sudden memory increases indicate high-cardinality metrics were introduced. Check which metrics have the most series:

topk(10, count by (__name__)({__name__=~".+"}))

This query counts series by metric name and returns the top 10. If a single metric has hundreds of thousands of series, it likely has unbounded label cardinality. Find and fix the label, or drop that metric if it is not essential.

The prometheus_tsdb_head_samples_appended_total metric is a counter of ingested samples, so rate(prometheus_tsdb_head_samples_appended_total[5m]) shows how many samples Prometheus ingests per second. High ingest rates require more memory for buffering. If you are scraping too frequently or scraping metrics you do not need, increase the scrape interval or drop unused metrics with metric_relabel_configs (the metricRelabelings field on a ServiceMonitor endpoint).

Security Considerations

Prometheus metrics often contain sensitive information about your infrastructure and application behavior. The /metrics endpoint should never be publicly accessible, but even internal access requires thought.

The metrics themselves can leak information: request counts reveal traffic patterns, error rates expose failure modes, and custom metrics might include user IDs or transaction amounts. Label values are particularly risky because developers sometimes include sensitive data in labels without realizing those labels are stored and queryable.

Prometheus' own query API has no authentication by default. Anyone who can reach the Prometheus pod can execute any query and see all metrics. In multi-tenant clusters, this is a data leak: one tenant can query another tenant's metrics. The standard solution is running Prometheus in a separate namespace with network policies that restrict access, and using a reverse proxy with authentication in front of the query API.

The kube-prometheus-stack creates RBAC roles that give Prometheus access to the Kubernetes API for service discovery. These permissions are quite broad: Prometheus can list pods, services, and endpoints across all namespaces. This is necessary for discovery but means a compromised Prometheus pod has significant cluster access. Use Pod Security Standards to limit what the Prometheus container can do if compromised.

FAQ

What is the difference between Prometheus and Grafana for Kubernetes monitoring?

Prometheus collects, stores, and queries metrics, while Grafana visualizes them. Prometheus is the data source that scrapes metrics from your applications and Kubernetes components. Grafana connects to Prometheus and provides dashboards for viewing that data. You need both: Prometheus for the monitoring backend and Grafana for the visualization layer. The kube-prometheus-stack includes both, pre-configured to work together.

How do I monitor a Kubernetes cluster running in AWS EKS or Google GKE?

Managed Kubernetes services expose node and pod metrics but typically do not expose control plane metrics because the control plane runs outside your cluster. Install Prometheus using the kube-prometheus-stack in your cluster the same way you would on self-hosted Kubernetes. You will get metrics for your applications, pods, nodes, and the kubelet, but not for the API server, scheduler, or controller manager unless the cloud provider exposes them.

Can Prometheus monitor multiple Kubernetes clusters from one instance?

A single Prometheus instance can scrape targets in multiple clusters if it has network access and API credentials for each cluster, but this approach does not scale well. The better pattern is running Prometheus in each cluster and using Thanos or Cortex to provide a unified query interface across all clusters. This keeps scraping local to each cluster and centralizes only querying, which is more reliable and performs better.

How much storage does Prometheus need for a typical Kubernetes cluster?

Storage requirements depend on the number of metrics and retention period. As a rough reference point, a cluster with 50 nodes and 500 pods can generate 5-10 GB of metrics per day. With a 15-day retention period that is 75-150 GB, so provision toward the upper end of that range or beyond to allow for growth. Monitor the prometheus_tsdb_storage_blocks_bytes metric to track actual usage. SSDs are strongly recommended because Prometheus' workload is write-heavy with random reads.

Should I use the Prometheus Operator or install Prometheus directly?

Use the Prometheus Operator for Kubernetes deployments. Installing Prometheus directly works for static environments, but Kubernetes is dynamic: services come and go, and manually updating Prometheus configuration for each change is not practical. The operator watches ServiceMonitor resources and automatically updates Prometheus configuration when you deploy new services, which fits Kubernetes' declarative model.

How do I reduce Prometheus memory usage without losing important metrics?

First, identify high-cardinality metrics using the query in the troubleshooting section. Drop metrics you do not use with metricRelabelings in your ServiceMonitor, which is the operator field that maps to metric_relabel_configs in the generated Prometheus configuration. Increase scrape intervals from 30s to 60s for metrics that do not need high resolution. If you are monitoring many applications, consider running separate Prometheus instances for different teams or metric types. As a last resort, reduce retention from 15 days to 7 days, but ensure you have remote write configured for long-term storage.

What is the scrape interval and how should I configure it?

The scrape interval determines how often Prometheus collects metrics from each target. Prometheus itself defaults to 60 seconds, while the Prometheus Operator defaults to 30 seconds, which works for most applications. Decrease to 15 seconds for metrics where you need to detect problems faster, like critical API latency. Increase to 60 seconds or more for infrastructure metrics that change slowly, like disk usage. Shorter intervals increase Prometheus' resource usage and produce more data points, but improve alerting response time and query accuracy for short-lived spikes.
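The cost side of that trade-off is simple arithmetic: every series yields one sample per scrape, so halving the interval doubles ingestion. The helper names below are hypothetical and exist only to make the comparison concrete.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(scrape_interval_s):
    """Samples stored per series per day at a given scrape interval."""
    return SECONDS_PER_DAY / scrape_interval_s


def relative_ingest(scrape_interval_s, baseline_s=30):
    """Ingestion volume relative to a baseline interval (30s by default)."""
    return baseline_s / scrape_interval_s


for interval in (15, 30, 60):
    print(f"{interval:>2}s: {samples_per_day(interval):6.0f} samples/day per series, "
          f"{relative_ingest(interval):.1f}x the 30s volume")
```

This only models sample volume. Memory usage is driven mainly by the number of active series, so changing the interval is not a substitute for fixing label cardinality.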

How do I handle Prometheus alerts for deployments and rolling updates?

During deployments, pods restart and briefly produce errors, which can trigger false alerts. Use the for clause in alerting rules to require the condition to persist for several minutes before firing. This filters out transient issues during deployments. For more sophisticated handling, use Alertmanager silences during planned maintenance windows, or create separate alerts for sustained issues versus deployment-related blips.
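To see why the for clause filters out deployment noise, it helps to model how an alert moves between states. The following is a simplified simulation, not Prometheus code: it assumes the rule is evaluated at a fixed interval and that the alert resets as soon as the condition is false at any evaluation. The function name and the example timings are illustrative assumptions.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation, True when the
    alert expression returned a result at that evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires only once the condition has held for the full duration.
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# A two-minute blip during a rollout never fires with a five-minute hold.
print(alert_states([True, True, False], eval_interval_s=60, for_s=300))
# A sustained problem fires once five minutes have passed.
print(alert_states([True] * 7, eval_interval_s=60, for_s=300))
```

The cost of a longer hold is slower notification: with a five-minute for clause, a real outage pages you at least five minutes after it starts, so pick the duration per alert rather than applying one value everywhere.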

Can I use Prometheus for application performance monitoring or just infrastructure?

Prometheus excels at both infrastructure and application metrics. Instrument your application code to expose business metrics like checkout completion rate, authentication success rate, or feature usage counts alongside technical metrics like request latency. The same Prometheus instance can monitor infrastructure health and application performance. The distinction matters for alert routing: infrastructure issues go to the platform team, while application metric alerts go to the dev team.
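A business metric looks no different from a technical one when Prometheus scrapes it. The sketch below uses only the Python standard library to show the text exposition format for a counter. In a real service you would normally use an official Prometheus client library instead. The Counter class, the metric name checkout_completed_total, and the payment_method label are all hypothetical, and label-value escaping is deliberately left out.

```python
from collections import defaultdict


class Counter:
    """Minimal illustrative counter that renders Prometheus text format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # label set -> running total

    def inc(self, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same key.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Assumption: label values contain no quotes, backslashes, or newlines.
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


checkouts = Counter("checkout_completed_total", "Completed checkouts")
checkouts.inc(payment_method="card")
checkouts.inc(payment_method="card")
checkouts.inc(payment_method="invoice")
print(checkouts.render(), end="")
```

Serving this text from a /metrics endpoint is all a scrape target needs to do. Keep label values to a small, bounded set (payment method, not customer ID) so business metrics do not become a cardinality problem.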

What is the difference between Prometheus metrics and logs?

Metrics are numerical measurements sampled over time, like request count or memory usage. Logs are event records with details about individual occurrences. Prometheus handles metrics but not logs. For logs, use a system like ELK or Loki. Metrics are better for monitoring system health and triggering alerts because they aggregate data. Logs are better for debugging specific issues because they preserve details. A complete observability setup includes both: metrics for monitoring and alerting, logs for troubleshooting.

Conclusion

Monitoring Kubernetes with Prometheus requires understanding both systems' architecture: how Kubernetes' dynamic service discovery integrates with Prometheus' pull-based scraping model, and how to structure metrics and alerts that work with ephemeral infrastructure. The combination of the Prometheus Operator for configuration management, properly instrumented applications exposing metrics, and carefully tuned alerts provides visibility into cluster health and application performance.

Start with the kube-prometheus-stack to get immediate visibility into cluster-level metrics, then add application instrumentation to expose business and technical metrics specific to your services. Focus on alerts that indicate real problems requiring action rather than generating noise. As your cluster scales, plan for horizontal sharding or remote write to handle growing metric volume.

The monitoring system itself requires monitoring: track Prometheus' own resource usage, scrape success rates, and query performance to ensure your visibility tool does not become a blind spot. A well-configured Prometheus setup becomes the foundation for understanding your Kubernetes environment and responding to issues before they impact users.

