Kubernetes generates an overwhelming amount of telemetry, and the usual failure mode is collecting all of it while watching none of it. This checklist works through the three layers that matter — control plane, nodes, workloads — and ends with the five alerts worth setting up first.

Layer 1: the control plane

If the control plane is unhealthy, everything else is unreliable — including your ability to fix it. Watch:

API server request latency and error rate — the single best proxy for "is the cluster responsive"; slow apiserver means slow deploys, slow autoscaling, slow kubectl
etcd health — leader elections and fsync latency; etcd trouble is cluster trouble
Scheduler pending pods — pods stuck in Pending mean the scheduler can't place work: resource exhaustion, taints, or affinity mistakes
Controller manager work queue depth — a growing queue means reconciliation is falling behind
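
As a rough illustration of the API server error-rate signal, the sketch below computes the share of 5xx responses from per-status-code request counts over one window. It assumes the counts are deltas of a counter such as apiserver_request_total grouped by its code label; the function name and input shape are hypothetical, not part of any library.

```python
from typing import Mapping


def error_rate(requests_by_code: Mapping[str, float]) -> float:
    """Share of requests in a window that ended in a 5xx status.

    `requests_by_code` maps an HTTP status code (as a string) to the number
    of requests seen with that code during the window.
    """
    total = sum(requests_by_code.values())
    if total == 0:
        # No traffic in the window: report zero rather than divide by zero.
        return 0.0
    errors = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return errors / total


window = {"200": 950, "404": 30, "500": 15, "503": 5}
print(f"{error_rate(window):.1%}")  # 20 of 1000 requests were 5xx
```

Client errors (4xx) are left out of the numerator on purpose: they usually reflect what callers asked for, not the health of the API server.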

Layer 2: nodes

Nodes are where capacity problems live. The kubelet and node exporter cover the essentials:

Memory pressure and OOM kills — the kernel killing containers is the most common "mystery restart" cause in production clusters
CPU throttling — workloads with CPU limits can be throttled while the node looks idle; check container_cpu_cfs_throttled_periods_total, not just usage
Disk pressure — full node disks trigger image garbage-collection storms and pod evictions
Network errors — packet drops at the CNI layer surface as inexplicable application timeouts
Node conditions — flapping Ready, MemoryPressure, or DiskPressure is an early-warning signal
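
To make the CPU throttling check concrete, here is a minimal sketch that turns two scrapes of a container's cumulative CFS counters into a throttling ratio. It assumes you have container_cpu_cfs_throttled_periods_total and its companion counter container_cpu_cfs_periods_total for the same container; the CfsSample type and function name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CfsSample:
    """One scrape of a container's cumulative CFS counters."""
    periods_total: float            # container_cpu_cfs_periods_total
    throttled_periods_total: float  # container_cpu_cfs_throttled_periods_total


def throttle_ratio(earlier: CfsSample, later: CfsSample) -> float:
    """Fraction of CFS periods between two scrapes in which the container was throttled."""
    periods = later.periods_total - earlier.periods_total
    throttled = later.throttled_periods_total - earlier.throttled_periods_total
    if periods <= 0 or throttled < 0:
        # No elapsed periods, or a counter reset (container restart): no meaningful ratio.
        return 0.0
    return throttled / periods


ratio = throttle_ratio(CfsSample(1000, 100), CfsSample(1300, 250))
print(ratio)  # 150 throttled periods out of 300 elapsed -> 0.5
```

A container can show a high ratio here while its average CPU usage looks modest, which is why usage alone is not enough.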

Layer 3: workloads

This is the layer your users actually feel:

Restart counts and `CrashLoopBackOff` — the canonical "something is wrong" signal; a slow restart leak is as telling as a loop
Deployment replica health — desired vs available replicas catches rollouts that never converge
HPA saturation — an autoscaler pinned at maxReplicas is a capacity incident scheduled for your next traffic spike
Requests vs usage — chronically over-requested resources waste cluster capacity; under-requested ones invite eviction
Application RED metrics — rate, errors, duration per service, which is what the layers below ultimately exist to protect
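
The requests-vs-usage comparison can be sketched as a simple ratio check. The thresholds below (under 50% of the request counts as over-requested, above 100% as under-requested) and the function name are illustrative assumptions, not recommendations from any tool; tune them to your own workloads.

```python
def classify_request(requested: float, used_p95: float,
                     low: float = 0.5, high: float = 1.0) -> str:
    """Compare a container's resource request with its observed p95 usage.

    Both values must be in the same unit (for example CPU cores or bytes).
    """
    if requested <= 0:
        # No request set: the scheduler has nothing to reserve for this container.
        return "no-request"
    utilisation = used_p95 / requested
    if utilisation < low:
        return "over-requested"    # capacity is reserved but mostly idle
    if utilisation > high:
        return "under-requested"   # regularly runs above what was reserved
    return "ok"


print(classify_request(requested=2.0, used_p95=0.5))  # 25% utilisation
print(classify_request(requested=1.0, used_p95=1.5))  # 150% utilisation
```

Using a high percentile of usage instead of the mean keeps short bursts from being averaged away.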

Collecting it all with the OpenTelemetry Collector

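You don't need an agent per concern. The OpenTelemetry Collector, run as a DaemonSet, covers kubelet stats and host metrics and, critically, stamps Kubernetes metadata onto the telemetry passing through it via the k8sattributes processor. The config below wires up the metrics pipeline; spans and logs from your apps get the same enrichment once they are routed through the same processor: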

receivers:
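  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount   # authenticate to the kubelet with the pod's service account
    endpoint: "https://${env:K8S_NODE_NAME}:10250"   # assumes K8S_NODE_NAME is injected via the downward API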
  hostmetrics:
    scrapers:   # a map keyed by scraper name, not a list
      cpu:
      memory:
      disk:
      network:
      filesystem:

processors:
  k8sattributes:   # stamps pod, namespace, node onto everything
  batch:

exporters:
  otlphttp:
    endpoint: https://ingest.aiaxoniq.com
    headers:
      x-license-key: ${LICENSE_KEY}

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, hostmetrics]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]

With metadata enrichment in place, "which pod threw these errors, on which node, during which deploy" becomes a filter instead of an investigation.

Five alerts to start with

Pod CrashLoopBackOff or restart count rising in production namespaces
Pending pods older than 10 minutes (scheduling is stuck)
Node NotReady for more than 5 minutes
HPA at max replicas for more than 15 minutes
API server p99 latency or error rate beyond baseline
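
Several of these alerts share one shape: a condition must hold continuously for some duration before it fires. The sketch below shows that logic in isolation, under the assumption that you feed it one observation per evaluation tick; the SustainedCondition class is hypothetical and stands in for whatever your alerting backend provides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires only once a condition has held continuously for longer than `hold_for`."""
    hold_for: timedelta
    _since: Optional[datetime] = field(default=None, init=False, repr=False)

    def observe(self, is_bad: bool, now: datetime) -> bool:
        if not is_bad:
            # Any healthy observation resets the clock.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return now - self._since > self.hold_for


node_not_ready = SustainedCondition(hold_for=timedelta(minutes=5))
start = datetime(2024, 1, 1, 12, 0)
for minute in (0, 3, 6):
    firing = node_not_ready.observe(is_bad=True, now=start + timedelta(minutes=minute))
    print(minute, firing)
```

The same class with hold_for set to 10 or 15 minutes models the pending-pod and HPA-at-max alerts. The reset on recovery is what keeps brief blips from paging anyone.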

Five alerts, three layers, one collector — that's a Kubernetes observability foundation you can actually maintain. Everything else (service maps and distributed tracing, SLOs per deployment, cost per namespace) builds on top of this metrics data once it's flowing.
