Kubernetes generates an overwhelming amount of telemetry, and the usual failure mode is collecting all of it while watching none of it. This checklist works through the three layers that matter (control plane, nodes, workloads) and ends with the five alerts worth setting up first.
Layer 1: the control plane
If the control plane is unhealthy, everything else is unreliable, including your ability to fix it. Watch:
- API server request latency and error rate: the single best proxy for "is the cluster responsive"; a slow apiserver means slow deploys, slow autoscaling, slow kubectl
- etcd health: leader elections and fsync latency; etcd trouble is cluster trouble
- Scheduler pending pods: pods stuck in `Pending` mean the scheduler can't place work, usually because of resource exhaustion, taints, or affinity mistakes
- Controller manager work queue depth: a growing queue means reconciliation is falling behind
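As a rough illustration, the four signals above could be captured as Prometheus recording rules. This is a sketch, assuming a Prometheus-compatible backend that scrapes the apiserver, etcd, and controller manager directly and also runs kube-state-metrics. The metric names are the ones those components commonly expose; the group and rule names are made up for this example. Managed clusters often hide etcd and controller-manager metrics, so some of these may not be available to you.

```yaml
# Sketch only: assumes control-plane /metrics endpoints and kube-state-metrics
# are scraped. Group and record names are illustrative, not a standard.
groups:
  - name: control-plane-signals
    rules:
      # p99 API server latency per verb, excluding long-lived requests
      - record: cluster:apiserver_request_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          )
      # share of API requests answered with a 5xx status
      - record: cluster:apiserver_request_errors:ratio_5m
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum(rate(apiserver_request_total[5m]))
      # p99 etcd write-ahead-log fsync latency per member
      - record: cluster:etcd_wal_fsync:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          )
      # leader changes seen over the last hour
      - record: cluster:etcd_leader_changes:increase_1h
        expr: max(increase(etcd_server_leader_changes_seen_total[1h]))
      # pods currently waiting for the scheduler
      - record: cluster:pods_pending:count
        expr: sum(kube_pod_status_phase{phase="Pending"})
      # deepest controller work queue, per controller
      - record: cluster:controller_workqueue_depth:max
        expr: max by (name) (workqueue_depth)
```

Recording these once keeps dashboards and alerts reading from the same definitions instead of each re-deriving the query.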
Layer 2: nodes
Nodes are where capacity problems live. The kubelet and node exporter cover the essentials:
- Memory pressure and OOM kills: the kernel killing containers is the most common "mystery restart" cause in production clusters
- CPU throttling: workloads with CPU limits can be throttled while the node looks idle; check `container_cpu_cfs_throttled_periods_total`, not just usage
- Disk pressure: full node disks trigger image garbage-collection storms and pod evictions
- Network errors: packet drops at the CNI layer surface as inexplicable application timeouts
- Node conditions: `Ready`, `MemoryPressure`, `DiskPressure` flapping is an early-warning signal
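The throttling, OOM, and flapping checks above can be sketched as queries. The rules below assume a Prometheus-compatible backend with cAdvisor metrics (scraped via the kubelet) and kube-state-metrics; the rule names are hypothetical. The throttled-periods counter only means something relative to the total number of CFS periods, which is why the first rule divides by `container_cpu_cfs_periods_total`.

```yaml
# Sketch only: assumes cAdvisor and kube-state-metrics are scraped.
# Rule names are illustrative.
groups:
  - name: node-signals
    rules:
      # fraction of CFS periods in which a container was throttled
      - record: container:cpu_cfs_throttled:ratio_5m
        expr: |
          sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
            /
          sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
      # containers whose most recent termination was an OOM kill
      - record: namespace:containers_oomkilled:count
        expr: |
          sum by (namespace) (
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
          )
      # how many times each node's Ready condition changed in the last 30 minutes
      - record: node:ready_condition:changes_30m
        expr: |
          changes(kube_node_status_condition{condition="Ready", status="true"}[30m])
```

A throttling ratio that stays high while node CPU is mostly idle usually points at a CPU limit set too low for the workload rather than at a capacity shortage.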
Layer 3: workloads
This is the layer your users actually feel:
- Restart counts and `CrashLoopBackOff`: the canonical "something is wrong" signal; a slow restart leak is as telling as a loop
- Deployment replica health: desired vs available replicas catches rollouts that never converge
- HPA saturation: an autoscaler pinned at `maxReplicas` is a capacity incident scheduled for your next traffic spike
- Requests vs usage: chronically over-requested resources waste cluster capacity; under-requested ones invite eviction
- Application RED metrics: rate, errors, duration per service, which is what the layers below ultimately exist to protect
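Replica health and requests-vs-usage both reduce to comparing two series. A minimal sketch, again assuming a Prometheus-compatible backend with kube-state-metrics and cAdvisor metrics; the rule names are hypothetical:

```yaml
# Sketch only: assumes kube-state-metrics and cAdvisor are scraped.
# Rule names are illustrative.
groups:
  - name: workload-signals
    rules:
      # gap between desired and available replicas, per Deployment
      - record: deployment:replicas_unavailable:count
        expr: |
          kube_deployment_spec_replicas - kube_deployment_status_replicas_available
      # CPU actually used as a fraction of CPU requested, per pod
      - record: pod:cpu_usage_vs_request:ratio_5m
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
      # memory working set as a fraction of memory requested, per pod
      - record: pod:memory_working_set_vs_request:ratio
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
```

A ratio that sits far below 1 for weeks suggests over-requesting; a memory ratio above 1 marks a pod using more than it asked for, which makes it an early eviction candidate when the node comes under pressure.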
Collecting it all with the OpenTelemetry Collector
You don't need an agent per concern. Run as a DaemonSet, the OpenTelemetry Collector covers kubelet stats and host metrics and, critically, enriches every span and log from your apps with Kubernetes metadata via the `k8sattributes` processor:
    receivers:
      kubeletstats:
        collection_interval: 30s
      hostmetrics:
        scrapers:            # a map keyed by scraper name, not a list
          cpu:
          memory:
          disk:
          network:
          filesystem:
    processors:
      k8sattributes:         # stamps pod, namespace, node onto everything
      batch:
    exporters:
      otlphttp:
        endpoint: https://ingest.aiaxoniq.com
        headers:
          x-license-key: ${LICENSE_KEY}
    service:
      pipelines:
        metrics:
          receivers: [kubeletstats, hostmetrics]
          processors: [k8sattributes, batch]
          exporters: [otlphttp]

With metadata enrichment in place, "which pod threw these errors, on which node, during which deploy" becomes a filter instead of an investigation.
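The config above wires up only a metrics pipeline, so spans and logs need pipelines of their own before the processor can enrich them. One way that might look, as a sketch: it assumes your apps export OTLP to the collector on their own node, and that the DaemonSet pod spec sets an environment variable from the Downward API (`spec.nodeName`). The variable name `KUBE_NODE_NAME` is an assumption, not something the original config defines.

```yaml
# Sketch only: adds trace and log pipelines alongside the metrics pipeline.
# KUBE_NODE_NAME is an assumed env var name populated from spec.nodeName.
receivers:
  otlp:                        # apps on the node send spans and logs here
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  k8sattributes:
    filter:
      node_from_env_var: KUBE_NODE_NAME   # only watch pods on this node
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.deployment.name
    pod_association:
      - sources:
          - from: connection   # match telemetry to a pod by source IP
  batch:
exporters:
  otlphttp:
    endpoint: https://ingest.aiaxoniq.com
    headers:
      x-license-key: ${LICENSE_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```

Restricting the processor to the local node keeps each DaemonSet pod from caching metadata for the whole cluster. The collector's service account also needs read access to pods and namespaces (and ReplicaSets, if you extract deployment names), or the attributes will silently stay empty.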
Five alerts to start with
- Pod `CrashLoopBackOff` or restart count rising in production namespaces
- Pending pods older than 10 minutes (scheduling is stuck)
- Node `NotReady` for more than 5 minutes
- HPA at max replicas for more than 15 minutes
- API server p99 latency or error rate beyond baseline
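For readers on a Prometheus-compatible backend, the five alerts might be written roughly as follows. Treat this as a starting sketch under several assumptions: kube-state-metrics and apiserver metrics are scraped, production namespaces match the hypothetical pattern `prod-.*`, and "beyond baseline" is stood in for by placeholder thresholds (1 second p99, 1% errors, more than 3 restarts per hour) that you should replace with numbers from your own cluster.

```yaml
# Sketch only: alert names, the prod-.* namespace pattern, and all numeric
# thresholds are placeholders. The "for" durations come from the list above
# where it gives one.
groups:
  - name: kubernetes-starter-alerts
    rules:
      - alert: PodCrashLoopingOrRestarting
        expr: |
          max by (namespace, pod, container) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"prod-.*"}
          ) == 1
            or
          max by (namespace, pod, container) (
            increase(kube_pod_container_status_restarts_total{namespace=~"prod-.*"}[1h])
          ) > 3
        for: 5m
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 10m
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
      - alert: APIServerSlowOrErroring
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))
          ) > 1
            or
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 0.01
        for: 10m
```

The `for` clause is what turns a momentary condition into "older than 10 minutes" or "for more than 5 minutes": the expression has to stay true for the whole window before the alert fires.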
Five alerts, three layers, one collector: that's a Kubernetes observability foundation you can actually maintain. Everything else (service maps, SLOs per deployment, cost per namespace) builds on top of this data once it's flowing.