Kubernetes Tutorial

Kubernetes Observability: The Complete Checklist

Pods, nodes, deployments, namespaces: monitoring Kubernetes can feel overwhelming. Start here.

aiAxonIQ Team

Engineering at aiAxonIQ

Feb 10, 2026 · 10 min read

Kubernetes generates an overwhelming amount of telemetry, and the usual failure mode is collecting all of it while watching none of it. This checklist works through the three layers that matter (control plane, nodes, workloads) and ends with the five alerts worth setting up first.

Layer 1: the control plane

If the control plane is unhealthy, everything else is unreliable, including your ability to fix it. Watch:

  • API server request latency and error rate: the single best proxy for "is the cluster responsive"; a slow apiserver means slow deploys, slow autoscaling, slow kubectl
  • etcd health: leader elections and fsync latency; etcd trouble is cluster trouble
  • Scheduler pending pods: pods stuck in Pending mean the scheduler can't place work, usually because of resource exhaustion, taints, or affinity mistakes
  • Controller manager work queue depth: a growing queue means reconciliation is falling behind
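
As a rough sketch, the four signals above could be captured as Prometheus recording rules. This assumes a Prometheus-compatible backend that scrapes the control-plane components directly, which many managed Kubernetes services do not allow. The group name and the `controlplane:` record names are hypothetical; the metric names are the ones the upstream components expose.

```yaml
groups:
  - name: control-plane-signals   # hypothetical group name
    rules:
      # p99 API server latency per verb; long-running verbs are excluded
      - record: controlplane:apiserver_request_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          )
      # Share of API server requests answered with a 5xx code
      - record: controlplane:apiserver_request_errors:ratio_rate5m
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))
      # p99 etcd WAL fsync latency per member
      - record: controlplane:etcd_wal_fsync_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            )
          )
      # Leader elections seen in the last hour
      - record: controlplane:etcd_leader_changes:increase1h
        expr: max(increase(etcd_server_leader_changes_seen_total[1h]))
      # Pods the scheduler has tried and failed to place
      - record: controlplane:scheduler_unschedulable_pods
        expr: sum(scheduler_pending_pods{queue="unschedulable"})
      # Work queue depth per controller
      - record: controlplane:workqueue_depth
        expr: sum by (name) (workqueue_depth)
```

Recording the values first keeps dashboards and any later alerts reading from the same definitions.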

Layer 2: nodes

Nodes are where capacity problems live. The kubelet and node exporter cover the essentials:

  • Memory pressure and OOM kills: the kernel killing containers is the most common "mystery restart" cause in production clusters
  • CPU throttling: workloads with CPU limits can be throttled while the node looks idle; check container_cpu_cfs_throttled_periods_total (relative to container_cpu_cfs_periods_total), not just usage
  • Disk pressure: full node disks trigger image garbage-collection storms and pod evictions
  • Network errors: packet drops at the CNI layer surface as inexplicable application timeouts
  • Node conditions: Ready, MemoryPressure, or DiskPressure flapping is an early-warning signal
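
The node signals can be sketched the same way, again as Prometheus recording rules. This assumes cAdvisor metrics from the kubelet, node exporter, and kube-state-metrics are all being scraped. The group and record names are hypothetical, and interface-level drop counters only approximate what happens inside the CNI.

```yaml
groups:
  - name: node-signals   # hypothetical group name
    rules:
      # Fraction of CFS periods in which a container was throttled
      - record: node:container_cpu_throttled_periods:ratio_rate5m
        expr: |
          sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
          /
          sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
      # Kernel OOM kills per node over the last hour (node exporter vmstat collector)
      - record: node:oom_kills:increase1h
        expr: sum by (instance) (increase(node_vmstat_oom_kill[1h]))
      # Free space ratio per filesystem, ignoring in-memory and overlay mounts
      - record: node:filesystem_avail:ratio
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
          /
          node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
      # Dropped packets per interface, receive plus transmit
      - record: node:network_drops:rate5m
        expr: |
          rate(node_network_receive_drop_total[5m])
          +
          rate(node_network_transmit_drop_total[5m])
      # How often the Ready condition flipped in the last 15 minutes
      - record: node:ready_condition_changes:15m
        expr: |
          changes(kube_node_status_condition{condition="Ready", status="true"}[15m])
```

A throttling ratio that stays high while CPU usage looks low is the pattern the CPU throttling bullet describes.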

Layer 3: workloads

This is the layer your users actually feel:

  • Restart counts and `CrashLoopBackOff`: the canonical "something is wrong" signal; a slow, steady climb in restarts is as telling as a loop
  • Deployment replica health: desired vs available replicas catches rollouts that never converge
  • HPA saturation: an autoscaler pinned at maxReplicas is a capacity incident scheduled for your next traffic spike
  • Requests vs usage: chronically over-requested resources waste cluster capacity; under-requested ones invite eviction
  • Application RED metrics: rate, errors, and duration per service, which is what the layers below ultimately exist to protect
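
For the Kubernetes-level workload signals, a minimal sketch of dashboard-oriented recording rules follows. It assumes kube-state-metrics and cAdvisor metrics in a Prometheus-compatible backend, and the group and record names are hypothetical. RED metrics are left out because they come from application instrumentation rather than the cluster.

```yaml
groups:
  - name: workload-signals   # hypothetical group name
    rules:
      # Container restarts over the last hour; a small steady value is the slow climb
      - record: workload:container_restarts:increase1h
        expr: increase(kube_pod_container_status_restarts_total[1h])
      # Replicas a Deployment wants but does not have available
      - record: workload:deployment_replicas_missing
        expr: |
          kube_deployment_spec_replicas
          -
          kube_deployment_status_replicas_available
      # 1.0 means the autoscaler has no headroom left
      - record: workload:hpa_replicas:ratio_of_max
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
          /
          kube_horizontalpodautoscaler_spec_max_replicas
      # CPU actually used divided by CPU requested, per pod
      - record: workload:cpu_usage_to_request:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
```

A usage-to-request ratio far below 1 suggests over-requesting. A ratio persistently above 1 points at the under-requested pods the list warns about.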

Collecting it all with the OpenTelemetry Collector

You don't need an agent per concern. The OpenTelemetry Collector, run as a DaemonSet, collects kubelet stats and host metrics and, critically, can enrich every span and log from your apps with Kubernetes metadata via the k8sattributes processor. A metrics pipeline looks like this:

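receivers:
  kubeletstats:
    collection_interval: 30s
    # Assumption: the collector authenticates with its service account and
    # K8S_NODE_NAME is injected from spec.nodeName via the downward API.
    auth_type: serviceAccount
    endpoint: "${env:K8S_NODE_NAME}:10250"
  hostmetrics:
    # scrapers is a map keyed by scraper name, not a list
    scrapers:
      cpu:
      memory:
      disk:
      network:
      filesystem:

processors:
  k8sattributes:   # stamps pod, namespace, node onto everything
  batch:

exporters:
  otlphttp:
    endpoint: https://ingest.aiaxoniq.com
    headers:
      x-license-key: ${env:LICENSE_KEY}   # read from the collector's environment

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, hostmetrics]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]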

With metadata enrichment in place, "which pod threw these errors, on which node, during which deploy" becomes a filter instead of an investigation.
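
The config above only wires up a metrics pipeline and leaves k8sattributes on its defaults. A sketch of how spans and logs might be routed through the same processor follows, as a fragment to merge into the existing file. It assumes applications send OTLP to the node-local collector, that a K8S_NODE_NAME environment variable is set on the collector pod, and that the collector's service account has RBAC permission to read pods. The ports are the conventional OTLP defaults.

```yaml
receivers:
  otlp:                      # apps send spans and logs here
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  k8sattributes:
    auth_type: serviceAccount
    filter:
      node_from_env_var: K8S_NODE_NAME   # only watch pods on this node
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.node.name
        - k8s.deployment.name
    pod_association:
      # Prefer an explicit pod UID on the telemetry, else fall back to the sender's IP
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]    # the exporter defined in the main config
    logs:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```

Restricting the processor to the local node keeps each DaemonSet pod from caching metadata for every pod in the cluster.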

Five alerts to start with

  • Pod CrashLoopBackOff or restart count rising in production namespaces
  • Pending pods older than 10 minutes (scheduling is stuck)
  • Node NotReady for more than 5 minutes
  • HPA at max replicas for more than 15 minutes
  • API server p99 latency or error rate beyond baseline
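
If the data lands in a Prometheus-compatible backend with kube-state-metrics available, the five alerts could be sketched as the rule file below. This is an assumption about the stack, not something the collector config provides. The alert names, the `prod-.*` namespace pattern, the restart threshold, and the 1-second latency threshold are all placeholders to replace with your own baseline.

```yaml
groups:
  - name: starter-alerts   # hypothetical group name
    rules:
      - alert: PodCrashLooping
        expr: |
          kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"prod-.*"} == 1
          or
          increase(kube_pod_container_status_restarts_total{namespace=~"prod-.*"}[1h]) > 3
        for: 5m
        labels:
          severity: page
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 10m
        labels:
          severity: page
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: page
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
          >=
          kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warn
      - alert: APIServerSlow
        # Placeholder threshold; derive the real one from your observed baseline
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warn
```

The `for:` durations mirror the time windows in the list, so a brief blip does not page anyone.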

Five alerts, three layers, one collector: that's a Kubernetes observability foundation you can actually maintain. Everything else (service maps, SLOs per deployment, cost per namespace) builds on top of this data once it's flowing.

Thanks for reading!
