SLI, SLO, SLA — three acronyms one letter apart, constantly used interchangeably, and meaning very different things. Getting them straight matters because they form a chain: you measure an SLI, you target an SLO, and you contract an SLA. Mix up the order and you end up promising customers numbers you've never measured.

SLIs: what you measure

A Service Level Indicator is a metric that captures whether users are experiencing your service as healthy. Good SLIs are ratios of good events to total events, measured as close to the user as possible:

Availability — successful requests ÷ total requests
Latency — requests faster than a threshold ÷ total requests (e.g. "p99 under 300ms" restated as "at least 99% of requests complete in under 300ms")
Quality — responses served from the full pipeline ÷ total (catching degraded modes)
Freshness — for pipelines: records processed within X minutes ÷ total records
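To make the "good events ÷ total events" shape concrete, here is a minimal sketch of the availability and latency SLIs. The `Request` record, the "any non-5xx response counts as good" rule, and the 300 ms default are assumptions for illustration, not a definition from any particular platform.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Hypothetical request record, ideally captured at the edge or load balancer."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def availability_sli(requests: Sequence[Request]) -> float:
    """Good events = requests that did not end in a server error (assumed: status < 500)."""
    if not requests:
        return 1.0  # no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Good events = requests answered faster than the threshold."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # succeeded, but too slow
    Request(503, 90.0),   # fast, but failed
    Request(200, 80.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Both SLIs come out at 0.75 here, but for different requests: one was slow and one failed. That is why availability and latency are tracked as separate indicators.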

The classic mistake is picking SLIs that describe your infrastructure instead of your users. CPU utilization is not an SLI — a user has never complained about your CPU. They complain about errors and slowness.

SLOs: the target you commit to internally

A Service Level Objective is a target value for an SLI over a window: "99.9% of requests succeed, measured over 30 days." It's an internal engineering commitment — the line between "reliable enough, go build features" and "stop and fix reliability."

For intuition, here is how much total downtime each target allows over a 30-day window:

99% — about 7.2 hours
99.9% — about 43 minutes
99.95% — about 22 minutes
99.99% — about 4.3 minutes
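These figures are just the window length multiplied by the allowed failure fraction. A small sketch of that arithmetic, assuming the SLI is time-based availability; the function name `allowed_downtime` is made up for this example:

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window_days: int = 30) -> timedelta:
    """Total downtime a target tolerates over the window (window * (1 - target))."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return timedelta(days=window_days) * (1 - slo_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Changing `window_days` shows how sensitive the numbers are to the window: the same 99.9% target allows about 10 minutes over 7 days.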

As a rule of thumb, each extra nine costs roughly an order of magnitude more engineering effort. The right SLO is the lowest target at which your users still don't notice the failures, not the highest one your team can brag about.

SLAs: the contract with consequences

A Service Level Agreement is a legal commitment to a customer, with remedies (usually service credits) when it's breached. The operational rule of thumb: your SLA must be looser than your SLO. If you promise customers 99.9%, run your internal objective at 99.95% so your team reacts and fixes long before lawyers get involved. An SLA without an underlying measured SLI/SLO is just a number in a contract you're hoping comes true.

Error budgets: the operational payoff

The error budget is the complement of the SLO (100% minus the target): at 99.9%, you have a 0.1% budget of allowed failure. This reframes reliability from "never break" (impossible) to "spend failure deliberately." Budget left? Ship that risky migration. Budget exhausted? Feature work pauses and reliability work takes over. It ends the velocity-vs-stability argument by making it arithmetic instead of opinion.
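As a rough sketch of that arithmetic for a request-based SLO: the budget is the number of bad events the target tolerates, and the remaining budget is what drives the ship-or-freeze decision. The function names and traffic numbers below are illustrative assumptions.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; negative once the budget is blown."""
    return 1 - bad_events / error_budget(slo, total_events)


# Assumed example: 10 million requests in the window, 4,000 of them failed.
total, bad = 10_000_000, 4_000
print(f"budget: {error_budget(0.999, total):.0f} failed requests")
print(f"remaining: {budget_remaining(0.999, total, bad):.0%}")
```

With a 99.9% target the budget is 10,000 failed requests, so 4,000 failures leave 60% of it unspent. A negative result is the signal to pause feature work.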

Burn-rate alerts: how to page on SLOs

Don't alert on the raw SLI: a rule like "error rate above 0.1% for 5 minutes" is too noisy. Alert on how fast you're consuming the budget instead. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 means you'll spend the budget exactly by the end of the window, and a burn rate of 14.4 on a 30-day window means you'll blow through it in ~2 days. The standard multi-window pattern:

page:   burn_rate > 14.4 over 1h  AND over 5m   (budget gone in ~2 days)
page:   burn_rate >  6   over 6h  AND over 30m  (budget gone in ~5 days)
ticket: burn_rate >  3   over 24h               (slow leak, fix this week)

The short window confirms the problem is *still happening*; the long window confirms it's *significant*. Together they page you for real incidents and stay quiet for blips. This is the alerting model aiAxonIQ's SLO tracking ships with — multi-window burn-rate alerts on error budgets, rather than raw-threshold noise. See Alerting for the full picture.
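The pattern can be sketched in a few lines. This is an illustration of the technique under stated assumptions, not aiAxonIQ's implementation: the `RULES` table mirrors the three rules above, and `evaluate` expects a hypothetical mapping from window label to the error ratio observed over that window.

```python
from typing import Mapping, Optional


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts if the current burn rate continues."""
    return window_days / rate


# (severity, burn-rate threshold, long window, short window or None)
RULES = [
    ("page", 14.4, "1h", "5m"),
    ("page", 6.0, "6h", "30m"),
    ("ticket", 3.0, "24h", None),
]


def evaluate(error_ratios: Mapping[str, float], slo: float) -> Optional[str]:
    """Return the severity of the first rule whose windows all exceed its threshold."""
    for severity, threshold, long_w, short_w in RULES:
        windows = [long_w] + ([short_w] if short_w else [])
        if all(burn_rate(error_ratios[w], slo) > threshold for w in windows):
            return severity
    return None


# Ongoing incident: both the 1h and the 5m window burn faster than 14.4x.
incident = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "6h": 0.004, "24h": 0.001}
# Short blip: the 5m window spikes, but the 1h window barely moves.
blip = {"5m": 0.02, "30m": 0.005, "1h": 0.002, "6h": 0.0005, "24h": 0.0002}

print(evaluate(incident, slo=0.999))  # page
print(evaluate(blip, slo=0.999))      # None
```

Requiring both windows is what keeps the blip quiet: its 5-minute burn rate is about 20, but the 1-hour burn rate is only about 2, so no rule fires.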

TL;DR

SLI — the measurement (ratio of good events to total)
SLO — the internal target for that measurement, plus a time window
SLA — the external contract, always looser than the SLO
Error budget — the allowed failure the SLO implies; spend it deliberately
Burn rate — how fast you're spending it; alert on this, not raw error rates
