SLI, SLO, SLA: three acronyms one letter apart, constantly used interchangeably, and meaning very different things. Getting them straight matters because they form a chain: you measure an SLI, you target an SLO, and you contract an SLA. Mix up the order and you end up promising customers numbers you've never measured.
SLIs: what you measure
A Service Level Indicator is a metric that captures whether users are experiencing your service as healthy. Good SLIs are ratios of good events to total events, measured as close to the user as possible:
- Availability: successful requests ÷ total requests
- Latency: requests faster than a threshold ÷ total requests (e.g. "p99 under 300ms" becomes "99% of requests complete in under 300ms")
- Quality: responses served from the full pipeline ÷ total responses (catches degraded modes)
- Freshness: for pipelines, records processed within X minutes ÷ total records
The classic mistake is picking SLIs that describe your infrastructure instead of your users. CPU utilization is not an SLI: no user has ever complained about your CPU. They complain about errors and slowness.
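The ratio definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a production collector: the `Request` record, its fields, the 300 ms default threshold, and the choice to count only 5xx responses as failures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record, ideally captured near the user."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # end-to-end latency for the request


def availability_sli(requests: list[Request]) -> float:
    """Good events / total events, where 'good' means no server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as no failures
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Good events / total events, where 'good' means faster than the threshold."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(503, 40.0),   # fast, but failed: hurts availability only
    Request(200, 950.0),  # succeeded, but slow: hurts latency only
]
print(availability_sli(sample), latency_sli(sample))
```

Note that the failed request and the slow request are different requests, which is why availability and latency are tracked as separate SLIs even when they happen to produce the same ratio.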
SLOs: the target you commit to internally
A Service Level Objective is a target value for an SLI over a window: "99.9% of requests succeed, measured over 30 days." It's an internal engineering commitment: the line between "reliable enough, go build features" and "stop and fix reliability."
For intuition, here is how much downtime each target allows per 30 days:
- 99%: about 7.2 hours
- 99.9%: about 43 minutes
- 99.95%: about 22 minutes
- 99.99%: about 4.3 minutes
As a rule of thumb, each extra nine costs roughly an order of magnitude more engineering effort. The right SLO is the lowest one your users won't notice, not the highest one your team can brag about.
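The downtime figures in this section come from multiplying the window length by one minus the target. A short sketch of that arithmetic, where the function name `allowed_downtime` is hypothetical and the 30-day default simply mirrors the window used in this post:

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Total downtime an availability target permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    return window * (1.0 - slo)


for target in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%}: {minutes:.1f} minutes")
```

Changing the window changes the allowance proportionally, so a quarterly SLO at the same percentage permits about three times the downtime of a 30-day one.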
SLAs: the contract with consequences
A Service Level Agreement is a legal commitment to a customer, with remedies (usually service credits) when it's breached. The operational rule of thumb: your SLA must be looser than your SLO. If you promise customers 99.9%, run your internal objective at 99.95% so your team reacts and fixes long before lawyers get involved. An SLA without an underlying measured SLI/SLO is just a number in a contract you're hoping comes true.
Error budgets: the operational payoff
The error budget is the complement of the SLO (1 minus the target): at 99.9%, you have a 0.1% budget of allowed failure. This reframes reliability from "never break" (impossible) to "spend failure deliberately." Budget left? Ship that risky migration. Budget exhausted? Feature work pauses and reliability work takes over. It ends the velocity-vs-stability argument by making it arithmetic instead of opinion.
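To make the arithmetic concrete, here is a minimal sketch of tracking how much budget is left. The function names and the 10% reserve used in the shipping decision are illustrative assumptions, not a standard policy; each team sets its own rule for what happens when the budget runs low.

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def can_ship_risky_change(remaining: float, reserve: float = 0.1) -> bool:
    """Hypothetical policy: keep a reserve of budget before taking new risk."""
    return remaining > reserve


# 1,000,000 requests at a 99.9% SLO allow about 1,000 failures; 400 occurred.
remaining = error_budget_remaining(0.999, good=999_600, total=1_000_000)
print(f"{remaining:.0%} of budget left, ship: {can_ship_risky_change(remaining)}")
```

A negative result means the SLO has already been missed for the window, which is the point where the post's "feature work pauses" rule would apply.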
Burn-rate alerts: how to page on SLOs
Don't alert on the raw SLI ("error rate above 0.1% for 5 minutes" is too noisy). Alert on how fast you're consuming the budget. A burn rate of 1 means you'll spend the budget exactly by the end of the window; a burn rate of 14 means you'll blow through a 30-day budget in ~2 days. The standard multi-window pattern:
    page:   burn_rate > 14.4 over 1h AND over 5m    (budget gone in ~2 days)
    page:   burn_rate > 6    over 6h AND over 30m   (budget gone in ~5 days)
    ticket: burn_rate > 3    over 24h               (slow leak, fix this week)
The short window confirms the problem is *still happening*; the long window confirms it's *significant*. Together they page you for real incidents and stay quiet for blips. This is the alerting model aiAxonIQ's SLO tracking ships with: multi-window burn-rate alerts on error budgets, rather than raw-threshold noise.
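The rules above can be expressed as a small evaluator. This is a sketch under stated assumptions, not how any particular product implements it: the `Rule` type, the `evaluate` function, and the idea that error rates arrive as a dict keyed by window name are all hypothetical. Burn rate here is the observed error rate divided by the budget (1 minus the SLO).

```python
from dataclasses import dataclass
from typing import Optional


def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate as a multiple of the rate that exactly spends the budget."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if this burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


@dataclass(frozen=True)
class Rule:
    severity: str
    threshold: float
    long_window: str
    short_window: Optional[str]  # None means no confirmation window


# Thresholds and windows copied from the pattern in this post.
RULES = [
    Rule("page", 14.4, "1h", "5m"),
    Rule("page", 6.0, "6h", "30m"),
    Rule("ticket", 3.0, "24h", None),
]


def evaluate(error_rates: dict[str, float], slo: float) -> Optional[str]:
    """Return the severity of the first rule whose windows all exceed its threshold."""
    for rule in RULES:
        windows = [rule.long_window]
        if rule.short_window is not None:
            windows.append(rule.short_window)
        if all(burn_rate(error_rates[w], slo) > rule.threshold for w in windows):
            return rule.severity
    return None


# A spike that has already recovered: the 1h window is hot, the 5m window is clean.
blip = {"5m": 0.0, "30m": 0.0005, "1h": 0.02, "6h": 0.004, "24h": 0.001}
# An ongoing outage: both the 1h and 5m windows are burning at about 20x.
outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.005, "24h": 0.002}
print(evaluate(blip, slo=0.999), evaluate(outage, slo=0.999))
```

In the `blip` case the long window alone would have paged; requiring the short window to agree is what keeps the alert quiet once the problem has stopped.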
TL;DR
- SLI: the measurement (ratio of good events to total)
- SLO: the internal target for that measurement, plus a time window
- SLA: the external contract, always looser than the SLO
- Error budget: the allowed failure the SLO implies; spend it deliberately
- Burn rate: how fast you're spending it; alert on this, not raw error rates