PromQL looks intimidating, but real-world usage collapses into a small set of patterns. Learn these ten and you can read almost any dashboard and write almost any alert. Each one below is copy-paste-adaptable: swap in your own metric names.
1. Request rate - rate() on a counter
Counters only go up (apart from resets when a process restarts), so their raw value tells you little. rate() turns them into per-second throughput averaged over the window:
rate(http_requests_total[5m])
The [5m] window smooths the result; shorter windows react faster but are noisier. Rule of thumb: at least 4× your scrape interval.
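As a rough worked example of the rule of thumb, assume a 15-second scrape interval (an assumption; check the scrape_interval in your own config). Four scrapes span 60 seconds, so 1m is about the shortest window worth using. The two queries below are separate expressions, shown together for comparison:

```promql
# Assumes scrape_interval = 15s, so 4 x 15s = 1m is the shortest sensible window.
rate(http_requests_total[1m])

# A longer window gives a smoother but slower-reacting series.
rate(http_requests_total[15m])
```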
2. Error ratio - dividing two rates
The single most useful alerting expression in PromQL:
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
A ratio is robust to traffic changes: 50 errors/sec means nothing without knowing whether that's out of 100 or 100,000.
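To turn the ratio into an alert condition, one possible sketch is to split it per service and compare against a threshold. The service label and the 1% threshold are assumptions for illustration, not values from any particular setup:

```promql
# Error ratio per service; the `service` label and the 0.01 (1%) threshold are assumptions.
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (service) (rate(http_requests_total[5m]))
> 0.01
```

Division binds tighter than the comparison, so the threshold applies to the ratio. A service with no 5xx series at all has nothing on the left-hand side to match, so it drops out of the result rather than showing 0.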
3. Latency percentiles - histogram_quantile
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
The le label is the histogram bucket boundary and must survive the sum by; forget it and you get nothing. Averages hide tail pain; percentiles are what users feel.
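Two variations that may help here, assuming the histogram also carries a service label (an assumption): a per-service p95 that keeps le next to the extra label, and the mean latency from the same histogram's _sum and _count series for comparison with the percentile.

```promql
# p95 per service: keep `le` alongside the label you want to split on.
histogram_quantile(
  0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# Mean latency from the same histogram, to compare against the percentile.
  sum(rate(http_request_duration_seconds_sum[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```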
4. Group by label - sum by
sum by (service) (rate(http_requests_total[5m]))
The aggregation workhorse. Use sum by (service, status) for a two-dimensional breakdown, or sum without (instance) to keep every label except one.
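Spelled out in full, the two variants mentioned above would look roughly like this (two separate queries):

```promql
# Two-dimensional breakdown: one series per (service, status) pair.
sum by (service, status) (rate(http_requests_total[5m]))

# Drop only `instance`; every other label is preserved.
sum without (instance) (rate(http_requests_total[5m]))
```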
5. Top offenders - topk
topk(5, sum by (service) (rate(http_requests_total{status=~"5.."}[5m])))
Perfect for "which five services are throwing the most errors right now" panels, the first thing to glance at during an incident.
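When services differ widely in traffic, ranking by error ratio rather than raw error count may be more telling. A sketch that reuses the same metric and label names:

```promql
# Five services with the highest error ratio, rather than the highest error count.
topk(
  5,
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
    sum by (service) (rate(http_requests_total[5m]))
)
```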
6. Growth over a day - increase
increase(payment_failures_total[24h])
increase is rate × window: total events over the period instead of per-second. Right tool for "how many failures today" questions.
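The relationship can be written out directly. The first query below should give the same result as increase(payment_failures_total[24h]); the second is an alert-style condition whose threshold of 100 is only a placeholder:

```promql
# Per-second rate multiplied by the window length in seconds (24h = 24 * 3600 s).
rate(payment_failures_total[24h]) * 24 * 3600

# Alert-style condition; the threshold of 100 is a placeholder.
increase(payment_failures_total[24h]) > 100
```

Because Prometheus extrapolates to the edges of the window, the result can be a non-integer even though the counter only counts whole events.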
7. Is it even there? - absent()
absent(up{job="payment-service"})
Fires when the series doesn't exist at all. Without it, a dead service produces no data, no data matches no threshold, and your alerts stay green while production burns. Pair with up == 0 for scrape failures.
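The companion checks might look like the following. The 10-minute window in the second query is an assumption; absent_over_time is the range-vector counterpart of absent() and tolerates brief gaps:

```promql
# Target is known to Prometheus but the most recent scrape failed.
up{job="payment-service"} == 0

# No sample at all in the last 10 minutes (the 10m window is an assumption).
absent_over_time(up{job="payment-service"}[10m])
```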
8. Saturation - how close to the ceiling
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
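Utilization tells you what's used; saturation tells you what's left before things break. This expression yields the used fraction, from 0 to 1, so values near 1 mean little headroom remains. The same shape works for disk, file descriptors, and connection pools.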
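For instance, the same shape applied to disk and file descriptors could look like this. The filesystem metrics are node_exporter's; the process_* metrics are exposed by many client libraries, though whether your targets export them is an assumption:

```promql
# Disk: fraction of each filesystem in use.
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)

# File descriptors: fraction of the per-process limit in use.
process_open_fds / process_max_fds
```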
9. Smoothing spikes - avg_over_time
avg_over_time(queue_depth[1h])
For gauges that bounce around, the time-smoothed value separates "momentary blip" from "sustained problem", often the difference between a page and a ticket.
10. Will we run out? - predict_linear
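One way to express that split is a pair of conditions, one on the smoothed value and one on the recent peak. Both thresholds are placeholders to tune for your own queue:

```promql
# Sustained problem: the hourly average is above the threshold (100 is a placeholder).
avg_over_time(queue_depth[1h]) > 100

# Momentary blip: the recent peak is high even if the average is fine (500 is a placeholder).
max_over_time(queue_depth[5m]) > 500
```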
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
Linear extrapolation: "at the current trend, will the disk be full in 24 hours?" Alerting on predicted exhaustion beats alerting on 90%-full: you get hours of runway instead of minutes.
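A short burst of writes can make a straight-line forecast overly pessimistic, so one common refinement is to require that the disk is also already fairly full. The 20% free-space guard below is an assumption to adjust:

```promql
# Predicted to hit zero within 24h AND already below 20% free (0.2 is an assumption).
  predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
and
  (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2
```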
Where to go next
These ten cover the bulk of daily PromQL. Combine them (topk over an error ratio, avg_over_time over saturation) and you're writing expert-level queries from a beginner-sized toolbox. And when you'd rather not hand-write them: aiAxonIQ speaks Prometheus remote-write natively, its natural-language query feature generates expressions like these from plain English, and Prophet-based forecasting handles the trend-prediction use case with seasonality that predict_linear's straight line can't capture.
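As a sketch of one of the combinations mentioned above, avg_over_time over saturation needs a subquery, because the saturation expression is an instant vector rather than a stored series. The 1m resolution is an assumption:

```promql
# Subquery: evaluate the inner expression every minute over the last hour,
# then average the results.
avg_over_time(
  (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))[1h:1m]
)
```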