Observability = answer "why is it broken?" from the outside, via three pillars: metrics (numbers over time — Prometheus), logs (discrete events — Loki/ELK), traces (one request across services — OpenTelemetry/Jaeger/Tempo). Senior interviews probe Prometheus' pull model + PromQL, metric types, SLI/SLO/error budgets, and how you'd alert without drowning in noise.
1. The three pillars
| Pillar | Answers | Tools |
|---|---|---|
| Metrics | "Is it healthy? trending where?" — cheap, aggregatable, alertable | Prometheus, VictoriaMetrics, Datadog |
| Logs | "What exactly happened on this event?" | Loki, ELK/OpenSearch, Splunk |
| Traces | "Where did this request spend time / fail across services?" | OpenTelemetry + Jaeger/Tempo |
Rule of thumb: alert on metrics (cheap, low-cardinality), investigate with traces (find the slow hop), confirm with logs (the exact error).
2. Prometheus architecture
- Pull model: Prometheus scrapes HTTP /metrics endpoints on an interval; targets don't push. Service discovery (k8s, Consul, file) finds targets dynamically.
- Exporters expose third-party systems as metrics: node_exporter (host), cAdvisor (containers), blackbox (probes), postgres/redis/kafka exporters.
- Pushgateway — only for short-lived batch jobs that can't be scraped (don't abuse it).
- TSDB — local time-series store; long-term/HA via remote-write to Thanos/Cortex/Mimir/VictoriaMetrics.
- Alertmanager — receives alerts from Prometheus, then dedupes, groups, silences, routes to Slack/PagerDuty/email.
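Because the TSDB stores one time series per unique label set, the worst-case series count for a metric is the product of the number of distinct values each label can take. The sketch below illustrates that arithmetic; the label names and value counts are made-up assumptions, not measurements from a real system.

```python
from math import prod

def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for a single request counter.
bounded = {"method": 5, "status": 6, "instance": 20}
unbounded = {**bounded, "user_id": 100_000}

print(estimate_series(bounded))    # 600
print(estimate_series(unbounded))  # 60000000
```

One unbounded label multiplies every existing combination, which is why a single user_id label is enough to turn hundreds of series into tens of millions.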
cardinality is the killer
Each unique label-set is a separate time series. Putting user_id/request_id in a label = millions of series = Prometheus OOM. Keep labels low-cardinality (bounded sets); high-cardinality data belongs in logs/traces, not metrics.
3. Metric types
| Type | Use |
|---|---|
| Counter | Monotonically increasing total (requests, errors). Query with rate(). |
| Gauge | Value up/down (memory, queue depth, temperature). |
| Histogram | Bucketed observations (latency); gives _bucket/_sum/_count → quantiles via histogram_quantile() server-side. |
| Summary | Client-side quantiles (can't aggregate across instances). Prefer histograms. |
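To make the histogram row concrete, here is a simplified sketch of why histograms aggregate and summaries don't: cumulative bucket counts from several instances can simply be summed per upper bound (le), and a quantile is then estimated by linear interpolation inside the bucket that contains the target rank. This imitates the general approach of histogram_quantile() but is not Prometheus' actual code; it works on raw counts rather than rates, and the pod data is invented.

```python
import math

def merge_buckets(*instances):
    """Sum cumulative bucket counts across instances, keyed by upper bound (le)."""
    merged = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets {le: count}."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return prev_le  # cannot interpolate into +Inf; report highest finite bound
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# Hypothetical latency histograms (seconds) from two pods.
pod_a = {0.1: 50, 0.5: 90, 1.0: 99, math.inf: 100}
pod_b = {0.1: 10, 0.5: 40, 1.0: 90, math.inf: 100}
merged = merge_buckets(pod_a, pod_b)

print(round(histogram_quantile(0.5, merged), 3))  # 0.329
print(histogram_quantile(0.99, merged))           # 1.0
```

The p99 lands in the +Inf bucket here, so the estimate is capped at the highest finite bound. That is a hint that the bucket layout needs a boundary above 1s to say anything useful about the tail.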
histogram, not summary, for p99 across instances
Summaries compute quantiles per-instance and can't be merged. Use histograms + histogram_quantile() so you can aggregate p99 across all pods.
4. PromQL essentials
# request rate (per-second, over 5m window)
rate(http_requests_total[5m])
# error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# p99 latency from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# memory headroom
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# top 5 CPU pods
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
Key functions: rate/irate (per-sec for counters), increase, sum/avg/max by (label), histogram_quantile, predict_linear (disk-full forecast), absent (alert if a series vanishes).
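As a rough illustration of what rate() and increase() do with a counter that restarts mid-window, the sketch below treats any drop between consecutive samples as a reset and counts the post-reset value as new increase. It is a simplification under stated assumptions: real Prometheus also extrapolates to the window boundaries, which is omitted here, and the sample values are invented.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples; a drop is treated as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so the new value is the increase.
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second increase over the sampled window (needs >= 2 samples)."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

# Hypothetical 15s scrapes; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]

print(samples[-1][1] - samples[0][1])  # -60, the naive last-minus-first difference
print(increase(samples))               # 100.0
print(round(rate(samples), 2))         # 1.67
```

The naive difference goes negative across the restart, while the reset-aware version recovers the 100 requests that actually happened.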
always rate() a counter
A raw counter only climbs between restarts and drops back to zero when the process restarts, so graphing it directly is meaningless. Wrap it in rate()/increase(), which handle resets, to get per-second/per-window values.
5. SLI / SLO / error budgets
- SLI (indicator) — a measured ratio: e.g. good requests / total (availability), or fraction under 300ms (latency).
- SLO (objective) — the target: 99.9% over 30 days.
- Error budget — 100% − SLO (0.1% = ~43 min/month). Spend it on risk/velocity; freeze releases when it's exhausted.
- SLA — the contractual promise (with penalties); SLO is your internal, tighter target.
- Burn-rate alerts — alert when you're consuming the error budget too fast (multi-window: fast burn = page, slow burn = ticket). Far better than static thresholds.
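The error-budget and burn-rate arithmetic above can be sketched as follows. Burn rate is taken here as the observed error ratio divided by the budget ratio (1 − SLO), so a burn rate of 1 spends the budget exactly over the SLO window. The 14.4 threshold is an assumption (a commonly quoted fast-burn value for a 30-day window), and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of full unavailability over the SLO window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate holds."""
    return window_days * 24 / rate

def fires(long_window_burn: float, short_window_burn: float, threshold: float) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold."""
    return long_window_burn > threshold and short_window_burn > threshold

burn = burn_rate(error_ratio=0.05, slo=0.999)
print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(burn, 1))                         # 50.0
print(round(hours_to_exhaustion(burn), 1))    # 14.4
print(fires(burn, 48.0, threshold=14.4))      # True
```

Requiring the short window to agree with the long one means the alert clears quickly once the problem stops, instead of paging on an error spike that ended an hour ago.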
6. What to measure — USE & RED
| Method | For | Signals |
|---|---|---|
| USE | Resources (host/infra) | Utilization, Saturation, Errors per resource |
| RED | Request-driven services | Rate, Errors, Duration (percentiles) |
| Four Golden Signals | User-facing systems | Latency, Traffic, Errors, Saturation |
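As an illustration of the RED row, the sketch below derives Rate, Errors, and Duration from a batch of request records. The Request type, the 60-second window, and the sample traffic are hypothetical; in practice these signals would come from counters and histograms rather than raw records.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    status: int
    duration_s: float

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]

def red(requests, window_s):
    """Rate, Errors, Duration for a non-empty batch of requests seen in window_s seconds."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "mean_s": sum(durations) / len(durations),
        "p99_s": percentile(durations, 0.99),
    }

# Hypothetical minute of traffic: 98 fast successes, 2 slow failures.
requests = [Request(200, 0.05)] * 98 + [Request(500, 2.0)] * 2
signals = red(requests, window_s=60)

print(signals["error_ratio"])       # 0.02
print(round(signals["mean_s"], 3))  # 0.089
print(signals["p99_s"])             # 2.0
```

The mean stays under 100 ms while the p99 is 2 s, which is why Duration is tracked as percentiles rather than averages.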
7. Alerting that doesn't suck
- Alert on symptoms, not causes — page on "users see errors / latency" (SLO burn), not "CPU 80%". A busy CPU that serves fine is not an incident.
- Every page must be actionable + urgent. Non-urgent → ticket/dashboard, not a 3am page.
- Use a for: duration to avoid flapping; group + dedupe in Alertmanager; route by severity; silence during maintenance.
- Track alert noise: chronic non-actionable pages cause alert fatigue (the real outage risk).
# Prometheus alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 10m
  labels: { severity: page }
  annotations: { summary: "5xx ratio > 5% for 10m" }
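The effect of the for: clause in the rule above can be sketched as a small state machine: the alert is pending while the expression has been true for less than the configured duration, fires once it has held continuously for that long, and resets the moment the expression goes false. This is a simplified model rather than Prometheus' implementation; the 5-minute evaluation interval and the error ratios are invented.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) per evaluation: inactive, pending, or firing."""
    pending_since = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            state = "firing" if ts - pending_since >= for_seconds else "pending"
        else:
            pending_since = None  # any false evaluation resets the timer
            state = "inactive"
        states.append((ts, state))
    return states

# Hypothetical error ratios evaluated every 300s, checked against 5% with for: 10m.
samples = [(0, 0.02), (300, 0.08), (600, 0.03), (900, 0.07), (1200, 0.09), (1500, 0.06)]
states = alert_states(samples, threshold=0.05, for_seconds=600)

print([state for _, state in states])
# ['inactive', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The single spike at t=300 never pages; only the breach that persists from t=900 to t=1500 does.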
8. Grafana & the stack
- Grafana — dashboards over Prometheus/Loki/Tempo/etc; variables, panels, alerting. Provision dashboards as code (JSON/Jsonnet).
- Loki — log aggregation that indexes labels not full text (cheap); query with LogQL (PromQL-like).
- Tempo / Jaeger — distributed tracing backends; OpenTelemetry = vendor-neutral instrumentation (SDKs + Collector) for metrics/logs/traces.
- Correlate: exemplars link a metric spike → a trace; trace → logs by trace_id.
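A minimal sketch of the trace → logs step, assuming services emit JSON log lines that carry a trace_id field and that an exemplar on a slow latency sample supplies the trace ID to look for. The field names, the exemplar shape, and the log contents are all hypothetical; a real setup would run this lookup as a query in the log backend.

```python
import json

def logs_for_trace(log_lines, trace_id):
    """Return parsed JSON log records whose trace_id matches."""
    records = (json.loads(line) for line in log_lines)
    return [r for r in records if r.get("trace_id") == trace_id]

# Hypothetical exemplar attached to a slow histogram observation.
exemplar = {"value_s": 2.3, "trace_id": "4bf92f"}

# Hypothetical structured log lines from two services.
log_lines = [
    '{"service": "api", "trace_id": "4bf92f", "msg": "calling payments"}',
    '{"service": "api", "trace_id": "a1c3d0", "msg": "ok"}',
    '{"service": "payments", "trace_id": "4bf92f", "msg": "db timeout after 2s"}',
]

matches = logs_for_trace(log_lines, exemplar["trace_id"])
print([(r["service"], r["msg"]) for r in matches])
# [('api', 'calling payments'), ('payments', 'db timeout after 2s')]
```

The join only works if every service propagates the same trace_id into its logs, which is the main thing shared instrumentation buys you.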
9. Senior interview Q&A
- Three pillars of observability? Metrics (trends/alerting), logs (event detail), traces (request across services). Alert on metrics, debug with traces, confirm with logs.
- Pull vs push (Prometheus)? Prometheus pulls/scrapes /metrics from discovered targets. Pull = easy health (scrape fails = target down), central control. Pushgateway only for short batch jobs.
- Counter vs gauge vs histogram? Counter only increases (rate it); gauge up/down; histogram buckets observations for server-side quantiles. Summary = client-side quantiles, can't aggregate.
- Why is cardinality dangerous? Every label-set = a series. High-cardinality labels (user_id, request_id) explode memory/storage and can OOM Prometheus. Keep labels bounded.
- How do you get p99 across all instances? Histograms + histogram_quantile(0.99, sum by (le)(rate(..._bucket[5m]))). Summaries can't be aggregated.
- SLI vs SLO vs SLA vs error budget? SLI = measured ratio; SLO = target (99.9%); SLA = contract w/ penalties; error budget = 100%−SLO, the allowed unreliability to spend.
- What should you alert on? Symptoms users feel (SLO burn rate: error ratio, latency), not raw resource thresholds. Every page actionable + urgent.
- USE vs RED? USE (Utilization/Saturation/Errors) for resources; RED (Rate/Errors/Duration) for services. RED finds the sick service, USE the starved resource.
- Why always rate() a counter? Counters reset on restart and only climb; rate()/increase() handle resets and give per-second/windowed values, which are the meaningful numbers.
- How is Prometheus made HA / long-term? Run redundant scrapers; remote-write to Thanos/Cortex/Mimir/VictoriaMetrics for global view, dedup, downsampling, and long retention.
- How do you correlate metrics, traces, logs? Shared labels + trace_id: exemplars jump from a metric to a trace; the trace's spans link to logs by trace_id. OpenTelemetry standardizes the instrumentation.
- A dashboard is green but users complain: what's wrong? Likely measuring the wrong thing (averages hiding p99, server-side only, no real user signal) or missing an SLI for the failing journey. Add symptom-based SLOs.