Monitoring & Observability Cheatsheet

Observability = answer "why is it broken?" from the outside, via three pillars: metrics (numbers over time — Prometheus), logs (discrete events — Loki/ELK), traces (one request across services — OpenTelemetry/Jaeger/Tempo). Senior interviews probe Prometheus' pull model + PromQL, metric types, SLI/SLO/error budgets, and how you'd alert without drowning in noise.

1. The three pillars

Pillar	Answers	Tools
Metrics	"Is it healthy? trending where?" — cheap, aggregatable, alertable	Prometheus, VictoriaMetrics, Datadog
Logs	"What exactly happened on this event?"	Loki, ELK/OpenSearch, Splunk
Traces	"Where did this request spend time / fail across services?"	OpenTelemetry + Jaeger/Tempo

Rule of thumb: alert on metrics (cheap, low-cardinality), investigate with traces (find the slow hop), confirm with logs (the exact error).

2. Prometheus architecture

Pull model: Prometheus scrapes HTTP /metrics endpoints on an interval — targets don't push. Service discovery (k8s, Consul, file) finds targets dynamically.
Exporters expose third-party systems as metrics: node_exporter (host), cAdvisor (containers), blackbox (probes), postgres/redis/kafka exporters.
Pushgateway — only for short-lived batch jobs that can't be scraped (don't abuse it).
TSDB — local time-series store; long-term/HA via remote-write to Thanos/Cortex/Mimir/VictoriaMetrics.
Alertmanager — receives alerts from Prometheus, then dedupes, groups, silences, routes to Slack/PagerDuty/email.
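What a scrape actually reads is plain text. A minimal sketch of the text exposition format a /metrics endpoint returns; the helper name `render_counter` and the sample values are hypothetical, and label-value escaping is deliberately left out:

```python
# Hypothetical helper: renders one counter family in the Prometheus
# text exposition format (HELP line, TYPE line, one line per label-set).
def render_counter(name, help_text, samples):
    """samples maps a tuple of (label, value) pairs to the counter value."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        # Assumption: label values need no escaping in this sketch.
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "GET"), ("status", "500")): 3,
    },
)
print(body)
```

In practice a client library generates this for you; the point is only that each label-set becomes its own line, and therefore its own series.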

Cardinality is the killer: each unique label-set is a separate time series. Putting user_id/request_id in a label = millions of series = Prometheus OOM. Keep labels low-cardinality (bounded sets); high-cardinality data belongs in logs/traces, not metrics.
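The arithmetic behind the cardinality warning, as a rough sketch: the worst-case series count for one metric is the product of the distinct values of each label. The label names and counts below are made-up assumptions:

```python
import math

def series_count(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())

# Hypothetical label sets for a single metric.
bounded = series_count({"method": 5, "status": 6, "pod": 50})
unbounded = series_count({"method": 5, "status": 6, "pod": 50, "user_id": 100_000})
print(bounded, unbounded)  # 1500 150000000
```

One unbounded label multiplies everything else, which is why it belongs in logs or traces.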

3. Metric types

Type	Use
Counter	Monotonically increasing total (requests, errors). Query with `rate()`.
Gauge	Value up/down (memory, queue depth, temperature).
Histogram	Bucketed observations (latency); gives `_bucket`/`_sum`/`_count` → quantiles via `histogram_quantile()` server-side.
Summary	Client-side quantiles (can't aggregate across instances). Prefer histograms.

Histogram, not summary, for p99 across instances: summaries compute quantiles per instance and can't be merged. Use histograms + histogram_quantile() so you can aggregate p99 across all pods.
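To see why buckets merge and quantiles don't, here is a simplified sketch of the bucket-interpolation idea behind histogram_quantile(). It is not Prometheus' implementation (real edge cases are omitted), and the two pods' bucket counts are invented:

```python
import math

def merge(*instances):
    """Sum cumulative bucket counts per upper bound (the `sum by (le)` step)."""
    merged = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from {upper_bound: cumulative_count}."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets.items())
    total = ordered[-1][1]          # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0    # lowest bucket is assumed to start at 0
    for le, count in ordered:
        if count >= rank:
            if math.isinf(le):
                return prev_le      # fall back to the highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# Hypothetical pods: A is fast, B is slow.
pod_a = {0.1: 90, 0.5: 99, 1.0: 100, math.inf: 100}
pod_b = {0.1: 10, 0.5: 50, 1.0: 100, math.inf: 100}

fleet_p99 = histogram_quantile(0.99, merge(pod_a, pod_b))
avg_of_p99s = (histogram_quantile(0.99, pod_a) + histogram_quantile(0.99, pod_b)) / 2
print(round(fleet_p99, 3), round(avg_of_p99s, 3))
```

Averaging the per-pod p99s understates the fleet-wide p99 here, which is the trap summaries push you into.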

4. PromQL essentials

# request rate (per-second, over 5m window)
rate(http_requests_total[5m])

# error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p99 latency from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# memory headroom
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# top 5 CPU pods
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

Key functions: rate/irate (per-sec for counters), increase, sum/avg/max by (label), histogram_quantile, predict_linear (disk-full forecast), absent (alert if a series vanishes).
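A rough sketch of the idea behind predict_linear: fit a least-squares line through the samples in the window and extend it forward. This simplified version predicts from the last sample's timestamp rather than from the evaluation time, and the disk numbers are made up:

```python
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp_seconds, value) with at least two distinct timestamps.
    Returns the fitted value seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

# Hypothetical free disk space in GiB, sampled every 60s.
free_gib = [(0, 100.0), (60, 90.0), (120, 80.0)]
forecast = predict_linear(free_gib, 600)
print(forecast < 0)  # a negative forecast means the disk fills within 10 minutes
```

A disk-full alert is then just "forecast below zero", which fires before a static threshold would.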

Always rate() a counter: a raw counter only goes up and drops back to zero on restart, so graphing it directly tells you little. Wrap it in rate()/increase(), which handle resets, to get per-second/per-window values.
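Reset handling can be sketched in a few lines. This is a simplification: the real rate() and increase() also extrapolate to the window boundaries, which is skipped here, and the sample values are invented:

```python
def increase(samples):
    """samples: list of (timestamp_seconds, counter_value) in time order."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0,
        # so everything counted since the reset is `cur` itself.
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

# Hypothetical scrape results; the counter resets between t=15 and t=30.
scrapes = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(increase(scrapes), rate(scrapes))
```

A naive last-minus-first would report a negative value for this window; the reset-aware sum does not.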

5. SLI / SLO / error budgets

SLI (indicator) — a measured ratio: e.g. good requests / total (availability), or fraction under 300ms (latency).
SLO (objective) — the target: 99.9% over 30 days.
Error budget — 100% − SLO (0.1% = ~43 min/month). Spend it on risk/velocity; freeze releases when it's exhausted.
SLA — the contractual promise (with penalties); SLO is your internal, tighter target.
Burn-rate alerts — alert when you're consuming the error budget too fast (multi-window: fast burn = page, slow burn = ticket). Far better than static thresholds.
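The budget and burn-rate arithmetic above, as a small sketch. The function names are hypothetical, and the 14.4 paging threshold is a commonly cited value for a 1h/5m window pair, used here as an assumption rather than a recommendation:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes in the window for a given SLO (e.g. 0.999)."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)

def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Multi-window check: page only if both windows burn above the threshold.
    The short window makes the alert stop quickly once the problem is fixed."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)

print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(burn_rate(0.02, 0.999), 1))       # 20.0
```

A burn rate of 1 spends the budget exactly over the SLO window; a sustained rate of 20 spends a 30-day budget in about a day and a half.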

6. What to measure — USE & RED

Method	For	Signals
USE	Resources (host/infra)	Utilization, Saturation, Errors per resource
RED	Request-driven services	Rate, Errors, Duration (percentiles)
Four Golden Signals	User-facing systems	Latency, Traffic, Errors, Saturation
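As an illustration of RED, a sketch that derives the three signals from raw request records. In a real stack these come from counters and histograms, not from a list in memory; the record shape and the nearest-rank percentile are assumptions made for the example:

```python
import math

def red_summary(requests, window_seconds, quantile=0.95):
    """requests: non-empty list of (status_code, duration_seconds) seen in the window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    idx = max(0, math.ceil(quantile * len(durations)) - 1)  # nearest-rank percentile
    return {
        "rate_per_s": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),          # Errors
        "duration_q": durations[idx],                   # Duration
    }

# Hypothetical 10-second window of traffic.
window = [(200, 0.1), (200, 0.2), (500, 0.3), (200, 0.4), (200, 1.0)]
summary = red_summary(window, window_seconds=10)
print(summary)
```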

7. Alerting that doesn't suck

Alert on symptoms, not causes — page on "users see errors / latency" (SLO burn), not "CPU 80%". A busy CPU that serves fine is not an incident.
Every page must be actionable + urgent. Non-urgent → ticket/dashboard, not a 3am page.
Use a for: clause so a brief spike doesn't fire (avoids flapping); group + dedupe in Alertmanager; route by severity; silence during maintenance.
Track alert noise — chronic non-actionable pages cause alert fatigue (the real outage risk).

# Prometheus alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 10m
  labels: { severity: page }
  annotations: { summary: "5xx ratio > 5% for 10m" }
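How the for: clause behaves can be sketched as a tiny state machine: the alert is pending while the expression is true, fires once it has stayed true for the full duration, and resets on any false evaluation. This is a simplified model with a hypothetical function name, not Prometheus' rule evaluator:

```python
def alert_states(evaluations, for_seconds):
    """evaluations: list of (timestamp_seconds, condition_is_true) in time order."""
    states, active_since = [], None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None          # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts            # first true evaluation starts the timer
        states.append("firing" if ts - active_since >= for_seconds else "pending")
    return states

# Condition true at every 60s evaluation; with for: 10m it fires at t=600.
sustained = alert_states([(t, True) for t in range(0, 661, 60)], for_seconds=600)
# A short spike with a gap never leaves pending.
blip = alert_states([(0, True), (60, True), (120, False), (180, True)], for_seconds=600)
print(sustained[-2:], blip)
```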

8. Grafana & the stack

Grafana — dashboards over Prometheus/Loki/Tempo/etc; variables, panels, alerting. Provision dashboards as code (JSON/Jsonnet).
Loki — log aggregation that indexes labels not full text (cheap); query with LogQL (PromQL-like).
Tempo / Jaeger — distributed tracing backends; OpenTelemetry = vendor-neutral instrumentation (SDKs + Collector) for metrics/logs/traces.
Correlate: exemplars link a metric spike → a trace; trace → logs by trace_id.
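Correlation only works if every signal carries the same trace_id. A minimal sketch with structured JSON logs and an exemplar-like record; the field names, the `log_line` helper, and the IDs are all hypothetical:

```python
import json

def log_line(message, trace_id, **fields):
    """Emit one structured log record that carries the trace_id."""
    return json.dumps({"msg": message, "trace_id": trace_id, **fields})

def logs_for_trace(lines, trace_id):
    """Return the parsed log records belonging to one trace."""
    return [rec for rec in map(json.loads, lines) if rec.get("trace_id") == trace_id]

lines = [
    log_line("request started", "abc123", path="/checkout"),
    log_line("request started", "def456", path="/cart"),
    log_line("payment provider timeout", "abc123", level="error"),
]

# An exemplar is, in essence, a sample value tagged with a trace_id.
slow_sample = {"latency_seconds": 2.3, "trace_id": "abc123"}
related = logs_for_trace(lines, slow_sample["trace_id"])
print([rec["msg"] for rec in related])
```

The jump from a latency spike to the exact error message is then a lookup, not a search.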

9. Senior interview Q&A

Three pillars of observability? Metrics (trends/alerting), logs (event detail), traces (request across services). Alert on metrics, debug with traces, confirm with logs.
Pull vs push (Prometheus)? Prometheus pulls/scrapes /metrics from discovered targets. Pull = easy health (scrape fails = target down), central control. Pushgateway only for short batch jobs.
Counter vs gauge vs histogram? Counter only increases (rate it); gauge up/down; histogram buckets observations for server-side quantiles. Summary = client-side quantiles, can't aggregate.
Why is cardinality dangerous? Every label-set = a series. High-cardinality labels (user_id, request_id) explode memory/storage and can OOM Prometheus. Keep labels bounded.
How do you get p99 across all instances? Histograms + histogram_quantile(0.99, sum by (le)(rate(..._bucket[5m]))). Summaries can't be aggregated.
SLI vs SLO vs SLA vs error budget? SLI = measured ratio; SLO = target (99.9%); SLA = contract w/ penalties; error budget = 100%−SLO, the allowed unreliability to spend.
What should you alert on? Symptoms users feel (SLO burn rate: error ratio, latency), not raw resource thresholds. Every page actionable + urgent.
USE vs RED? USE (Utilization/Saturation/Errors) for resources; RED (Rate/Errors/Duration) for services. RED finds the sick service, USE the starved resource.
Why always rate() a counter? Counters reset on restart and only climb; rate()/increase() handle resets and give per-second/windowed values, which is the meaningful number.
How is Prometheus made HA / long-term? Run redundant scrapers; remote-write to Thanos/Cortex/Mimir/VictoriaMetrics for global view, dedup, downsampling, and long retention.
How do you correlate metrics, traces, logs? Shared labels + trace_id: exemplars jump from a metric to a trace; the trace's spans link to logs by trace_id. OpenTelemetry standardizes the instrumentation.
A dashboard is green but users complain. What's wrong? Likely measuring the wrong thing (averages hiding p99, server-side only, no real user signal) or missing an SLI for the failing journey. Add symptom-based SLOs.
