SRE — cvam.sight

Curated reliability toolkit — observability, alerting, incident response, SLOs, chaos, and load testing. Verdict tags: ★ default pick, solid, niche, commercial. Principle: instrument with open standards (OpenTelemetry), alert on symptoms (SLO burn), and practice failure before it happens.

Metrics & dashboards

Tool	Verdict	Review
Prometheus	★ default	The open metrics standard — pull-based, PromQL, exporters. Watch cardinality.
Grafana	★ default	Dashboards over Prometheus/Loki/Tempo + unified alerting. Provision as code.
VictoriaMetrics / Thanos / Mimir	solid	Long-term, HA, multi-tenant Prometheus storage at scale.
Datadog / Grafana Cloud	commercial	Managed all-in-one — fast to value, expensive at scale.
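
The "watch cardinality" warning for Prometheus can be made concrete with a little arithmetic. The sketch below is only an illustration: the label names and value counts are hypothetical, and the product is an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def estimate_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each distinct combination of label values is its own series, so the
    worst case is the product of the distinct-value counts per label.
    """
    return prod(label_values.values())


# Hypothetical label sets for an HTTP request counter.
bounded = {"method": 5, "status": 6, "route": 40}
unbounded = {**bounded, "user_id": 100_000}  # per-user label: the classic mistake

bounded_series = estimate_series(bounded)      # 5 * 6 * 40
unbounded_series = estimate_series(unbounded)  # same, times 100_000 users

print(bounded_series, unbounded_series)
```

One unbounded label multiplies every other dimension, which is why identifiers such as user or request IDs usually belong in logs or traces rather than in metric labels.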

Logs & traces

Tool	Verdict	Review
OpenTelemetry	★ (instrument)	Vendor-neutral metrics/logs/traces SDKs + Collector. Instrument once, export anywhere. Always start here.
Loki	★ default	Log aggregation that indexes labels, not full text. Cheap and Grafana-native (LogQL).
Tempo / Jaeger	solid	Distributed tracing backends. Tempo = cheap object-store traces, Grafana-native.
ELK / OpenSearch	solid	Full-text log search + analytics. Powerful, heavier to operate than Loki.
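
To illustrate what "indexing labels, not full text" means, here is a toy in-memory store. It is a conceptual sketch only, not how Loki is implemented: the class and method names are hypothetical, and the point is that only the label sets are indexed while line contents are scanned at query time.

```python
from collections import defaultdict


class LabelIndexedLogs:
    """Toy log store: streams are keyed by label set; lines are not indexed."""

    def __init__(self) -> None:
        # frozenset of (label, value) pairs -> raw log lines for that stream
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        wanted = set(selector.items())
        hits: list[str] = []
        for stream_labels, lines in self._streams.items():
            # Cheap step: pick streams by label match (the only "index").
            if wanted <= stream_labels:
                # Expensive step: brute-force scan of the selected lines.
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedLogs()
store.push({"app": "api", "env": "prod"}, "GET /cart 200")
store.push({"app": "api", "env": "prod"}, "GET /pay 504 upstream timeout")
store.push({"app": "worker", "env": "prod"}, "job 17 timeout")

api_timeouts = store.query({"app": "api"}, contains="timeout")
print(api_timeouts)
```

The trade-off this models: a small index keeps storage cheap, but a query with a broad label selector has to scan many lines, which is where full-text engines such as ELK / OpenSearch do better.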

Alerting & on-call

Tool	Verdict	Review
Alertmanager	★ default	Prometheus's alert router — dedupe, group, silence, route by severity. The open default.
PagerDuty / Opsgenie	commercial	On-call scheduling, escalation, paging. Industry standard for serious on-call.
Grafana OnCall	solid	Open-source on-call/escalation that pairs with Grafana alerting.
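
The dedupe, group, and route steps that an alert router performs can be sketched in a few lines. This is a simplified model for illustration, not Alertmanager's actual implementation or configuration format; the receiver names and the `group_by` labels are assumptions.

```python
from collections import defaultdict

Alert = dict[str, str]

# Hypothetical severity -> receiver mapping.
ROUTES = {"critical": "pager", "warning": "ticket"}


def dedupe(alerts: list[Alert]) -> list[Alert]:
    """Drop alerts whose full label set has already been seen."""
    seen: set[frozenset] = set()
    unique: list[Alert] = []
    for alert in alerts:
        fingerprint = frozenset(alert.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(alert)
    return unique


def group_and_route(alerts: list[Alert], group_by=("alertname", "cluster")):
    """Return {(receiver, group_key): [alerts]}: one notification per key."""
    grouped: dict[tuple, list[Alert]] = defaultdict(list)
    for alert in dedupe(alerts):
        # Unknown or missing severities fall back to a ticket, never a page.
        receiver = ROUTES.get(alert.get("severity", ""), "ticket")
        group_key = tuple(alert.get(label, "") for label in group_by)
        grouped[(receiver, group_key)].append(alert)
    return dict(grouped)


alerts = [
    {"alertname": "HighErrorBurn", "cluster": "eu-1", "instance": "api-1", "severity": "critical"},
    {"alertname": "HighErrorBurn", "cluster": "eu-1", "instance": "api-1", "severity": "critical"},  # duplicate
    {"alertname": "HighErrorBurn", "cluster": "eu-1", "instance": "api-2", "severity": "critical"},
    {"alertname": "DiskFilling", "cluster": "eu-1", "instance": "db-1", "severity": "warning"},
]
notifications = group_and_route(alerts)
for (receiver, key), members in notifications.items():
    print(receiver, key, len(members))
```

Four raw alerts collapse into two notifications here: one page covering both API instances and one ticket for the disk warning.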

Alert on symptoms, not causes. Page on user-facing pain (SLO burn rate on errors and latency), not "CPU 80%". Every page must be urgent + actionable; everything else is a ticket. Alert fatigue is itself an outage risk.
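
As a rough illustration of paging on SLO burn rate, the sketch below computes the burn rate (observed error ratio divided by the error budget ratio) and pages only when both a long and a short window exceed a threshold. The 14.4 threshold corresponds to spending 2% of a 30-day budget in one hour; the window sizes and request counts are assumptions for the example.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_ratio = errors / total
    budget_ratio = 1.0 - slo_target
    return error_ratio / budget_ratio


def should_page(long_window: tuple[int, int],
                short_window: tuple[int, int],
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    Each window is an (errors, total_requests) pair.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


SLO = 0.999  # assumed availability target

# Ongoing incident: roughly 2% errors over 1h and 2.5% over the last 5m.
ongoing = should_page((2_000, 100_000), (200, 8_000), SLO)

# Already recovered: the 1h window is still bad, the last 5m is clean.
recovered = should_page((2_000, 100_000), (1, 8_000), SLO)

print(ongoing, recovered)
```

Note that nothing here looks at CPU, memory, or any other cause; the only input is the ratio of failed to total requests as users experienced them.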

SLO tracking & error budgets

Tool	Verdict	Review
Sloth	★ default	Generate Prometheus SLO rules + multi-window burn-rate alerts from simple specs. Open, lightweight.
Pyrra	solid	SLO definitions + UI on Prometheus. Nice dashboards.
Nobl9	commercial	Dedicated SLO platform across many data sources. For orgs formalizing SLOs.
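
The error budget behind these tools is simple arithmetic, sketched below for a time-based availability SLO. The 30-day window and the figures are assumptions for the example; the tools above typically compute this from request ratios rather than minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget left; negative means the SLO is blown."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


budget = error_budget_minutes(0.999)               # 0.1% of 43,200 minutes
remaining = budget_remaining(0.999, bad_minutes=10.8)

print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A 99.9% target over 30 days leaves about 43 minutes of budget, which is why a single long incident can consume most of a month's allowance.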

Incident management

Tool	Verdict	Review
incident.io / FireHydrant / Rootly	commercial	Slack-native incident orchestration — roles, timeline, comms, postmortems. Big time-saver mid-incident.
Atlassian Statuspage	solid	Public status communication during outages.
Backstage	solid	Developer portal — service catalog, ownership, runbooks, scorecards. Know who owns what at 3am.

Chaos engineering

Tool	Verdict	Review
Chaos Mesh	★ (k8s)	CNCF chaos for Kubernetes — pod/network/IO/time faults via CRDs. Default for k8s.
LitmusChaos	solid	k8s chaos with a large experiment hub + GitOps workflows.
Gremlin	commercial	Polished SaaS chaos with safety guardrails. Easiest to adopt for teams new to chaos.

Chaos needs a hypothesis + blast radius. Don't break prod randomly. State a hypothesis ("losing one AZ stays within SLO"), limit blast radius, run in a controlled window with an abort, then verify. It's an experiment, not vandalism.
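
The hypothesis, blast radius, abort, and verify steps can be sketched as a small harness. This is a tool-agnostic illustration, not the API of Chaos Mesh, LitmusChaos, or Gremlin; `inject`, `rollback`, and the stream of observed error ratios are hypothetical stand-ins for real fault injection and monitoring.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Experiment:
    hypothesis: str          # what we expect to stay true
    blast_radius: str        # what the fault is allowed to touch
    max_error_ratio: float   # abort threshold derived from the SLO


def run_experiment(exp: Experiment,
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   observe: Iterable[float]) -> str:
    """Inject a fault, watch the steady-state signal, abort on breach."""
    inject()
    try:
        for error_ratio in observe:
            if error_ratio > exp.max_error_ratio:
                return "aborted: hypothesis refuted"
        return "passed: hypothesis held"
    finally:
        rollback()  # always undo the fault, even on abort or exception


events: list[str] = []
exp = Experiment(
    hypothesis="losing one AZ stays within SLO",
    blast_radius="one AZ of the staging checkout service",
    max_error_ratio=0.01,
)
result = run_experiment(
    exp,
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    observe=[0.001, 0.002, 0.05, 0.001],  # third sample breaches the limit
)
print(result, events)
```

The `finally` clause is the important design choice: the fault is rolled back no matter how the observation loop ends.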

Load & performance testing

Tool	Verdict	Review
k6	★ default	Scriptable load testing in JS, great CI integration + output. The modern default.
Locust	solid	Python-defined load tests, distributed, nice web UI. Good if your team is Python.
Gatling / JMeter	solid/legacy	Gatling = high-perf Scala/Java; JMeter = old, GUI-heavy, ubiquitous.
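
Whichever tool generates the load, a CI run usually ends in a pass/fail gate on the results. The sketch below shows one such gate using a nearest-rank 95th percentile; the latency samples and limits are made up, and real tools may use a different percentile method.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def gate(latencies_ms: list[float], failed: int,
         p95_limit_ms: float = 300.0, max_failure_ratio: float = 0.01) -> bool:
    """True if the run meets both the latency and the failure-rate limits."""
    failure_ratio = failed / len(latencies_ms)
    return (percentile(latencies_ms, 95) <= p95_limit_ms
            and failure_ratio <= max_failure_ratio)


# Hypothetical request latencies from two runs, in milliseconds.
with_outlier = [120, 95, 110, 480, 105, 99, 130, 101, 97, 115]
healthy = [120, 95, 110, 125, 105, 99, 130, 101, 97, 115]

slow_run_ok = gate(with_outlier, failed=0)
healthy_run_ok = gate(healthy, failed=0)

print(slow_run_ok, healthy_run_ok)
```

Gating on a high percentile rather than the mean is deliberate: the average of the first run looks acceptable while the slowest requests are far over the limit.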

A sensible default stack

Instrument: OpenTelemetry SDKs + Collector everywhere.
Metrics + dashboards + alerting: Prometheus + Grafana + Alertmanager (long-term via Mimir/VictoriaMetrics).
Logs + traces: Loki + Tempo (Grafana-native).
SLOs: Sloth-generated burn-rate alerts; page on symptoms only.
On-call + incidents: PagerDuty/Grafana OnCall + incident.io/FireHydrant + a public status page.
Practice failure: k6 load tests in CI + Chaos Mesh game days.
Ownership: Backstage catalog + runbooks linked from alerts.

Observability is for unknown-unknowns. Dashboards answer questions you anticipated; high-cardinality traces/logs + good correlation let you ask new questions during a novel incident. Instrument richly, alert sparingly.
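
"Good correlation" mostly comes down to a shared identifier across signals. A minimal sketch, assuming every span and log record carries the same `trace_id` field (the field names and sample data are hypothetical):

```python
from collections import defaultdict


def logs_for_slow_traces(spans: list[dict], logs: list[dict],
                         min_duration_ms: float) -> dict[str, list[str]]:
    """Map each slow trace's ID to the log messages emitted during it."""
    logs_by_trace: dict[str, list[str]] = defaultdict(list)
    for record in logs:
        logs_by_trace[record["trace_id"]].append(record["message"])

    return {
        span["trace_id"]: logs_by_trace.get(span["trace_id"], [])
        for span in spans
        if span["duration_ms"] >= min_duration_ms
    }


spans = [
    {"trace_id": "a1", "name": "POST /checkout", "duration_ms": 2300},
    {"trace_id": "b2", "name": "POST /checkout", "duration_ms": 90},
]
logs = [
    {"trace_id": "a1", "message": "retrying payment provider (attempt 3)"},
    {"trace_id": "a1", "message": "payment provider timeout"},
    {"trace_id": "b2", "message": "payment ok"},
]

slow = logs_for_slow_traces(spans, logs, min_duration_ms=1000)
print(slow)
```

Nobody had to anticipate "payment retries" as a dashboard panel; the question was asked after the fact, which is only possible because the identifier was recorded on both signals.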

The SRE Toolbox.