
TOOLBOX · SRE · CURATED + REVIEWED

The SRE Toolbox.

Curated reliability toolkit — observability, alerting, incident response, SLOs, chaos, and load testing. Verdict tags: ★ default pick, solid, niche, commercial. Principle: instrument with open standards (OpenTelemetry), alert on symptoms (SLO burn), and practice failure before it happens.

Metrics & dashboards

Tool | Verdict | Review
Prometheus | ★ default | The open metrics standard: pull-based, PromQL, exporters. Watch cardinality.
Grafana | ★ default | Dashboards over Prometheus/Loki/Tempo + unified alerting. Provision as code.
VictoriaMetrics / Thanos / Mimir | solid | Long-term, HA, multi-tenant Prometheus storage at scale.
Datadog / Grafana Cloud | commercial | Managed all-in-one: fast to value, expensive at scale.
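
"Watch cardinality" is easier to act on with a number attached. The sketch below estimates the worst-case series count for a single metric as the product of distinct values per label. The label names and counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for an HTTP request counter.
bounded = {"method": 5, "status": 8, "route": 40}
with_user_id = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))       # 1600
print(worst_case_series(with_user_id))  # 160000000
```

One unbounded label (a user ID, a request ID, a raw URL) multiplies everything else, which is why such values usually belong in logs or traces rather than in metric labels.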

Logs & traces

Tool | Verdict | Review
OpenTelemetry | ★ (instrument) | Vendor-neutral metrics/logs/traces SDKs + Collector. Instrument once, export anywhere. Always start here.
Loki | ★ default | Log aggregation that indexes labels, not full text: cheap, Grafana-native (LogQL).
Tempo / Jaeger | solid | Distributed tracing backends. Tempo = cheap object-store traces, Grafana-native.
ELK / OpenSearch | solid | Full-text log search + analytics. Powerful, heavier to operate than Loki.

Alerting & on-call

Tool | Verdict | Review
Alertmanager | ★ default | Prometheus's alert router: dedupe, group, silence, route by severity. The open default.
PagerDuty / Opsgenie | commercial | On-call scheduling, escalation, paging. Industry standard for serious on-call.
Grafana OnCall | solid | Open-source on-call/escalation that pairs with Grafana alerting.
Alert on symptoms, not causes. Page on user-facing pain (SLO burn rate: errors, latency), not "CPU 80%". Every page must be urgent + actionable; everything else is a ticket. Alert fatigue is itself an outage risk.
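
To make "SLO burn rate" concrete, here is a minimal sketch of a multi-window page decision. Burn rate is the observed error ratio divided by the error ratio the SLO allows. The 14.4 threshold and the long/short window pairing follow a commonly cited convention for a 30-day SLO (roughly 2% of the budget spent in one hour); treat both as assumptions to tune, not fixed values.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (say 1h) shows the burn is significant; the short
    window (say 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# 2% errors in both windows against a 99.9% SLO: burning about 20x budget.
print(should_page(0.02, 0.02, slo_target=0.999))    # True
# The long window is still hot but the short window has recovered: no page.
print(should_page(0.02, 0.0005, slo_target=0.999))  # False
```

Requiring both windows keeps pages urgent and actionable: a brief spike that has already cleared does not wake anyone, and a slow burn can go to a ticket instead.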

SLO tracking & error budgets

Tool | Verdict | Review
Sloth | ★ default | Generate Prometheus SLO rules + multi-window burn-rate alerts from simple specs. Open, lightweight.
Pyrra | solid | SLO definitions + UI on Prometheus. Nice dashboards.
Nobl9 | commercial | Dedicated SLO platform across many data sources. For orgs formalizing SLOs.
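
Whichever tool tracks it, the error budget is plain arithmetic. The sketch below assumes a 30-day window and a request-based SLO; the function names are illustrative and do not come from any of the tools above.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the window for an availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget left (negative once it is overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: need slo_target < 1.0 and traffic > 0")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days allows about 43.2 minutes of full outage.
print(round(error_budget_minutes(0.999), 1))               # 43.2
# 250 failures out of 1,000,000 requests uses a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))   # 0.75
```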

Incident management

Tool | Verdict | Review
incident.io / FireHydrant / Rootly | commercial | Slack-native incident orchestration: roles, timeline, comms, postmortems. Big time-saver mid-incident.
Atlassian Statuspage | solid | Public status communication during outages.
Backstage | solid | Developer portal: service catalog, ownership, runbooks, scorecards. Know who owns what at 3am.

Chaos engineering

Tool | Verdict | Review
Chaos Mesh | ★ (k8s) | CNCF chaos for Kubernetes: pod/network/IO/time faults via CRDs. Default for k8s.
LitmusChaos | solid | k8s chaos with a large experiment hub + GitOps workflows.
Gremlin | commercial | Polished SaaS chaos with safety guardrails. Easiest to adopt for teams new to chaos.
Chaos needs a hypothesis + blast radius. Don't break prod randomly. State a hypothesis ("losing one AZ stays within SLO"), limit blast radius, run in a controlled window with an abort, then verify. It's an experiment, not vandalism.
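
The shape of such an experiment can be sketched in a few lines. This is a tool-agnostic illustration, not the API of Chaos Mesh, LitmusChaos, or Gremlin: the `Experiment` fields and the `run` function are hypothetical, and the callables stand in for whatever your chaos tool and monitoring actually expose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                        # e.g. "losing one AZ stays within SLO"
    blast_radius: str                      # e.g. "one AZ, 10% of traffic"
    inject: Callable[[], None]             # start the fault
    rollback: Callable[[], None]           # stop the fault (the abort path)
    steady_state_ok: Callable[[], bool]    # is the SLO still being met?
    checks: int = 5                        # health checks during the window


def run(exp: Experiment) -> str:
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state_ok():
        return "skipped: system not healthy before injection"
    exp.inject()
    try:
        for _ in range(exp.checks):
            if not exp.steady_state_ok():
                return "aborted: hypothesis falsified"
        return "passed: hypothesis held"
    finally:
        # Always roll back, whether the run passed, aborted, or raised.
        exp.rollback()
```

In a real run, the loop would pause between checks and `steady_state_ok` would query the same SLO signal that pages humans, so the experiment and the alerting agree on what "within SLO" means.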

Load & performance testing

Tool | Verdict | Review
k6 | ★ default | Scriptable load testing in JS, great CI integration + output. The modern default.
Locust | solid | Python-defined load tests, distributed, nice web UI. Good if your team is Python.
Gatling / JMeter | solid / legacy | Gatling = high-perf Scala/Java; JMeter = old, GUI-heavy, ubiquitous.
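
CI integration for a load test usually comes down to a pass/fail gate on a latency percentile. The tools above have their own mechanisms for this (k6 calls them thresholds); the sketch below shows the underlying idea in plain Python, assuming you already have a list of measured latencies. The function names and the budget value are illustrative.

```python
import statistics


def p95(latencies_ms: list[float]) -> float:
    """95th percentile latency; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=100)[94]


def passes_gate(latencies_ms: list[float], p95_budget_ms: float) -> bool:
    """True if the run stays within the latency budget."""
    return p95(latencies_ms) <= p95_budget_ms


# Hypothetical results: 90 fast requests and 10 slow ones.
samples = [100.0] * 90 + [1000.0] * 10
print(passes_gate(samples, p95_budget_ms=500.0))
```

A CI job would fail the build when the gate returns False, so a latency regression blocks the merge instead of surfacing in production.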

A sensible default stack

  • Instrument: OpenTelemetry SDKs + Collector everywhere.
  • Metrics + dashboards + alerting: Prometheus + Grafana + Alertmanager (long-term via Mimir/VictoriaMetrics).
  • Logs + traces: Loki + Tempo (Grafana-native).
  • SLOs: Sloth-generated burn-rate alerts; page on symptoms only.
  • On-call + incidents: PagerDuty/Grafana OnCall + incident.io/FireHydrant + a public status page.
  • Practice failure: k6 load tests in CI + Chaos Mesh game days.
  • Ownership: Backstage catalog + runbooks linked from alerts.
Observability is for unknown-unknowns. Dashboards answer questions you anticipated; high-cardinality traces/logs + good correlation let you ask new questions during a novel incident. Instrument richly, alert sparingly.
© cvam — written in plaintext, served warm