Curated reliability toolkit: observability, alerting, incident response, SLOs, chaos, and load testing.
Verdict tags: ★ default pick, solid, niche, commercial.
Principle: instrument with open standards (OpenTelemetry), alert on symptoms (SLO burn), and
practice failure before it happens.
Metrics & dashboards
| Tool | Verdict | Review |
| --- | --- | --- |
| Prometheus | ★ default | The open metrics standard — pull-based, PromQL, exporters. Watch cardinality. |
| Grafana | ★ default | Dashboards over Prometheus/Loki/Tempo + unified alerting. Provision as code. |
| VictoriaMetrics / Thanos / Mimir | solid | Long-term, HA, multi-tenant Prometheus storage at scale. |
| Datadog / Grafana Cloud | commercial | Managed all-in-one — fast to value, expensive at scale. |
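"Watch cardinality" in numbers: a rough upper bound on the time series one metric can produce is the product of the distinct values of each label. The sketch below is plain arithmetic, not a Prometheus API; the label names and counts are hypothetical.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for an HTTP request counter.
bounded = {"method": 5, "status": 6, "route": 40}
with_user_id = {**bounded, "user_id": 100_000}  # one unbounded label added

print(worst_case_series(bounded))       # 1200
print(worst_case_series(with_user_id))  # 120000000
```

A single unbounded label (user ID, request ID, raw URL) multiplies everything else, which is why such values usually belong in logs or traces rather than metric labels.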
Logs & traces
| Tool | Verdict | Review |
| --- | --- | --- |
| OpenTelemetry | ★ (instrument) | Vendor-neutral metrics/logs/traces SDKs + Collector. Instrument once, export anywhere. Always start here. |
| Loki | ★ default | Log aggregation indexing labels not full text — cheap, Grafana-native (LogQL). |
| Tempo / Jaeger | solid | Distributed tracing backends. Tempo = cheap object-store traces, Grafana-native. |
| ELK / OpenSearch | solid | Full-text log search + analytics. Powerful, heavier to operate than Loki. |
Alerting & on-call
| Tool | Verdict | Review |
| --- | --- | --- |
| Alertmanager | ★ default | Prometheus's alert router — dedupe, group, silence, route by severity. The open default. |
| PagerDuty / Opsgenie | commercial | On-call scheduling, escalation, paging. Industry standard for serious on-call. |
| Grafana OnCall | solid | Open-source on-call/escalation that pairs with Grafana alerting. |
Alert on symptoms, not causes
Page on user-facing pain (SLO burn rate: errors, latency), not "CPU 80%". Every page must be
urgent + actionable; everything else is a ticket. Alert fatigue is itself an outage risk.
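A minimal sketch of what "page on SLO burn rate" can mean in practice. Burn rate is the observed error ratio divided by the error budget ratio (1 - SLO). Paging only when both a long and a short window exceed the threshold filters out blips and incidents that have already recovered. The 14.4 threshold is the commonly cited value at which one hour of burn consumes 2% of a 30-day budget. The `Window` type, function names, and request counts are illustrative assumptions, not any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    errors: int
    total: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the budget ratio (1 - SLO)."""
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    return (window.errors / window.total) / (1.0 - slo_target)


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold."""
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


SLO = 0.999  # hypothetical 99.9% availability target

# Long window (e.g. 1h) and short window (e.g. 5m) both burning ~20x: page.
print(should_page(Window(2_000, 100_000), Window(200, 10_000), SLO))  # True

# Long window still hot, short window recovered: do not page.
print(should_page(Window(2_000, 100_000), Window(5, 10_000), SLO))    # False
```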
SLO tracking & error budgets
| Tool | Verdict | Review |
| --- | --- | --- |
| Sloth | ★ default | Generate Prometheus SLO rules + multi-window burn-rate alerts from simple specs. Open, lightweight. |
| Pyrra | solid | SLO definitions + UI on Prometheus. Nice dashboards. |
| Nobl9 | commercial | Dedicated SLO platform across many data sources. For orgs formalizing SLOs. |
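The arithmetic these tools automate is small. As a rough sketch (function names and sample numbers are assumptions, and the SLO target is assumed to be below 100%), an error budget is the share of the window the SLO allows to be bad, and "budget remaining" is how much of that allowance is unspent:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    return 1.0 - (total - good) / allowed_bad


# 99.9% over 30 days allows about 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))                  # 43.2

# Hypothetical month: 1,000,000 requests, 400 bad, 1,000 allowed.
print(round(budget_remaining(0.999, 999_600, 1_000_000), 2))  # 0.6
```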
Incident management
| Tool | Verdict | Review |
| --- | --- | --- |
| incident.io / FireHydrant / Rootly | commercial | Slack-native incident orchestration — roles, timeline, comms, postmortems. Big time-saver mid-incident. |
| Atlassian Statuspage | solid | Public status communication during outages. |
| Backstage | solid | Developer portal — service catalog, ownership, runbooks, scorecards. Know who owns what at 3am. |
Chaos engineering
| Tool | Verdict | Review |
| --- | --- | --- |
| Chaos Mesh | ★ (k8s) | CNCF chaos for Kubernetes — pod/network/IO/time faults via CRDs. Default for k8s. |
| LitmusChaos | solid | k8s chaos with a large experiment hub + GitOps workflows. |
| Gremlin | commercial | Polished SaaS chaos with safety guardrails. Easiest to adopt for teams new to chaos. |
Chaos needs a hypothesis + blast radius
Don't break prod randomly. State a hypothesis ("losing one AZ stays within SLO"), limit blast
radius, run in a controlled window with an abort, then verify. It's an experiment, not vandalism.
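One way to keep an experiment honest is to write the hypothesis, blast radius, abort condition, and rollback down as data before anything is injected. The harness below is a hypothetical sketch, not the Chaos Mesh, LitmusChaos, or Gremlin API; the fault and the health check are stand-in callables and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                       # what we expect to stay true
    blast_radius: str                     # what is allowed to be affected
    inject: Callable[[], None]            # start the fault
    rollback: Callable[[], None]          # stop the fault (always runs)
    steady_state_ok: Callable[[], bool]   # e.g. "error ratio within SLO"
    checks: int = 3                       # health checks while the fault is active


def run(exp: Experiment) -> str:
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state_ok():
        return "skipped: system unhealthy before injection"
    exp.inject()
    try:
        for _ in range(exp.checks):
            if not exp.steady_state_ok():
                return "aborted: hypothesis refuted"
        return "passed: hypothesis held"
    finally:
        exp.rollback()  # abort path and success path both clean up


# Stand-in system state for illustration only.
state = {"fault_active": False, "error_ratio": 0.0002}


def inject_fault() -> None:
    state["fault_active"] = True
    state["error_ratio"] = 0.0007  # degraded, still under the 0.001 budget


def remove_fault() -> None:
    state["fault_active"] = False


exp = Experiment(
    hypothesis="losing one AZ stays within SLO",
    blast_radius="one AZ",
    inject=inject_fault,
    rollback=remove_fault,
    steady_state_ok=lambda: state["error_ratio"] < 0.001,
)
print(run(exp))  # passed: hypothesis held
```

A real run would wait between health checks and read the steady-state signal from the same SLO metrics that drive paging.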
Load & performance testing
| Tool | Verdict | Review |
| --- | --- | --- |
| k6 | ★ default | Scriptable load testing in JS, great CI integration + output. The modern default. |
| Locust | solid | Python-defined load tests, distributed, nice web UI. Good if your team is Python. |
| Gatling / JMeter | solid/legacy | Gatling = high-perf Scala/Java; JMeter = old, GUI-heavy, ubiquitous. |
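Whatever tool generates the load, the CI value comes from turning results into a pass/fail gate. k6 does this with its thresholds option. The sketch below shows the same idea in plain Python; it is not k6 or Locust code, and the 500 ms and 1% limits and the sample data are assumptions.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def check_thresholds(samples_ms: list[float], failed: int,
                     p95_budget_ms: float = 500.0,
                     max_error_ratio: float = 0.01) -> list[str]:
    """Return the violated thresholds; an empty list means the run passes."""
    violations = []
    if p95(samples_ms) > p95_budget_ms:
        violations.append("p95 latency")
    if failed / len(samples_ms) > max_error_ratio:
        violations.append("error ratio")
    return violations


# Hypothetical run: 100 requests, mostly fast, a slow tail, 3 failures.
latencies = [120.0] * 90 + [900.0] * 10
print(check_thresholds(latencies, failed=3))  # ['p95 latency', 'error ratio']
```

In a pipeline, a non-empty result would fail the build, so a latency regression blocks the merge instead of surfacing in production.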
A sensible default stack
- Instrument: OpenTelemetry SDKs + Collector everywhere.
- Metrics + dashboards + alerting: Prometheus + Grafana + Alertmanager (long-term via Mimir/VictoriaMetrics).
- Logs + traces: Loki + Tempo (Grafana-native).
- SLOs: Sloth-generated burn-rate alerts; page on symptoms only.
- On-call + incidents: PagerDuty/Grafana OnCall + incident.io/FireHydrant + a public status page.
- Practice failure: k6 load tests in CI + Chaos Mesh game days.
- Ownership: Backstage catalog + runbooks linked from alerts.
Observability is for unknown-unknowns
Dashboards answer questions you anticipated; high-cardinality traces/logs + good correlation let
you ask new questions during a novel incident. Instrument richly, alert sparingly.
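A toy illustration of "asking new questions": if every event carries wide, high-cardinality context (trace ID, build, region, customer), any field can become a group-by key mid-incident without new instrumentation. The events and field names below are hypothetical; a real setup would run the equivalent query in its log or trace backend.

```python
from collections import Counter
from typing import Callable

Event = dict[str, object]


def breakdown(events: list[Event], field: str,
              predicate: Callable[[Event], bool]) -> list[tuple[object, int]]:
    """Count matching events grouped by an arbitrary field, most frequent first."""
    counts = Counter(e.get(field, "unknown") for e in events if predicate(e))
    return counts.most_common()


# Hypothetical wide events, one per request.
events: list[Event] = [
    {"trace_id": "t1", "status": 500, "region": "eu-west-1", "build": "v42", "customer": "acme"},
    {"trace_id": "t2", "status": 200, "region": "eu-west-1", "build": "v41", "customer": "acme"},
    {"trace_id": "t3", "status": 500, "region": "us-east-1", "build": "v42", "customer": "globex"},
    {"trace_id": "t4", "status": 200, "region": "us-east-1", "build": "v41", "customer": "initech"},
    {"trace_id": "t5", "status": 500, "region": "us-east-1", "build": "v42", "customer": "initech"},
]


def is_error(e: Event) -> bool:
    return int(e["status"]) >= 500


# Question nobody built a dashboard for: are errors tied to a region or a build?
print(breakdown(events, "region", is_error))  # [('us-east-1', 2), ('eu-west-1', 1)]
print(breakdown(events, "build", is_error))   # [('v42', 3)]
```

Here region spreads the errors while build isolates them to one release. The trace IDs on the matching events are what let you jump from this count to the individual traces.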