Reliability & Incidents
On-call practices, incident response, postmortems, SLOs, and chaos engineering: how to keep systems healthy and learn from failures.
Topics
On-Call Practices
Rotation design, escalation paths, runbooks, and burnout prevention.
Incident Response
Severity classification, incident commanders, communication templates, and war rooms.
Postmortems
Blameless postmortem format, action item tracking, and pattern detection.
SLOs & SLIs
Setting meaningful SLOs, error budgets, and alerting strategies.
Chaos Engineering
Game days, failure injection, and resilience testing.