Incident Practices That Actually Improve Reliability
How to build incident practices that reduce future incidents, not just resolve current ones.
Goal
Build incident practices that:
- Minimize customer impact during incidents
- Extract maximum learning from every incident
- Drive systemic improvements (not just firefighting)
- Maintain team health and avoid burnout
Scope
In scope:
- Incident detection and response
- Severity classification
- Postmortem process
- Action item tracking and completion
Out of scope:
- On-call rotation design (see Reliability & Incidents)
- SLO definition (see Metrics & Impact)
- Chaos engineering (see Reliability & Incidents)
Principles
- Blame-free, but not consequence-free: People aren’t blamed, but systemic issues must be fixed
- Every incident is a gift: Production incidents reveal gaps that testing cannot find
- Action items have owners and deadlines: No “we should” without “who” and “when”
How it Works
Phase 1: Detection & Declaration (0-5 minutes)
- Automated alert or human observation identifies issue
- On-call acknowledges and assesses severity
- If Sev1/Sev2, declare incident in #incidents channel
- Incident commander assigned (may be different from on-call)
Severity definitions:
| Severity | Impact | Response |
|---|---|---|
| Sev1 | Customer-facing outage | All-hands, immediate |
| Sev2 | Significant degradation | Core team, within 15 min |
| Sev3 | Minor impact | On-call, within 1 hour |
| Sev4 | No immediate impact | Next business day |
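The severity table and the Sev1/Sev2 declaration rule can be encoded so that alerting or chat tooling applies them consistently. The following is a minimal sketch, not a prescribed implementation: the names `Severity`, `ResponsePolicy`, `RESPONSE_POLICY`, and `must_declare` are hypothetical, and modeling "immediate" as a zero-length window and "next business day" as no fixed window are assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # Customer-facing outage
    SEV2 = 2  # Significant degradation
    SEV3 = 3  # Minor impact
    SEV4 = 4  # No immediate impact


@dataclass(frozen=True)
class ResponsePolicy:
    responders: Optional[str]            # None: the table names no responder group
    respond_within: Optional[timedelta]  # None: no fixed window (next business day)
    label: str                           # Wording taken from the severity table


# Mirrors the "Response" column of the severity table.
RESPONSE_POLICY = {
    Severity.SEV1: ResponsePolicy("All-hands", timedelta(0), "immediate"),
    Severity.SEV2: ResponsePolicy("Core team", timedelta(minutes=15), "within 15 min"),
    Severity.SEV3: ResponsePolicy("On-call", timedelta(hours=1), "within 1 hour"),
    Severity.SEV4: ResponsePolicy(None, None, "next business day"),
}


def must_declare(severity: Severity) -> bool:
    """Sev1 and Sev2 are declared in the #incidents channel."""
    return severity <= Severity.SEV2
```

For example, `must_declare(Severity.SEV2)` is true, while a Sev3 is handled by on-call without a formal declaration.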
Phase 2: Response (Duration of incident)
Roles:
- Incident Commander (IC): Coordinates response, not hands-on-keyboard
- Technical Lead: Drives investigation and remediation
- Communications: Updates stakeholders (internal and external)
- Scribe: Documents timeline and decisions
IC responsibilities:
- Maintain shared understanding of current state
- Assign tasks and track progress
- Make decisions when team is stuck
- Escalate when needed
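The roles above map naturally onto a small incident record that the scribe appends to during the response. This is an illustrative sketch only: `IncidentLog`, `TimelineEntry`, and their fields are hypothetical names, and storing timestamps in UTC is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str
    is_decision: bool = False  # Marks entries that record a decision


@dataclass
class IncidentLog:
    commander: str        # Coordinates the response
    technical_lead: str   # Drives investigation and remediation
    communications: str   # Updates stakeholders
    scribe: str           # Documents timeline and decisions
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str, is_decision: bool = False,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append a timeline entry, defaulting to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note, is_decision)
        self.entries.append(entry)
        return entry

    def decisions(self) -> List[TimelineEntry]:
        """Return only the entries that record a decision."""
        return [e for e in self.entries if e.is_decision]
```

A log kept this way can later be pasted into the postmortem timeline with little reconstruction effort.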
Phase 3: Resolution & Stabilization (1-24 hours post-incident)
- Confirm customer impact resolved
- Implement temporary mitigations if needed
- Monitor for recurrence
- Schedule postmortem (within 48 hours for Sev1/2)
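The scheduling windows are simple date arithmetic, sketched below. The sketch assumes both windows are counted from the moment customer impact is resolved, and that the one-week window for holding the postmortem applies to every incident that gets one; the document states neither explicitly. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SCHEDULE_WITHIN = timedelta(hours=48)  # Sev1/2: postmortem must be scheduled by then
HOLD_WITHIN = timedelta(weeks=1)       # Postmortem itself takes place within 1 week


def schedule_deadline(resolved_at: datetime, severity: int) -> Optional[datetime]:
    """Latest time to schedule the postmortem, or None for Sev3/Sev4."""
    if severity in (1, 2):
        return resolved_at + SCHEDULE_WITHIN
    return None


def hold_deadline(resolved_at: datetime) -> datetime:
    """Latest time by which the postmortem should have been held."""
    return resolved_at + HOLD_WITHIN
```

For an incident resolved on a Monday at noon, a Sev1 postmortem would need to be on the calendar by Wednesday at noon and held by the following Monday at noon.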
Phase 4: Postmortem (Within 1 week)
Postmortem document structure:
- Summary: What happened in 2-3 sentences
- Timeline: Minute-by-minute reconstruction
- Impact: Customer and business impact, quantified
- Root cause: Contributing factors (usually multiple)
- What went well: What helped resolve faster
- What could improve: Process gaps
- Action items: Specific, assigned, deadlined
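A lightweight completeness check can catch empty sections and unowned action items before the document goes to review. The sketch below is one possible shape, assuming the postmortem is available as structured data; `Postmortem`, `ActionItem`, and `missing_parts` are hypothetical names rather than part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class Postmortem:
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    impact: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    could_improve: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)


SECTIONS = ("summary", "timeline", "impact", "contributing_factors",
            "went_well", "could_improve", "action_items")


def missing_parts(pm: Postmortem) -> List[str]:
    """List empty sections and action items lacking an owner or a deadline."""
    problems = [f"missing section: {name}" for name in SECTIONS if not getattr(pm, name)]
    for item in pm.action_items:
        if not item.owner:
            problems.append(f"action item without owner: {item.description}")
        if item.due is None:
            problems.append(f"action item without deadline: {item.description}")
    return problems
```

An empty result means every section has content and every action item is specific, assigned, and deadlined. It says nothing about the quality of that content, which still needs human review.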
Postmortem meeting:
- Review document (shared in advance)
- Focus on systemic improvements
- Assign action item owners
- Schedule follow-up for action item review
Phase 5: Follow-through (Ongoing)
- Action items tracked in central system
- Weekly review of open items
- Monthly report on incident trends
- Quarterly review of systemic patterns
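The weekly review is easier to run from a generated list than from memory. A minimal sketch, assuming action items can be exported from the central tracker as records with a due date and a completion flag; `TrackedItem`, `open_items`, and `overdue` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class TrackedItem:
    title: str
    owner: str
    due: date
    done: bool = False


def open_items(items: Iterable[TrackedItem]) -> List[TrackedItem]:
    """Open items, earliest deadline first."""
    return sorted((i for i in items if not i.done), key=lambda i: i.due)


def overdue(items: Iterable[TrackedItem], today: date) -> List[TrackedItem]:
    """Open items whose deadline has already passed."""
    return [i for i in open_items(items) if i.due < today]
```

Calling `overdue(items, date.today())` at the start of the weekly review gives the list of items that need a new deadline, a new owner, or escalation.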
Rituals & Cadence
| Ritual | Frequency | Duration | Participants |
|---|---|---|---|
| Postmortem meeting | After Sev1/2 | 60 min | Involved engineers + leadership |
| Action item review | Weekly | 30 min | Tech leads |
| Incident trends review | Monthly | 30 min | Engineering leadership |
| On-call retrospective | Quarterly | 60 min | All on-call participants |
Artifacts
- Incident template: Standard format for incident channels
- Postmortem template: Structured document format
- Action item tracker: Central board with owners and deadlines
- Incident dashboard: Real-time and historical incident metrics
Metrics
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Time to detect (TTD) | Under 5 min | Under 15 min | Over 30 min |
| Time to mitigate (TTM) | Under 30 min | Under 1 hour | Over 2 hours |
| Postmortem completion rate | 100% (Sev1/2) | 90% | 75% |
| Action item completion rate | 80% within 30 days | 60% | 40% |
| Repeat incidents (same root cause) | Under 10% | Under 20% | Over 20% |
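The thresholds can be evaluated mechanically for a dashboard. The sketch below makes one loud assumption: the table leaves some ranges undefined (for example, a TTD between 15 and 30 minutes), so anything between the target and the critical bound is reported as "warning". The metric keys, `Threshold`, and `status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    target: float
    critical: float
    higher_is_better: bool = False


# Target and critical bounds copied from the metrics table.
THRESHOLDS = {
    "ttd_minutes": Threshold(target=5, critical=30),
    "ttm_minutes": Threshold(target=30, critical=120),
    "postmortem_completion_pct": Threshold(target=100, critical=75, higher_is_better=True),
    "action_item_completion_pct": Threshold(target=80, critical=40, higher_is_better=True),
    "repeat_incident_pct": Threshold(target=10, critical=20),
}


def status(metric: str, value: float) -> str:
    """Return 'target', 'warning', or 'critical' for a measured value."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value >= t.target:
            return "target"
        if value <= t.critical:
            return "critical"
    else:
        if value < t.target:
            return "target"
        if value > t.critical:
            return "critical"
    # Assumption: everything between the two bounds counts as a warning.
    return "warning"
```

Under this reading, a 20-minute time to detect is a warning and a 45-minute one is critical.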
Guardrails
- No postmortem, no close: Sev1/2 incidents stay open until the postmortem is complete
- Action items have deadlines: No unbounded “we should” items
- Weekly action item review: Stale items are escalated to leadership
- Blameless but honest: Name contributing factors, even if uncomfortable
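The "no postmortem, no close" rule is simple enough to enforce in tooling rather than by convention. A minimal sketch, assuming the incident tracker exposes severity and a postmortem-complete flag; `Incident` and `can_close` are hypothetical names, and letting Sev3/Sev4 close without a postmortem follows from the guardrail naming only Sev1/2.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    severity: int                      # 1 (highest) through 4
    postmortem_complete: bool = False


def can_close(incident: Incident) -> bool:
    """Sev1/2 incidents may close only once their postmortem is complete."""
    if incident.severity <= 2:
        return incident.postmortem_complete
    # The guardrail covers Sev1/2 only, so lower severities are not blocked.
    return True
```

A tracker hook could call `can_close` before allowing a status change and reject the transition with a pointer to the postmortem template.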
Incident Handling
Signs of breakdown:
- Postmortems become perfunctory
- Action items pile up without completion
- Same root causes repeat
- Team dreads on-call
Response:
- Leadership review of recent incidents
- Prioritize completing action items over new features
- Address on-call burden (rotation size, tooling, processes)
- Celebrate improvements, not just firefighting
Common Failure Modes
- Postmortem theater: Documents written but not read, meetings held but changes not made. Fixed by tracking repeat incident rate and action item completion.
- Action item graveyard: Items assigned but never completed. Fixed by weekly review, deadline enforcement, and linking to sprint planning.
- Blame avoidance: Postmortems avoid naming contributing factors to protect individuals. Fixed by emphasizing systemic factors and leadership modeling blame-free language.
Change Management
- Feedback loop: Quarterly on-call retrospective
- Review cadence: Semi-annual review of incident process
- Change process: Process changes proposed as action items from incidents