Part IV (e) - Incident Command & On-Call Engineering
HARD TRUTH: INCIDENTS EXPOSE REAL LEADERSHIP
When production is unstable, technical skill matters, but leadership quality matters more.
Teams that perform well in incidents do three things consistently:
- Establish command quickly
- Communicate clearly
- Make reversible decisions under pressure
Incident maturity is not heroics. It is repeatable system behavior under pressure.
SEVERITY MODEL
Define SEV-1 to SEV-4 using objective criteria:
- User impact percentage
- Revenue or operational impact
- Time sensitivity
- Compliance/security implications
Each severity should map to:
- Maximum response start time
- Required roles
- Required stakeholder groups
- Update cadence
Failure pattern: vague severities turn incidents into political debate while customers are still down.
Example Severity Policy
| Severity | Example Impact | Response Start | Update Cadence | Required Roles |
|---|---|---|---|---|
| SEV-1 | Core revenue path or login unavailable for most users | 5 min | Every 10 min | Commander, Comms, Ops, Scribe |
| SEV-2 | Major feature degraded or large region affected | 15 min | Every 15 min | Commander, Ops, Scribe |
| SEV-3 | Partial feature degradation with workaround | 30 min | Every 30 min | Primary on-call, owner |
| SEV-4 | Low-risk issue or internal degradation only | Business hours | As needed | Service owner |
Use examples, not vague adjectives, when defining severity. "Checkout error rate above 20% for 10 minutes" is usable. "Serious outage" is not.
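A severity policy works best when automation and humans read the same thresholds. Below is a minimal Python sketch of the policy table and one objective trigger; the schema, the `classify_checkout` helper, and the SEV-2 threshold are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    response_start_min: int | None   # None = business hours only
    update_cadence_min: int | None   # None = as needed
    required_roles: tuple[str, ...]

# Mirrors the example policy table above.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(5, 10, ("Commander", "Comms", "Ops", "Scribe")),
    "SEV-2": SeverityPolicy(15, 15, ("Commander", "Ops", "Scribe")),
    "SEV-3": SeverityPolicy(30, 30, ("Primary on-call", "Owner")),
    "SEV-4": SeverityPolicy(None, None, ("Service owner",)),
}

def classify_checkout(error_rate: float, duration_min: int) -> str:
    """Objective trigger, e.g. 'checkout error rate above 20% for 10 minutes'."""
    if error_rate > 0.20 and duration_min >= 10:
        return "SEV-1"
    if error_rate > 0.05:  # SEV-2 threshold is an illustrative assumption
        return "SEV-2"
    return "SEV-3"
```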
INCIDENT ROLES
For medium- and high-severity incidents, assign explicit roles:
- Incident Commander: owns decisions and direction
- Communications Lead: owns internal and external updates
- Ops Lead: coordinates mitigation execution
- Scribe: maintains timeline and decision log
Rules:
- One channel, one command
- No major action without owner and ETA
- Avoid role confusion by naming role owners in the channel topic
Field rule: clear roles reduce chaos faster than clever debugging.
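One lightweight way to enforce the channel-topic rule is to render assignments from a single source of truth, so unfilled roles stay visible. A sketch in Python; the `Incident` class and its methods are illustrative assumptions, not a real chat-tool API.

```python
from dataclasses import dataclass, field

ROLES = ("Incident Commander", "Communications Lead", "Ops Lead", "Scribe")

@dataclass
class Incident:
    channel: str
    assignments: dict[str, str] = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = person

    def channel_topic(self) -> str:
        """Render assignments so unfilled roles read as UNASSIGNED."""
        return " | ".join(
            f"{role}: {self.assignments.get(role, 'UNASSIGNED')}" for role in ROLES
        )

incident = Incident(channel="#inc-checkout")
incident.assign("Incident Commander", "Ameer")
print(incident.channel_topic())
```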
COMMAND LOOP (EVERY 10-15 MINUTES)
Run a strict loop:
- Current impact and scope
- Best hypothesis
- Actions in progress
- Decision needed now
- Next update time
This loop keeps everyone synchronized and prevents tunnel vision.
Do not allow uncontrolled parallel experiments. They add hidden risk and destroy timeline clarity.
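The loop is easier to enforce when the Scribe fills the same record every cycle. A hedged sketch of that record, assuming UTC timestamps and a 10-minute default cadence; the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class LoopEntry:
    impact: str            # current impact and scope
    hypothesis: str        # best hypothesis right now
    actions: list[str]     # each action should name an owner and an ETA
    decision_needed: str   # the one decision the commander must make now
    next_update: datetime

def log_loop_entry(impact: str, hypothesis: str, actions: list[str],
                   decision_needed: str, cadence_min: int = 10) -> LoopEntry:
    """One pass of the command loop; the Scribe appends this to the timeline."""
    return LoopEntry(impact, hypothesis, actions, decision_needed,
                     datetime.now(timezone.utc) + timedelta(minutes=cadence_min))
```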
COMMS DISCIPLINE
Strong incident communication is concise, factual, and time-stamped.
Minimum communication pattern:
- Initial alert: what is broken, who is affected, what we are doing
- Cadence updates: every X minutes per severity policy
- Stabilization update: system is contained and monitored
- Resolution update: incident ended, follow-up timing shared
Never communicate fake certainty. State uncertainty explicitly and update as facts improve.
Status Update Template
Use a standard format so updates remain scannable under pressure:
[14:20 UTC] SEV-1 checkout incident
Impact: ~38% checkout failures in web app
Current hypothesis: recent release increased payment timeout rate
Actions in progress: rollback owned by Ameer, gateway check owned by Sara
Next update: 14:30 UTC
The point is not polish. The point is keeping the room aligned on impact, owner, and next decision.
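When updates are produced by a helper rather than typed from memory, the format survives pressure. A minimal sketch that renders the template above; the function name and parameters are assumptions.

```python
from datetime import datetime, timezone

def format_status_update(severity: str, title: str, impact: str,
                         hypothesis: str, actions: list[str],
                         next_update: str) -> str:
    """Render the fixed template above with a UTC timestamp."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return "\n".join([
        f"[{now}] {severity} {title}",
        f"Impact: {impact}",
        f"Current hypothesis: {hypothesis}",
        f"Actions in progress: {', '.join(actions)}",
        f"Next update: {next_update}",
    ])

print(format_status_update(
    "SEV-1", "checkout incident",
    impact="~38% checkout failures in web app",
    hypothesis="recent release increased payment timeout rate",
    actions=["rollback owned by Ameer", "gateway check owned by Sara"],
    next_update="14:30 UTC",
))
```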
FIX STRATEGY UNDER PRESSURE
Choose one strategy deliberately:
- Roll back: safest if recent change likely caused incident
- Fix forward: good when rollback is unsafe or unavailable
- Degrade gracefully: preserve core user value while reducing load
Decision rule:
- Pick the path with fastest safe reduction in user impact
- Preserve optionality for next move
- Document why the chosen path won
Incidents are stabilized by control, not cleverness.
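The decision rule is small enough to encode so the commander can walk through it aloud. A sketch, assuming three yes/no inputs that are answerable at decision time; the inputs themselves are illustrative.

```python
def choose_fix_strategy(recent_change_suspected: bool,
                        rollback_safe: bool,
                        core_path_overloaded: bool) -> str:
    """Pick the path with the fastest safe reduction in user impact."""
    if recent_change_suspected and rollback_safe:
        return "roll back"           # safest when a recent change is the likely cause
    if core_path_overloaded:
        return "degrade gracefully"  # shed non-core load, preserve core user value
    return "fix forward"             # remaining option when rollback is unsafe or unavailable
```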
War-Story Mini-Case: 47-Minute Checkout Incident
Timeline:
- T+0m (14:05): Checkout error rate spikes to 38%; payment authorization timeouts increase.
- T+6m: Three engineers attempt separate fixes in parallel; one partial deploy increases error rate to 44%.
- T+9m: One Incident Commander is assigned; ad-hoc deploys are frozen.
- T+12m: Command loop starts (impact -> hypothesis -> action -> owner -> ETA).
- T+18m: Highest-risk release from the last 30 minutes is rolled back.
- T+27m: Error rate drops below 8%; traffic recovery confirmed by synthetic checks.
- T+47m: Incident closed after two stable update intervals.
Key decisions:
- Chose rollback over fix-forward because blast radius was growing.
- Enforced one command channel to prevent contradictory actions.
- Prioritized user-impact reduction before root-cause deep dive.
Outcome:
- System stabilized 19 minutes after command discipline was established.
- Follow-up added release guardrails and an on-call command-role checklist.
POST-INCIDENT REVIEW CHECKLIST
Every major review should answer:
- What user impact lasted the longest?
- Which assumptions were wrong?
- Which signals were missing or too noisy?
- Which action item prevents recurrence versus only improving response speed?
- Which runbook, alert, or ownership gap made the incident worse?
If the review produces only "monitor more closely," the learning is incomplete.
OUTPUT ARTIFACT
Every major incident should produce:
- Incident timeline with key decisions
- Root cause and contributing factors
- Corrective actions with owner and due date
- Runbook and alert updates
- Follow-up review meeting notes
If post-incident outputs are weak, the same incident returns in a new form.
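One way to keep corrective actions from decaying into vague intentions is a check that every item carries an owner and a due date before the review closes. A sketch; the `ActionItem` shape is an assumption, not a standard postmortem schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str | None = None
    due: date | None = None

def incomplete_action_items(items: list[ActionItem]) -> list[str]:
    """Flag corrective actions missing an owner or a due date."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"no owner: {item.description}")
        if not item.due:
            problems.append(f"no due date: {item.description}")
    return problems
```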