Part IV (e) - Incident Command & On-Call Engineering
HARD TRUTH: INCIDENTS EXPOSE REAL LEADERSHIP
When production is unstable, technical skill matters, but leadership quality matters more.
Teams that perform well in incidents do three things consistently:
- Establish command quickly
- Communicate clearly
- Make reversible decisions under pressure
Incident maturity is not heroics. It is repeatable system behavior under pressure.
SEVERITY MODEL
Define SEV-1 to SEV-4 using objective criteria:
- User impact percentage
- Revenue or operational impact
- Time sensitivity
- Compliance/security implications
Each severity should map to:
- Maximum response start time
- Required roles
- Required stakeholder groups
- Update cadence
Failure pattern: vague severities turn incidents into political debate while customers are still down.
Example Severity Policy
| Severity | Example Impact | Response Start | Update Cadence | Required Roles |
|---|---|---|---|---|
| SEV-1 | Core revenue path or login unavailable for most users | 5 min | Every 10 min | Commander, Comms, Ops, Scribe |
| SEV-2 | Major feature degraded or large region affected | 15 min | Every 15 min | Commander, Ops, Scribe |
| SEV-3 | Partial feature degradation with workaround | 30 min | Every 30 min | Primary on-call, owner |
| SEV-4 | Low-risk issue or internal degradation only | Business hours | As needed | Service owner |
Use examples, not vague adjectives, when defining severity. "Checkout error rate above 20% for 10 minutes" is usable. "Serious outage" is not.
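A severity policy works best when automation and humans read the same thresholds. Below is a minimal Python sketch of the policy table and one objective trigger; the schema, the `classify_checkout` helper, and the SEV-2 threshold are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    response_start_min: int | None   # None = business hours only
    update_cadence_min: int | None   # None = as needed
    required_roles: tuple[str, ...]

# Mirrors the example policy table above.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(5, 10, ("Commander", "Comms", "Ops", "Scribe")),
    "SEV-2": SeverityPolicy(15, 15, ("Commander", "Ops", "Scribe")),
    "SEV-3": SeverityPolicy(30, 30, ("Primary on-call", "Owner")),
    "SEV-4": SeverityPolicy(None, None, ("Service owner",)),
}

def classify_checkout(error_rate: float, duration_min: int) -> str:
    """Objective trigger, e.g. 'checkout error rate above 20% for 10 minutes'."""
    if error_rate > 0.20 and duration_min >= 10:
        return "SEV-1"
    if error_rate > 0.05:  # SEV-2 threshold is an illustrative assumption
        return "SEV-2"
    return "SEV-3"
```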
INCIDENT ROLES
For medium- and high-severity incidents, assign explicit roles:
- Incident Commander: owns decisions and direction
- Communications Lead: owns internal and external updates
- Ops Lead: coordinates mitigation execution
- Scribe: maintains timeline and decision log
Rules:
- One channel, one command
- No major action without owner and ETA
- Avoid role confusion by naming role owners in the channel topic
Field rule: clear roles reduce chaos faster than clever debugging.
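One lightweight way to enforce the channel-topic rule is to render assignments from a single source of truth, so unfilled roles stay visible. A sketch in Python; the `Incident` class and its methods are illustrative assumptions, not a real chat-tool API.

```python
from dataclasses import dataclass, field

ROLES = ("Incident Commander", "Communications Lead", "Ops Lead", "Scribe")

@dataclass
class Incident:
    channel: str
    assignments: dict[str, str] = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = person

    def channel_topic(self) -> str:
        """Render assignments so unfilled roles read as UNASSIGNED."""
        return " | ".join(
            f"{role}: {self.assignments.get(role, 'UNASSIGNED')}" for role in ROLES
        )

incident = Incident(channel="#inc-checkout")
incident.assign("Incident Commander", "Ameer")
print(incident.channel_topic())
```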
COMMAND LOOP (EVERY 10-15 MINUTES)
Run a strict loop:
- Current impact and scope
- Best hypothesis
- Actions in progress
- Decision needed now
- Next update time
This loop keeps everyone synchronized and prevents tunnel vision.
Do not allow uncontrolled parallel experiments. They add hidden risk and destroy timeline clarity.
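The loop is easier to enforce when the Scribe fills the same record every cycle. A hedged sketch of that record, assuming UTC timestamps and a 10-minute default cadence; the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class LoopEntry:
    impact: str            # current impact and scope
    hypothesis: str        # best hypothesis right now
    actions: list[str]     # each action should name an owner and an ETA
    decision_needed: str   # the one decision the commander must make now
    next_update: datetime

def log_loop_entry(impact: str, hypothesis: str, actions: list[str],
                   decision_needed: str, cadence_min: int = 10) -> LoopEntry:
    """One pass of the command loop; the Scribe appends this to the timeline."""
    return LoopEntry(impact, hypothesis, actions, decision_needed,
                     datetime.now(timezone.utc) + timedelta(minutes=cadence_min))
```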
COMMS DISCIPLINE
Strong incident communication is concise, factual, and time-stamped.
Minimum communication pattern:
- Initial alert: what is broken, who is affected, what we are doing
- Cadence updates: every X minutes per severity policy
- Stabilization update: system is contained and monitored
- Resolution update: incident ended, follow-up timing shared
Never communicate fake certainty. State uncertainty explicitly and update as facts improve.
Status Update Template
Use a standard format so updates remain scannable under pressure:
[14:20 UTC] SEV-1 checkout incident
Impact: ~38% checkout failures in web app
Current hypothesis: recent release increased payment timeout rate
Actions in progress: rollback owned by Ameer, gateway check owned by Sara
Next update: 14:30 UTC
The point is not polish. The point is keeping the room aligned on impact, owner, and next decision.
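When updates are produced by a helper rather than typed from memory, the format survives pressure. A minimal sketch that renders the template above; the function name and parameters are assumptions.

```python
from datetime import datetime, timezone

def format_status_update(severity: str, title: str, impact: str,
                         hypothesis: str, actions: list[str],
                         next_update: str) -> str:
    """Render the fixed template above with a UTC timestamp."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return "\n".join([
        f"[{now}] {severity} {title}",
        f"Impact: {impact}",
        f"Current hypothesis: {hypothesis}",
        f"Actions in progress: {', '.join(actions)}",
        f"Next update: {next_update}",
    ])

print(format_status_update(
    "SEV-1", "checkout incident",
    impact="~38% checkout failures in web app",
    hypothesis="recent release increased payment timeout rate",
    actions=["rollback owned by Ameer", "gateway check owned by Sara"],
    next_update="14:30 UTC",
))
```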
FIX STRATEGY UNDER PRESSURE
Choose one strategy deliberately:
- Roll back: safest if recent change likely caused incident
- Fix forward: good when rollback is unsafe or unavailable
- Degrade gracefully: preserve core user value while reducing load
Decision rule:
- Pick the path with fastest safe reduction in user impact
- Preserve optionality for next move
- Document why the chosen path won
Incidents are stabilized by control, not cleverness.
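The decision rule is small enough to encode so the commander can walk through it aloud. A sketch, assuming three yes/no inputs that are answerable at decision time; the inputs themselves are illustrative.

```python
def choose_fix_strategy(recent_change_suspected: bool,
                        rollback_safe: bool,
                        core_path_overloaded: bool) -> str:
    """Pick the path with the fastest safe reduction in user impact."""
    if recent_change_suspected and rollback_safe:
        return "roll back"           # safest when a recent change is the likely cause
    if core_path_overloaded:
        return "degrade gracefully"  # shed non-core load, preserve core user value
    return "fix forward"             # remaining option when rollback is unsafe or unavailable
```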
War-Story Mini-Case: 47-Minute Checkout Incident
Timeline:
- T+0m (14:05): Checkout error rate spikes to 38%; payment authorization timeouts increase.
- T+6m: Three engineers attempt separate fixes in parallel; one partial deploy increases error rate to 44%.
- T+9m: One Incident Commander is assigned; ad-hoc deploys are frozen.
- T+12m: Command loop starts (impact -> hypothesis -> action -> owner -> ETA).
- T+18m: Highest-risk release from the last 30 minutes is rolled back.
- T+27m: Error rate drops below 8%; traffic recovery confirmed by synthetic checks.
- T+47m: Incident closed after two stable update intervals.
Key decisions:
- Chose rollback over fix-forward because blast radius was growing.
- Enforced one command channel to prevent contradictory actions.
- Prioritized user-impact reduction before root-cause deep dive.
Outcome:
- System stabilized 19 minutes after command discipline was established.
- Follow-up added release guardrails and an on-call command-role checklist.
POST-INCIDENT REVIEW CHECKLIST
Every major review should answer:
- What user impact lasted the longest?
- Which assumptions were wrong?
- Which signals were missing or too noisy?
- Which action item prevents recurrence versus only improving response speed?
- Which runbook, alert, or ownership gap made the incident worse?
If the review produces only "monitor more closely," the learning is incomplete.
OUTPUT ARTIFACT
Every major incident should produce:
- Incident timeline with key decisions
- Root cause and contributing factors
- Corrective actions with owner and due date
- Runbook and alert updates
- Follow-up review meeting notes
If post-incident outputs are weak, the same incident returns in a new form.
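One way to keep corrective actions from decaying into vague intentions is a check that every item carries an owner and a due date before the review closes. A sketch; the `ActionItem` shape is an assumption, not a standard postmortem schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str | None = None
    due: date | None = None

def incomplete_action_items(items: list[ActionItem]) -> list[str]:
    """Flag corrective actions missing an owner or a due date."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"no owner: {item.description}")
        if not item.due:
            problems.append(f"no due date: {item.description}")
    return problems
```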