Part VI (e) - Reliability Game Days & Failure Drills
HARD TRUTH: RELIABILITY CLAIMS WITHOUT DRILLS ARE FICTION
Teams often believe they are resilient because their architecture diagrams look good.
Real reliability appears only when systems fail and teams respond under pressure.
Game days are where architecture and human coordination are stress-tested before real incidents do it for you.
BUILD A DRILL CATALOG
Maintain a rotating set of scenarios:
- Region failover
- Primary database degradation
- Message queue backlog growth
- Third-party API outage
- Secrets or credential rotation failure
Each scenario should map to known critical paths and business impact tiers.
Failure pattern: stale drill catalogs produce false confidence.
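A minimal catalog sketch (Python; scenario names, tiers, and the 180-day staleness threshold are hypothetical examples) shows the mapping from scenario to critical path and impact tier, and flags the stale entries that breed false confidence:

```python
# Hypothetical drill catalog: each scenario maps to a critical path and a
# business impact tier so rotation and prioritization stay explicit.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class DrillScenario:
    name: str
    critical_path: str        # e.g. "checkout", "login"
    impact_tier: int          # 1 = highest business impact
    last_run: str | None      # ISO date of last execution, None if never run

CATALOG = [
    DrillScenario("region-failover", "checkout", 1, "2024-01-10"),
    DrillScenario("primary-db-degradation", "checkout", 1, None),
    DrillScenario("queue-backlog-growth", "order-fulfillment", 2, "2023-11-02"),
    DrillScenario("third-party-api-outage", "payments", 1, "2023-08-19"),
    DrillScenario("secret-rotation-failure", "all-services", 2, None),
]

def stale_scenarios(catalog: list[DrillScenario], max_age_days: int = 180) -> list[DrillScenario]:
    """Return scenarios never run or not run recently; these erode confidence first."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [s for s in catalog
            if s.last_run is None or date.fromisoformat(s.last_run) < cutoff]
```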
DRILL DESIGN STANDARD
Every drill should define:
- Scope and target systems
- Hypothesis being tested
- Expected detection signals
- Stopping conditions for safety
- Success and failure criteria
Run drills in controlled windows, with clear blast-radius limits.
Field rule: the objective is controlled learning, not reckless chaos.
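A sketch of the drill definition as a checked structure (hypothetical field names, assuming your team plans drills in code or config) keeps the standard enforceable rather than aspirational:

```python
# Hypothetical drill plan: a drill is not schedulable until every field
# required by the design standard is filled in.
from dataclasses import dataclass

@dataclass
class DrillPlan:
    scope: str                      # target systems and environments
    hypothesis: str                 # what we expect to hold under failure
    expected_signals: list[str]     # alerts and dashboards that should fire
    stop_conditions: list[str]      # conditions that abort the drill immediately
    success_criteria: list[str]
    failure_criteria: list[str]
    window: str                     # controlled window, e.g. "Tue 10:00-12:00 UTC"
    blast_radius: str               # explicit limit, e.g. "5% of shadow traffic"

    def is_schedulable(self) -> bool:
        """Every field must be non-empty before the drill goes on the calendar."""
        return all([
            self.scope, self.hypothesis, self.expected_signals,
            self.stop_conditions, self.success_criteria,
            self.failure_criteria, self.window, self.blast_radius,
        ])
```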
EXECUTION PROTOCOL
Execute drills like real incidents:
- Assign incident roles
- Use a dedicated command channel
- Keep a decision timeline
- Record action owner and ETA
- Validate runbooks and dashboards live
Do not skip communication "because it is a drill." That is exactly where real incidents later fail.
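A minimal timeline-logging sketch (hypothetical names and roles) shows the record the command channel should produce: who decided what, when, and who owns the next action with an ETA.

```python
# Hypothetical decision-timeline log kept during the drill, mirroring how a
# real incident channel is recorded.
from datetime import datetime, timezone

class DrillTimeline:
    def __init__(self, drill_name: str, roles: dict[str, str]):
        self.drill_name = drill_name
        self.roles = roles          # e.g. {"incident_commander": "alice", "comms": "bob"}
        self.entries: list[dict] = []

    def log(self, event: str, owner: str | None = None, eta_minutes: int | None = None):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "event": event,
            "owner": owner,
            "eta_minutes": eta_minutes,
        })

# Usage during the drill, in the dedicated command channel:
timeline = DrillTimeline("region-failover", {"incident_commander": "alice", "comms": "bob"})
timeline.log("Health checks confirm primary region degraded")
timeline.log("Initiate DNS failover via runbook step 4", owner="carol", eta_minutes=5)
```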
SCORING MODEL
Score each drill on:
- Time to detect
- Time to mitigate
- Time to recover
- User impact observed
- Communication quality
Add a simple grade:
- Green: ready
- Yellow: partial readiness
- Red: high operational risk
Without scoring, drills become theatre.
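A minimal grading sketch (thresholds are illustrative assumptions; set them from your SLOs) makes the grade mechanical instead of negotiable:

```python
# Hypothetical scorecard: the grade is derived from measured times and
# observed impact, not from how the drill felt in the room.
def grade_drill(detect_min: float, mitigate_min: float, recover_min: float,
                user_impact_observed: bool, comms_ok: bool,
                rto_min: float = 15.0) -> str:
    """Return 'green', 'yellow', or 'red' for a single drill run."""
    if recover_min > rto_min or not comms_ok:
        return "red"        # RTO miss or broken communication is an automatic red
    if detect_min > 5 or mitigate_min > 10 or user_impact_observed:
        return "yellow"     # detected and recovered, but readiness is partial
    return "green"

# Example: a 42-minute recovery against a 15-minute RTO (as in the mini-case
# below) grades red regardless of how fast detection was.
print(grade_drill(detect_min=4, mitigate_min=19, recover_min=42,
                  user_impact_observed=True, comms_ok=True))   # -> red
```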
IMPROVEMENT LOOP
After each drill:
- Capture top 3 systemic gaps
- Assign owners and deadlines
- Track closure in reliability backlog
- Re-run critical scenario to verify fix
Repeatability matters more than one good day.
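A small gap-record sketch (hypothetical fields) makes the closure rule explicit: a finding is only done once the fix is verified by a repeat drill.

```python
# Hypothetical reliability-backlog entry: every drill finding gets an owner,
# a deadline, and a closure check tied to a re-run.
from dataclasses import dataclass

@dataclass
class ReliabilityGap:
    drill: str
    description: str
    owner: str
    deadline: str                     # ISO date
    closed: bool = False
    verified_by_rerun: bool = False   # fix counts only after the scenario is re-run

def open_gaps(backlog: list[ReliabilityGap]) -> list[ReliabilityGap]:
    """Gaps stay open until the fix is both closed and verified by a repeat drill."""
    return [g for g in backlog if not (g.closed and g.verified_by_rerun)]
```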
War-Story Mini-Case: DR Plan Failed on First Drill
Timeline:
- T+0m: Planned region failover drill begins on staging-production shadow traffic.
- T+4m: Health checks fail over correctly, but public traffic remains pinned due to stale DNS records.
- T+11m: Team discovers documented TTL (30s) differs from actual (300s) at provider layer.
- T+19m: Manual override runbook step is missing; escalation path invoked.
- T+42m: Service fully restored, but drill graded Red for recovery-time objective miss.
- T+14 days: Drill re-run after fixes; full recovery in 11m, graded Green.
Key decisions:
- Treated the drill as a real incident with role discipline and timeline logging.
- Updated runbook with exact operator commands, not high-level instructions.
- Added DNS propagation verification to mandatory pre-failover checks.
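One of those fixes can be sketched directly. The check below assumes the third-party dnspython package; the record name, expected TTL, and failover target are hypothetical examples:

```python
# Minimal pre-failover DNS check: compare the TTL the provider actually
# serves against what the runbook documents, and confirm traffic has not
# already been repointed.
import dns.resolver

def verify_dns_before_failover(name: str, expected_ttl: int, failover_ip: str) -> list[str]:
    """Return a list of problems; an empty list means DNS is ready to fail over."""
    problems = []
    answer = dns.resolver.resolve(name, "A")
    actual_ttl = answer.rrset.ttl
    if actual_ttl > expected_ttl:
        problems.append(f"TTL mismatch: documented {expected_ttl}s, provider serves {actual_ttl}s")
    addresses = {r.address for r in answer}
    if failover_ip in addresses:
        problems.append(f"{name} already resolves to failover target {failover_ip}")
    return problems

# print(verify_dns_before_failover("api.example.com", expected_ttl=30, failover_ip="203.0.113.10"))
```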
Outcome:
- Recovery window dropped from 42m to 11m.
- Reliability confidence moved from assumed to demonstrated.
OUTPUT ARTIFACT
Quarterly reliability package:
- Game day calendar
- Drill scorecards
- Reliability gap backlog
- Closure status from previous quarter
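A small roll-up sketch (hypothetical structure, reusing the grades from the scoring model above) shows the shape of the quarterly summary:

```python
# Hypothetical quarterly roll-up: one row per drill run, plus the count of
# gaps still open from the previous quarter's backlog.
def quarterly_summary(scorecards: list[dict], open_gap_count: int) -> dict:
    grades = [c["grade"] for c in scorecards]
    return {
        "drills_run": len(scorecards),
        "green": grades.count("green"),
        "yellow": grades.count("yellow"),
        "red": grades.count("red"),
        "open_gaps_from_last_quarter": open_gap_count,
    }

print(quarterly_summary(
    [{"drill": "region-failover", "grade": "red"},
     {"drill": "queue-backlog-growth", "grade": "green"}],
    open_gap_count=3,
))
```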
This turns reliability from reactive culture into operating discipline.