Part VI (e) - Reliability Game Days & Failure Drills

HARD TRUTH: RELIABILITY CLAIMS WITHOUT DRILLS ARE FICTION

Teams often believe they are resilient because their architecture diagrams look good.

Real reliability appears only when systems fail and teams respond under pressure.

Game days are where architecture and human coordination are stress-tested before real incidents do it for you.


BUILD A DRILL CATALOG

Maintain a rotating set of scenarios:

  • Region failover
  • Primary database degradation
  • Message queue backlog growth
  • Third-party API outage
  • Secrets or credential rotation failure

Each scenario should map to known critical paths and business impact tiers.

Failure pattern: stale drill catalogs produce false confidence.
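
One way to keep the catalog from going stale, sketched below with assumed field names, example entries, and a 90-day freshness window: give each entry its critical path, impact tier, and last-run date, and flag anything that has not been exercised recently.

  from datetime import date, timedelta

  # Hypothetical catalog entries: each scenario maps to a critical path,
  # a business impact tier, and the date it was last exercised.
  CATALOG = [
      {"scenario": "region failover", "critical_path": "checkout",
       "tier": 1, "last_run": date(2024, 1, 15)},
      {"scenario": "third-party API outage", "critical_path": "payments",
       "tier": 1, "last_run": date(2023, 6, 2)},
  ]

  STALE_AFTER = timedelta(days=90)  # assumed freshness window

  def stale_scenarios(catalog, today=None):
      """Return scenarios not exercised within the freshness window."""
      today = today or date.today()
      return [e["scenario"] for e in catalog if today - e["last_run"] > STALE_AFTER]

A scenario that falls out of the window goes back into the rotation before its result is trusted again.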


DRILL DESIGN STANDARD

Every drill should define:

  • Scope and target systems
  • Hypothesis being tested
  • Expected detection signals
  • Stopping conditions for safety
  • Success and failure criteria

Run drills in controlled windows, with clear blast-radius limits.

Field rule: the objective is controlled learning, not reckless chaos.
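
A minimal sketch of the drill definition as a typed artifact rather than a wiki page; the schema, field names, and example values are assumptions that mirror the standard above, with the blast radius made explicit:

  from dataclasses import dataclass

  # Hypothetical drill spec; fields mirror the design standard above.
  @dataclass
  class DrillSpec:
      scope: str                    # target systems
      hypothesis: str               # what we expect to hold under failure
      detection_signals: list[str]  # alerts/dashboards expected to fire
      stop_conditions: list[str]    # abort immediately if any is met
      success_criteria: str
      failure_criteria: str
      blast_radius: str             # controlled window / traffic slice

  spec = DrillSpec(
      scope="primary database degradation",
      hypothesis="read replicas absorb load within 2 minutes",
      detection_signals=["db_latency_p99 alert", "replica lag dashboard"],
      stop_conditions=["user-facing error rate exceeds 5%"],
      success_criteria="no user-visible errors; reads fail over under 2 minutes",
      failure_criteria="manual intervention required to restore reads",
      blast_radius="staging shadow traffic only",
  )

If a drill cannot fill in every field, it is not ready to run.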


EXECUTION PROTOCOL

Execute drills like real incidents:

  • Assign incident roles
  • Use a dedicated command channel
  • Keep a decision timeline
  • Record an owner and ETA for each action
  • Validate runbooks and dashboards live

Do not skip communication "because it is a drill." That is exactly where real incidents later fail.
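
The decision timeline needs no special tooling; an append-only log with a timestamp, owner, and ETA per action is enough. A minimal sketch, all names assumed:

  from datetime import datetime, timezone

  timeline = []  # append-only decision log for the drill

  def log_decision(action, owner, eta_minutes):
      """Record each action with a timestamp, an owner, and an ETA."""
      timeline.append({
          "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
          "action": action,
          "owner": owner,
          "eta_min": eta_minutes,
      })

  log_decision("shift traffic to secondary region", owner="alice", eta_minutes=5)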


SCORING MODEL

Score each drill on:

  • Time to detect
  • Time to mitigate
  • Time to recover
  • User impact observed
  • Communication quality

Add a simple grade:

  • Green: ready
  • Yellow: partial readiness
  • Red: high operational risk

Without scoring, drills become theatre.
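
A sketch of how the grade can be computed from the measured numbers instead of asserted after the fact; the thresholds are illustrative assumptions that each service tier would tune:

  # Hypothetical thresholds; tune per service tier and recovery objectives.
  def grade(detect_min, mitigate_min, recover_min, user_impact, comms_ok):
      if not comms_ok or recover_min > 30 or user_impact == "severe":
          return "Red"     # high operational risk
      if detect_min > 5 or mitigate_min > 15 or user_impact == "minor":
          return "Yellow"  # partial readiness
      return "Green"       # ready

  # The first run in the mini-case below (42m recovery) would grade Red.
  print(grade(detect_min=4, mitigate_min=19, recover_min=42,
              user_impact="none", comms_ok=True))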


IMPROVEMENT LOOP

After each drill:

  • Capture the top three systemic gaps
  • Assign owners and deadlines
  • Track closure in the reliability backlog
  • Re-run the critical scenario to verify each fix

Repeatability matters more than one good day.
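
Closure should be mechanical, not declared. A minimal sketch (record shape and grades are assumptions, consistent with the scoring model above) in which a gap stays open until its re-run passes:

  # A gap is closed only when the fix is merged AND the re-run graded Green.
  def is_closed(gap):
      return gap["fix_merged"] and gap["rerun_grade"] == "Green"

  backlog = [
      {"id": "GAP-12", "owner": "bob",   "fix_merged": True, "rerun_grade": "Green"},
      {"id": "GAP-13", "owner": "carol", "fix_merged": True, "rerun_grade": None},
  ]
  open_gaps = [g["id"] for g in backlog if not is_closed(g)]  # ["GAP-13"]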


War-Story Mini-Case: DR Plan Failed on First Drill

Timeline:

  • T+0m: Planned region failover drill begins against production shadow traffic in staging.
  • T+4m: Health checks fail over correctly, but public traffic remains pinned due to stale DNS records.
  • T+11m: Team discovers the documented DNS TTL (30s) differs from the actual value (300s) at the provider layer.
  • T+19m: Manual override runbook step is missing; escalation path invoked.
  • T+42m: Service fully restored, but drill graded Red for recovery-time objective miss.
  • T+14 days: Drill re-run after fixes; full recovery in 11m, graded Green.

Key decisions:

  • Treated the drill as a real incident with role discipline and timeline logging.
  • Updated runbook with exact operator commands, not high-level instructions.
  • Added DNS propagation verification to mandatory pre-failover checks, as sketched below.
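
A sketch of what that pre-failover check might look like, using the dnspython library; the zone name is hypothetical, and the 30-second target is the documented TTL from this case:

  import dns.resolver  # third-party: pip install dnspython

  EXPECTED_TTL = 30  # documented TTL; the drill found the provider serving 300s

  def verify_ttl(name, record_type="A"):
      """Fail the pre-flight check if the served TTL exceeds the documented value."""
      answer = dns.resolver.resolve(name, record_type)
      actual = answer.rrset.ttl
      if actual > EXPECTED_TTL:
          raise RuntimeError(
              f"{name}: served TTL {actual}s exceeds documented "
              f"{EXPECTED_TTL}s; failover will lag until caches expire")
      return actual

  verify_ttl("app.example.com")  # hypothetical zone; run before any failover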

Outcome:

  • Recovery window dropped from 42m to 11m.
  • Reliability confidence moved from assumed to demonstrated.

OUTPUT ARTIFACT

Quarterly reliability package:

  • Game day calendar
  • Drill scorecards
  • Reliability gap backlog
  • Closure status from previous quarter

This turns reliability from reactive culture into operating discipline.