CASE STUDY - Part VI: Disaster Recovery Game Day (Multi-Region Failover)
SECTION 0 - SCENARIO
Your primary region is down.
You have 30 minutes to restore service.
This is not a cloud question. It's a systems + operations question.
SECTION 1 - DEFINE YOUR SLO + RTO/RPO
- RTO (Recovery Time Objective): the maximum time you may take to restore service
- RPO (Recovery Point Objective): the maximum window of data loss you can accept
Senior rule:
Multi-region is meaningless without explicit RTO/RPO.
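A minimal sketch of what "explicit" can mean in practice: write the objectives down as data and score each failover (real or simulated) against them. The names `RecoveryObjectives` and `evaluate_recovery` are hypothetical, the 30-minute RTO comes from the scenario, and the 5-minute RPO is an assumption for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost writes


def evaluate_recovery(objectives, outage_start, service_restored, last_replicated_write):
    """Compare one failover (real or simulated) against the objectives."""
    downtime = service_restored - outage_start
    # Writes accepted by the primary after the last replicated write are lost.
    data_loss_window = max(outage_start - last_replicated_write, timedelta(0))
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Illustrative numbers only (the RPO value is an assumption).
objectives = RecoveryObjectives(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
outage_start = datetime(2024, 1, 1, 12, 0, 0)
result = evaluate_recovery(
    objectives,
    outage_start=outage_start,
    service_restored=outage_start + timedelta(minutes=22),
    last_replicated_write=outage_start - timedelta(seconds=40),
)
print(result["rto_met"], result["rpo_met"])
```

Recording both numbers after every game day turns "we have multi-region" into a claim you can check.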
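SECTION 2 - ARCHITECTURE OPTIONS
- active-passive: the secondary region stays on standby and takes traffic only after failover
- active-active: both regions serve traffic at all times
Tradeoff:
- active-active costs more but reduces RTO, because the surviving region is already serving traffic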
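The RTO side of the tradeoff can be roughed out as a sum of sequential failover phases. This is a back-of-the-envelope sketch: the phase breakdown and every number below are assumptions for illustration, not measurements, and `estimated_rto_seconds` is a hypothetical helper.

```python
def estimated_rto_seconds(detection, decision, promotion, dns_ttl, warmup):
    """Sum the sequential phases of a failover; all inputs are in seconds."""
    return detection + decision + promotion + dns_ttl + warmup


# Active-passive (assumed values): a human decides, the standby database is
# promoted, clients wait out the DNS TTL, and the standby warms up.
active_passive = estimated_rto_seconds(
    detection=120, decision=300, promotion=180, dns_ttl=300, warmup=240
)

# Active-active (assumed values): the healthy region is already serving, so
# only detection and a short traffic-shift delay remain.
active_active = estimated_rto_seconds(
    detection=120, decision=0, promotion=0, dns_ttl=60, warmup=0
)

print(active_passive / 60, active_active / 60)  # minutes
```

Replace the assumed values with timings from your own game day; the point is to see which phase dominates the budget.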
SECTION 3 - THE GAME DAY PLAN
Checklist:
- simulate primary region outage
- validate DNS / traffic shift
- validate DB replication and promotion
- validate secrets/config in secondary
- validate background workers resume safely
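The checklist above can be run as an ordered, timed sequence so the game day produces a record instead of a memory. A minimal sketch: `run_game_day` is a hypothetical helper, and the lambdas are stand-ins for real checks you would write against your own infrastructure.

```python
import time


def run_game_day(steps, budget_seconds, clock=time.monotonic):
    """Run (name, check) pairs in order, timing each; stop at the first failure."""
    started = clock()
    results = []
    for name, check in steps:
        step_start = clock()
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failed step
        results.append({"step": name, "ok": ok, "seconds": clock() - step_start})
        if not ok:
            break  # later steps depend on earlier ones (e.g. promotion before workers)
    elapsed = clock() - started
    all_ok = all(r["ok"] for r in results)
    return {
        "results": results,
        "elapsed": elapsed,
        "passed": all_ok and elapsed <= budget_seconds,
    }


# Stand-in checks; each lambda is a placeholder for a real validation.
steps = [
    ("simulate primary region outage", lambda: True),
    ("validate DNS / traffic shift", lambda: True),
    ("validate DB replication and promotion", lambda: True),
    ("validate secrets/config in secondary", lambda: False),
    ("validate background workers resume safely", lambda: True),
]
report = run_game_day(steps, budget_seconds=30 * 60)
for r in report["results"]:
    print(f'{r["step"]}: {"ok" if r["ok"] else "FAILED"}')
```

Stopping at the first failure is a design choice: it keeps the record honest about how far the failover actually got.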
SECTION 4 - COMMON FAILURE MODES
- DNS TTL too high → traffic shift slow
- workers double-run jobs → idempotency missing
- stale secrets in secondary
- database replication lag → violates RPO
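For the double-run failure mode, one common guard is an idempotency key claimed under a uniqueness constraint. A minimal sketch, assuming the job's effect and the key live in the same database; SQLite in memory is used only to keep the example self-contained, and `make_store` / `run_once` are hypothetical names.

```python
import sqlite3


def make_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_jobs (idempotency_key TEXT PRIMARY KEY)")
    return conn


def run_once(conn, idempotency_key, handler):
    """Run handler only if the key is unclaimed; return True if it ran."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT INTO processed_jobs (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
            handler()
    except sqlite3.IntegrityError:
        return False  # key already claimed: another worker ran this job
    return True


calls = []
conn = make_store()
print(run_once(conn, "invoice-42", lambda: calls.append("charged")))  # first delivery
print(run_once(conn, "invoice-42", lambda: calls.append("charged")))  # duplicate after failover
print(calls)
```

Two caveats: side effects outside the database (emails, payments) are not rolled back with the transaction, and in a real failover the key store has to be one that workers in both regions can see.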
SECTION 5 - OBSERVABILITY
- replication lag metrics
- synthetic probes per region
- runbook with decision points
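A rough sketch of the first two signals: a per-region synthetic probe and a replication lag check scored against the RPO. The endpoint URLs, the 50% warning threshold, and the function names are all assumptions; the probe uses only the standard library.

```python
from urllib.error import URLError
from urllib.request import urlopen


def probe(url, timeout=3.0):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, ValueError):
        return False


def region_report(endpoints, fetch=probe):
    """Probe each region's health endpoint; `fetch` is injectable for tests."""
    return {region: fetch(url) for region, url in endpoints.items()}


def lag_status(lag_seconds, rpo_seconds, warn_fraction=0.5):
    """Classify replication lag relative to the RPO."""
    if lag_seconds > rpo_seconds:
        return "breach"  # promoting now would lose more data than the RPO allows
    if lag_seconds >= warn_fraction * rpo_seconds:
        return "warn"
    return "ok"


# Hypothetical endpoints; a fake fetch stands in for the network here.
endpoints = {
    "primary": "https://primary.example.invalid/healthz",
    "secondary": "https://secondary.example.invalid/healthz",
}
report = region_report(endpoints, fetch=lambda url: "secondary" in url)
print(report, lag_status(lag_seconds=45, rpo_seconds=60))
```

Run the probes from outside both regions where possible, so a probe outage is not mistaken for a region outage.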
SECTION 6 - EXERCISE
Write your failover runbook:
- who decides?
- what triggers failover?
- how do you validate correctness after failover?
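As a starting point for the "what triggers failover?" question, a decision point can be written as a small pure function so the rule is reviewable before the incident. This is one possible shape, not a recommendation: the thresholds, the `Signals` fields, and the outcome names are assumptions, and the final approval still belongs to whoever your runbook names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    consecutive_primary_failures: int
    secondary_healthy: bool
    replication_lag_seconds: float


def failover_decision(signals, failure_threshold=3, rpo_seconds=300.0):
    """Return (action, reason); thresholds are illustrative defaults."""
    if signals.consecutive_primary_failures < failure_threshold:
        return ("hold", "primary has not failed enough consecutive probes")
    if not signals.secondary_healthy:
        return ("escalate", "secondary is unhealthy; failover would not restore service")
    if signals.replication_lag_seconds > rpo_seconds:
        return ("escalate", "promotion would exceed the RPO; needs an explicit human call")
    return ("recommend_failover", "trigger met and secondary is ready; awaiting approval")


action, reason = failover_decision(
    Signals(consecutive_primary_failures=5, secondary_healthy=True, replication_lag_seconds=20.0)
)
print(action, "-", reason)
```

Each branch maps to a runbook decision point, which makes it easy to spot cases the runbook has not covered.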