CASE STUDY — Part VI: Disaster Recovery Game Day (Multi-Region Failover)

SCENARIO

Your primary region is down.

You have 30 minutes to restore service.

This is not a cloud question. It's a systems + operations question.

Real-world context: Major cloud outages happen. AWS us-east-1 has had multi-hour outages. Azure and GCP have had region-wide incidents. When your primary region goes dark, customers can't log in, transactions fail, and revenue stops. The business impact is measured in dollars per minute and customer trust. A well-rehearsed failover can cut recovery from hours to minutes. An unrehearsed one turns into chaos: wrong DNS, missing secrets, unreplicated data, and teams arguing over who decides.

Senior rule:

Multi-region is meaningless without explicit RTO/RPO and a practiced game day.


DEFINE YOUR SLO + RTO/RPO

  • RTO (Recovery Time Objective): How fast you must recover. "We will restore service within 30 minutes."

  • RPO (Recovery Point Objective): How much data loss is acceptable. "We accept up to 5 minutes of data loss" means you need replication lag ≤ 5 min.

SLO calculation example: If your SLA promises 99.9% uptime, you have ~43 minutes of downtime per month. A single 2-hour outage blows the SLA. Your RTO must be well under 43 minutes to leave room for other incidents.
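
A minimal sketch of that arithmetic in Python, so you can plug in your own SLA (the 30-day month is a simplifying assumption):

# Error-budget arithmetic: downtime allowed per month at a given SLA.
# Assumes a 30-day month; use 30.44 days for a calendar-year average.
minutes_per_month = 30 * 24 * 60            # 43,200 minutes
for sla in (0.999, 0.9995, 0.9999):
    budget = minutes_per_month * (1 - sla)  # minutes of allowed downtime
    print(f"SLA {sla:.2%} -> {budget:.1f} min/month downtime budget")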

RPO/RTO tradeoff table:

RPO         | RTO       | Typical approach                        | Cost
0 (no loss) | 15 min    | Synchronous replication, active-active  | High
1–5 min     | 30 min    | Async replication, warm standby         | Medium
1 hour      | 1–2 hours | Daily backups, cold standby             | Low


ARCHITECTURE OPTIONS

Active-passive: Primary serves all traffic. Secondary has replicated data and standby compute. On failover, you promote the secondary to primary and shift traffic. Cheaper (no full-scale fleet sits idle in the secondary), but RTO is higher: you must start services, promote the DB, and update DNS.

Active-active: Both regions serve traffic. Data is replicated both ways. On failover, you stop writing to the failed region and shift traffic. Lower RTO (the secondary is already warm), but higher cost and complexity (conflict resolution, split-brain risk).

Warm standby: A middle ground. The secondary runs a live DB replica and pre-provisioned (but scaled-down) compute. On failover, you scale up compute and promote the DB. RTO lands between cold standby and active-active.

ASCII: Active-Passive

            ┌─────────────────┐
            │    DNS / LB     │
            │    (primary)    │
            └────────┬────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│    App     │ │    App     │ │     DB     │
│ (Primary)  │ │ (Primary)  │ │ (Primary)  │
└────────────┘ └────────────┘ └─────┬──────┘
                                    │ async
                                    │ replicate
                                    ▼
                              ┌────────────┐
                              │     DB     │
                              │ (Standby)  │
                              └────────────┘

ASCII: Active-Active

┌─────────────────────────────────────────────┐
│         Global DNS / Traffic Manager        │
└─────────┬────────────────────────┬──────────┘
          │                        │
          ▼                        ▼
┌────────────────────┐   ┌────────────────────┐
│      Region A      │   │      Region B      │
│ App + DB (active)  │◄─►│ App + DB (active)  │
└────────────────────┘   └────────────────────┘
         (bidirectional replication)

Tradeoff: Active-active costs more but reduces RTO. Active-passive is simpler but requires more time to fail over.
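
Both paths end with a DNS or traffic-manager update, so record TTLs bound how fast clients actually move. A minimal pre-flight TTL check, assuming dnspython is installed (the record name is a placeholder):

import dns.resolver

RECORD = "app.example.com"   # placeholder: your failover-critical record
MAX_TTL = 300                # seconds; 60-300s is typical for failover records

answer = dns.resolver.resolve(RECORD, "A")
ttl = answer.rrset.ttl       # TTL of the returned record set
print(f"{RECORD} A-record TTL = {ttl}s")
if ttl > MAX_TTL:
    print(f"WARNING: TTL > {MAX_TTL}s; clients may cache stale IPs during failover.")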


THE GAME DAY PLAN

Checklist with timing (example 30-min RTO; a script sketch follows the table):

Time | Action                                                              | Owner
T+0  | Declare incident. War room opens.                                   | Incident commander
T+2  | Confirm primary region unreachable (synthetic probes, status page). | SRE
T+5  | Decision: execute failover. Notify stakeholders.                    | Incident commander
T+8  | Promote secondary DB to primary. Verify replication caught up.      | DBA / SRE
T+12 | Validate secrets/config in secondary (DB credentials, API keys).    | SRE
T+15 | Update DNS / traffic manager to point to secondary.                 | SRE
T+18 | Validate background workers resume safely (no double-run).          | Backend
T+22 | Synthetic probes green. Smoke test critical paths.                  | QA / SRE
T+25 | All-clear. Post-incident doc started.                               | Incident commander
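
The same checklist can live as runbook-as-code. A sketch: every helper below is a hypothetical stand-in for your real tooling; the value is the enforced ordering, especially the gate between DB promotion and the traffic shift.

import time

START = time.monotonic()

def step(label: str) -> None:
    """Log an action with elapsed time since the incident was declared."""
    print(f"[T+{time.monotonic() - START:4.0f}s] {label}")

def primary_confirmed_down() -> bool:
    step("Confirm primary unreachable (synthetic probes, status page)")
    return True  # placeholder: wire up your real probes here

def promote_secondary_db() -> None:
    step("Promote secondary DB (only after the replication-lag check passes)")

def validate_secrets() -> None:
    step("Validate secrets/config readable in secondary")

def shift_traffic() -> None:
    step("Update DNS / traffic manager to point at secondary")

def smoke_test() -> None:
    step("Smoke test critical paths before declaring all-clear")

if primary_confirmed_down():
    promote_secondary_db()
    validate_secrets()   # must pass BEFORE the traffic shift
    shift_traffic()
    smoke_test()
    step("All-clear; start the post-incident doc")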

Communication protocol: Use a dedicated Slack channel or PagerDuty conference bridge. One person (incident commander) makes decisions. Others execute. No parallel "let me try this" without coordination—that causes split-brain and duplicate work.

Secrets validation: Secondary must have the same secrets as primary (DB passwords, third-party API keys). If secrets are in a vault, ensure secondary region can read them. Rotate any secrets that were primary-only.
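
One way to make that validation concrete, assuming AWS Secrets Manager via boto3 (secret names and region below are placeholders; adapt for Vault or whatever store you use):

import boto3
from botocore.exceptions import ClientError

CRITICAL_SECRETS = ["prod/db/password", "prod/payments/api-key"]  # placeholders

def unreadable_secrets(region, secret_ids):
    """Return (secret, error-code) pairs this region cannot read."""
    client = boto3.client("secretsmanager", region_name=region)
    failures = []
    for sid in secret_ids:
        try:
            client.get_secret_value(SecretId=sid)
        except ClientError as exc:
            failures.append((sid, exc.response["Error"]["Code"]))
    return failures

failed = unreadable_secrets("us-west-2", CRITICAL_SECRETS)  # placeholder region
if failed:
    raise SystemExit(f"Secondary cannot read: {failed}")
print("All critical secrets readable in secondary.")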


COMMON FAILURE MODES

Failure                         | What goes wrong                                                         | How to test
DNS TTL too high                | Traffic shift is slow. Clients cache old IPs for hours.                | Run dig from multiple locations; measure TTL. Set TTL to 60–300s for critical records.
Workers double-run jobs         | After failover, both regions think they own the queue. Jobs run twice. | Use distributed locks or a single job queue. Test: simulate failover, verify job count.
Stale secrets in secondary      | DB promotion works, but app can't connect—wrong password.              | Game day: verify app in secondary can connect to promoted DB before traffic shift.
Database replication lag        | Promote before replication catches up. Data loss exceeds RPO.          | Monitor replication_lag_seconds. In game day, simulate lag and verify you wait before promoting.
Split-brain                     | Both regions accept writes. Data diverges. Merge is painful.           | Design: only one writer for each shard. Use consensus (e.g., etcd) for leader election.
Background jobs in wrong region | Cron in secondary fires for "primary-only" jobs. Duplicates or errors. | Document which jobs run where. Use region-aware job scheduler.
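
For the double-run row above, a common guard is a TTL'd lock in a store both regions share. A minimal sketch with redis-py (host and job name are placeholders); SET with nx=True and ex=... is atomic, so only one region can claim a given job:

import uuid
import redis

r = redis.Redis(host="redis.shared.internal", port=6379)  # placeholder host

def try_claim(job_id, ttl_seconds=300):
    """Atomically claim a job; False means another region already owns it."""
    token = str(uuid.uuid4())
    # SET key value NX EX: succeeds only if the key does not exist yet.
    return bool(r.set(f"job-lock:{job_id}", token, nx=True, ex=ttl_seconds))

if try_claim("nightly-billing-run"):
    print("This region owns the job; running it.")
else:
    print("Another region already claimed the job; skipping.")
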

OBSERVABILITY

Metrics to monitor:

  • Replication lag (seconds behind primary): Alert if > RPO threshold (see the lag-gate sketch after this list).

  • Synthetic probes per region: HTTP checks from external locations. Alert if primary fails; use to trigger runbook.

  • Write throughput in secondary: After promotion, ensure it can handle load.
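
A sketch of that lag gate, assuming a PostgreSQL streaming replica and psycopg2 (the DSN is a placeholder); run it against the secondary before any promotion:

import psycopg2

RPO_SECONDS = 300  # the 5-minute RPO from the tradeoff table

def replica_lag_seconds(dsn):
    """Seconds behind primary, via pg_last_xact_replay_timestamp()."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM "
            "(now() - pg_last_xact_replay_timestamp())), 0)"
        )
        return float(cur.fetchone()[0])

lag = replica_lag_seconds("host=db.secondary.internal dbname=app")  # placeholder DSN
if lag > RPO_SECONDS:
    raise SystemExit(f"Lag {lag:.0f}s exceeds RPO; do NOT promote yet.")
print(f"Lag {lag:.0f}s within RPO; safe to promote.")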

Runbook excerpt:

## Failover Decision Criteria
- Primary region unreachable for > 5 minutes (synthetic probes)
- OR: Primary DB down, replication lag within RPO
- OR: Executive decision (e.g., security incident)

## Pre-Failover Checks
- [ ] Replication lag < RPO (e.g., 5 min)
- [ ] Secondary DB disk space sufficient
- [ ] Secrets validated in secondary

## Rollback (if failover fails)
- [ ] Revert DNS to primary
- [ ] Demote secondary DB back to replica
- [ ] Investigate before retry
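
And a minimal external synthetic probe to feed the decision criteria above, assuming plain HTTP health endpoints (URLs are placeholders):

import requests

ENDPOINTS = {
    "primary":   "https://primary.example.com/healthz",    # placeholder URLs
    "secondary": "https://secondary.example.com/healthz",
}

def probe(url, timeout=5.0):
    """True if the region answers 200 within the timeout."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

for region, url in ENDPOINTS.items():
    print(region, "UP" if probe(url) else "DOWN")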

EXERCISE

Write your failover runbook. Include:

  1. Who decides? Name the role (e.g., Incident Commander, SRE Lead). What's the escalation path if they're unavailable?

  2. What triggers failover? List concrete criteria (synthetic probe failure, DB unreachable, manual override). How long do you wait before declaring primary dead?

  3. How do you validate correctness after failover? Which checks prove that reads and writes work? How do you verify no duplicate orders, no missing data?

  4. Rollback plan: If secondary fails under load or has data issues, how do you revert? What's the risk of flipping back to primary that may still be unhealthy?

  5. Post-failover: When do you fail back to primary? What's the process to re-establish replication and shift traffic back?

  6. Data consistency check: After failover, how do you verify no data was lost or corrupted? Consider checksum validation, record counts, or sampling critical tables.

  7. Communication plan: Who notifies customers? Who updates status page? Draft the external communication template for "We are experiencing issues and have activated our secondary region."


🏁 END — PART VI CASE STUDY