High Availability, Scaling & Reliability Engineering
SECTION 1 — HIGH AVAILABILITY (HA) MINDSET
High availability doesn’t mean “system never goes down.”
It means:
“The system continues to operate well enough even when parts of it are broken.”
Availability is usually expressed as “nines”:
- 99% → ~3.65 days/year downtime
- 99.9% → ~8.76 hours/year
- 99.99% → ~52.6 minutes/year
- 99.999% → ~5.26 minutes/year
Higher nines = exponentially more cost + complexity.
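The downtime numbers above fall straight out of the arithmetic; a minimal sketch:

```python
# Rough downtime-per-year allowed by each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes(nines):.2f} min/year")
```

Each extra nine cuts the allowed downtime by 10x, which is why each one costs disproportionately more to achieve.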
Key HA principles:
- Remove single points of failure (SPOFs)
- Use redundancy (instances, zones, regions)
- Design for failure (degraded modes)
- Detect problems fast (observability)
- Recover automatically (self-healing patterns)
SECTION 2 — SCALING PATTERNS
1. Vertical vs Horizontal Scaling
Vertical (scale up)
- More CPU, more RAM on a single machine
- Simple, but limited ceiling
Horizontal (scale out)
- More machines
- Needs load balancing, statelessness, coordination
- Harder, but scales much further
Real-world systems almost always rely on horizontal scaling.
2. Stateless vs Stateful
Stateless services:
- no user-specific data stored in memory between requests
- easy to scale horizontally
- can be killed/rescheduled freely
Stateful services:
- keep in-memory state (sessions, in-flight jobs, etc.)
- harder to scale and relocate
Top-tier engineers aggressively push state into databases, caches, and queues so services can stay stateless.
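A minimal sketch of the idea: the handler keeps no per-user state in process memory, so any instance can serve any request. A plain dict stands in for an external store such as Redis; the session layout is illustrative.

```python
# A plain dict stands in for an external session store (e.g., Redis).
external_store: dict[str, dict] = {}

def handle_request(session_id: str, item: str) -> int:
    """Stateless handler: session state lives in the store, not in
    this process, so any instance can serve any request."""
    session = external_store.setdefault(session_id, {"cart": []})
    session["cart"].append(item)
    # write back — a no-op for a local dict, but required for a real
    # remote store
    external_store[session_id] = session
    return len(session["cart"])
```

Because the process holds nothing between requests, instances can be killed, replaced, or scaled out freely.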
3. Read vs Write Scaling
Reads scale via:
- replicas
- caching
- CDNs
- denormalized views
Writes scale via:
- sharding
- queues
- batching
- careful schema design
4. Common Scaling Breakpoints
You typically need to change architecture when:
- Traffic overloads a single DB → introduce read replicas & caching
- DB writes saturate → shard or queue writes
- Single app instance can’t handle traffic → scale horizontally behind a load balancer
- Cron jobs take too long → move to distributed workers
- A “God Service” knows too much → split by bounded contexts
SECTION 3 — LOAD BALANCING
Load balancers are traffic routers and safety valves.
They:
- distribute traffic across instances
- enable zero-downtime deploys
- support health checks
- provide TLS termination
- can apply rate limits & routing rules
Types of load balancing:
- Round Robin
  Every new request goes to the next server.
- Least Connections
  Send traffic to the server with the fewest active connections.
- IP Hash
  Same client often goes to the same backend → useful for sticky scenarios.
- Weighted
  Some servers get more traffic (e.g., a new version with 10%, the old with 90%).
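The first two strategies can be sketched in a few lines; server names are illustrative, and a real balancer would also track connection completions and health:

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: cycle through the servers in order.
rr = itertools.cycle(servers)
def round_robin() -> str:
    return next(rr)

# Least connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}
def least_connections() -> str:
    target = min(active, key=active.get)
    active[target] += 1  # caller decrements when the request finishes
    return target
```

Round robin assumes requests are roughly equal in cost; least connections adapts when some requests are much slower than others.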
Global vs Local LB
- Global: across regions (e.g., Cloudflare, Route 53 routing)
- Local: inside a region/VPC (e.g., Nginx, ALB, Envoy)
SECTION 4 — OBSERVABILITY (LOGS, METRICS, TRACES)
Observability is:
“Can we understand what the system is doing just by looking from the outside?”
1. Logs
Text-based record of events.
Good logs:
- structured (JSON)
- include correlation/request IDs
- include context (user_id, application_id, etc.)
Bad logs:
- random text
- no structure
- missing context
- too noisy
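A minimal sketch of a structured log line with a correlation ID; the field names (`request_id`, `user_id`) are illustrative, not a fixed schema:

```python
import json
import sys
from datetime import datetime, timezone

def log_event(event: str, request_id: str, **context) -> str:
    """Emit one structured JSON log line per event, carrying a request ID
    so related entries can be correlated across services."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "request_id": request_id,
        **context,
    }
    line = json.dumps(entry)
    print(line, file=sys.stderr)  # collectors usually tail stdout/stderr
    return line

log_event("payment_failed", request_id="req-42", user_id=7, reason="card_declined")
```

Because every line is valid JSON with a shared `request_id`, a log pipeline can filter and join events across services instead of grepping free text.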
2. Metrics
Numeric time-series data:
- request_count
- error_rate
- p95_latency
- CPU, memory, disk, queue depth
Metrics answer:
- “Is the system healthy right now?”
- “Did something change at 10:05?”
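As a sketch of what p95_latency means: the nearest-rank percentile over raw samples. Real metrics systems aggregate into histograms rather than keeping every sample; the latency values here are made up.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p%
    of the samples at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 15]
print(percentile(latencies_ms, 95))  # → 900
```

Note how a couple of slow outliers dominate the p95 while leaving the median untouched; this is why tail latency metrics catch problems that averages hide.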
3. Traces
End-to-end view of a single request across multiple services.
Each “span” represents a segment of the journey:
- API gateway
- service A
- DB query
- call to service B
Traces answer:
- “Where exactly is the request slow?”
- “Which service is faulty in this chain?”
4. The Golden Signals (Google SRE)
For each service, monitor:
- Latency — how long requests take
- Traffic — how many requests
- Errors — how many fail
- Saturation — how close to capacity
These four alone can diagnose most production issues.
SECTION 5 — ALERTING & ON-CALL
Observability without alerting is passive.
Alerting must be:
- actionable
- not too noisy
- tied to user impact, not every metric blip
Bad alert:
- “CPU > 70%” (no context)
Good alerts:
- “p95 latency > 1.5s for 5 minutes on checkout service”
- “error rate > 2% on payment API”
Top engineers design alerts to catch real issues with real impact, not random noise.
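The "sustained over a window" part of a good alert can be sketched like this; the 2% threshold and 5-minute window are illustrative, not a recommendation:

```python
from collections import deque

WINDOW = 5          # number of recent 1-minute buckets to consider
THRESHOLD = 0.02    # fire when error rate exceeds 2% in every bucket

buckets: deque = deque(maxlen=WINDOW)

def record_minute(errors: int, requests: int) -> bool:
    """Append one minute's error rate; return True if the alert
    should fire (rate above threshold for the whole window)."""
    buckets.append(errors / requests if requests else 0.0)
    return len(buckets) == WINDOW and all(r > THRESHOLD for r in buckets)

for minute in range(5):
    firing = record_minute(errors=30, requests=1000)  # 3% errors each minute
print(firing)  # fires only once the condition has held for the full window
```

Requiring the condition to hold across the whole window is what filters out one-off metric blips.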
SECTION 6 — SLO, SLI, ERROR BUDGETS (GOOGLE SRE MODEL)
SLI — Service Level Indicator
A metric that represents user experience
(e.g., “percentage of requests under 500ms and 2xx/3xx”).
SLO — Service Level Objective
The target threshold for the SLI.
Example:
“99.9% of requests succeed in < 500ms over 30 days.”
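An SLI like the one above is just a ratio over raw request records; a minimal sketch with made-up (status, latency) pairs:

```python
# Hypothetical (status_code, latency_ms) records for one window.
requests = [
    (200, 120), (200, 480), (301, 90), (500, 50), (200, 650), (200, 210),
]

# A request is "good" if it succeeded (2xx/3xx) and finished under 500 ms.
good = sum(1 for status, latency_ms in requests
           if 200 <= status < 400 and latency_ms < 500)
sli = good / len(requests)
print(f"SLI: {sli:.1%}")  # → SLI: 66.7% (4 of 6 requests are good)
```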
SLA — Service Level Agreement
The contract with customers (with consequences).
Error Budget
If your SLO is 99.9%, your error budget is the remaining 0.1% of requests (or time) that may fail.
Error budget answers:
“How much failure is acceptable within this period?”
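Turning the 99.9%/30-day SLO above into a concrete budget is one line of arithmetic:

```python
SLO = 0.999
PERIOD_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = PERIOD_MINUTES * (1 - SLO)
print(f"Error budget: {budget_minutes:.1f} minutes of bad time per 30 days")
# → Error budget: 43.2 minutes of bad time per 30 days
```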
If you burn through the error budget:
- slow down deployments
- invest in reliability
- reduce risky changes
This drives responsible engineering.
SECTION 7 — DEPLOYMENT ARCHITECTURE
Modern systems usually follow some combination of:
- Docker containers
- Orchestrator (Kubernetes / ECS / Nomad)
- CI/CD pipeline
- Canary or blue/green deploys
Deployment Patterns:
- Rolling Deploy
  Replace instances gradually.
  Pros: simple.
  Cons: some users may see mixed versions.
- Blue/Green Deploy
  Blue = current version, Green = new version.
  Switch traffic when green is ready.
  Easy rollback: send traffic back to blue.
- Canary Deploy
  Send 1–10% of traffic to the new version.
  If stable → increase.
  If not → roll back quickly.
Top engineers design safe rollout strategies, not YOLO deploys.
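At its core, the canary pattern above is a weighted coin flip per request; the percentage and the policy for stepping it up are illustrative:

```python
import random

def pick_version(canary_pct: float) -> str:
    """Route one request: canary_pct% of traffic goes to the new version."""
    return "canary" if random.random() * 100 < canary_pct else "stable"

random.seed(0)  # deterministic for the demo
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(canary_pct=5)] += 1
print(counts)  # roughly 5% of traffic hits the canary
```

In practice the canary percentage is raised in steps (5% → 25% → 100%) only while error rates and latency on the canary stay within the SLO.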
SECTION 8 — RESILIENCE PATTERNS IN PRODUCTION
You’ve seen many of these already, but as a reliability checklist:
- Retries with exponential backoff
- Circuit breakers
- Timeouts
- Load shedding
- Throttling / rate limiting
- Graceful degradation (partial functionality instead of total failure)
- Fallback responses (cached / approximate)
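The first item on the checklist can be sketched as follows; `flaky` is a hypothetical stand-in for any transient-failing remote call, and the delays are shortened for the demo:

```python
import random
import time

def retry(op, max_attempts: int = 5, base: float = 0.1, cap: float = 2.0):
    """Call op(), retrying on failure with capped exponential backoff
    plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep a random amount up to the exponential bound
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = retry(flaky, base=0.001)
```

The jitter matters: without it, many clients retrying in lockstep can hammer a recovering service at the same instant (a "thundering herd").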
Examples of graceful degradation:
- If the recommendation service fails → show popular items instead of an error
- If the analytics backend is down → buffer events instead of blocking user actions
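The recommendation fallback can be sketched like this; the recommender and item names are hypothetical:

```python
POPULAR_ITEMS = ["widget-1", "widget-2", "widget-3"]  # static fallback list

def recommendations_for(user_id: int, recommender) -> list:
    """Try personalized recommendations; degrade to popular items on failure."""
    try:
        return recommender(user_id)
    except Exception:
        return POPULAR_ITEMS  # degraded, but the page still renders

def broken_recommender(user_id: int) -> list:
    raise TimeoutError("recommendation service unavailable")

print(recommendations_for(7, broken_recommender))
```

The user sees a slightly worse page instead of an error, which is the whole point of graceful degradation.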
SECTION 9 — CHAOS ENGINEERING (THE HARDCORE STUFF)
Chaos engineering is:
“Intentionally breaking things in production-like environments to ensure your system is resilient.”
Examples:
- kill random instances
- inject latency
- drop packets
- simulate region failures
Used by:
- Netflix (Chaos Monkey, Chaos Gorilla)
- Amazon
- many large-scale players
Purpose:
- discover weaknesses before real incidents
- build muscle memory for incident response
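Latency injection, for instance, can be as simple as a wrapper around a call, the kind of fault a chaos tool would introduce in a test environment; the probabilities and delays here are arbitrary:

```python
import random
import time

def with_chaos(fn, latency_prob: float = 0.2, max_delay: float = 0.05):
    """Wrap fn so that a fraction of calls get a random extra delay."""
    def wrapped(*args, **kwargs):
        if random.random() < latency_prob:
            time.sleep(random.uniform(0, max_delay))  # injected latency
        return fn(*args, **kwargs)
    return wrapped

slow_sometimes = with_chaos(lambda x: x * 2)
```

Running load tests against chaos-wrapped dependencies reveals whether timeouts, retries, and fallbacks actually hold up.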
SECTION 10 — INCIDENT RESPONSE & POSTMORTEMS
High-performing orgs treat incidents as learning opportunities, not blame games.
During incident:
- declare incident
- assign Incident Commander
- assign communication owner
- keep timeline/logs
- focus on mitigation first, root cause later
After incident:
- write postmortem (blameless)
- capture:
  - what happened
  - impact
  - detection
  - timeline
  - root causes
  - contributing factors
  - what worked / didn’t
  - action items
This is how engineering organizations get stronger over time.
SECTION 11 — PUTTING IT ALL TOGETHER (YOUR MENTAL MODEL)
When you design or evaluate a system now, your brain should think like this:
- What are the requirements & constraints?
- What are the hot paths?
- What is the data model & invariants?
- What is the simplest architecture that satisfies scale?
- Where are the boundaries? (services, domains, APIs)
- How do we handle failures?
- How do we observe behavior? (logs, metrics, traces)
- What are the SLOs and error budgets?
- How will it evolve in 1–3 years?
- How can we simplify while keeping power?
That is a Staff+ systems mindset.
SECTION 12 — PART III SUMMARY
You now have a full mental toolbox for system design:
- Data modeling, indexing, sharding, consistency
- Caching, CDNs, queues, streams, backpressure
- API contracts, idempotency, auth, versioning
- Distributed systems: consensus, replication, sagas, CRDTs
- High availability, scaling, load balancing
- Observability, SLO/SLI, reliability, incident response
This is the core of being a top-1% systems engineer.