High Availability, Scaling & Reliability Engineering
SECTION 1 — HIGH AVAILABILITY (HA) MINDSET
High availability doesn’t mean “system never goes down.”
It means:
“The system continues to operate well enough even when parts of it are broken.”
Availability is usually expressed as “nines”:
- 99% → ~3.65 days/year downtime
- 99.9% → ~8.76 hours/year
- 99.99% → ~52.6 minutes/year
- 99.999% → ~5.26 minutes/year
Higher nines = exponentially more cost + complexity.
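The downtime numbers above fall straight out of the arithmetic; a minimal sketch:

```python
# Rough downtime-per-year allowed by each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes(nines):.2f} min/year")
```

Each extra nine cuts the allowed downtime by 10x, which is why each one costs disproportionately more to achieve.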
Key HA principles:
- Remove single points of failure (SPOFs)
- Use redundancy (instances, zones, regions)
- Design for failure (degraded modes)
- Detect problems fast (observability)
- Recover automatically (self-healing patterns)
SECTION 2 — SCALING PATTERNS
1. Vertical vs Horizontal Scaling
Vertical (scale up)
- More CPU, more RAM on a single machine
- Simple, but limited ceiling
Horizontal (scale out)
- More machines
- Needs load balancing, statelessness, coordination
- Harder, but scales much further
Real-world systems almost always rely on horizontal scaling.
2. Stateless vs Stateful
Stateless services:
- no user-specific data stored in memory between requests
- easy to scale horizontally
- can be killed/rescheduled freely
Stateful services:
- keep in-memory state (sessions, in-flight jobs, etc.)
- harder to scale and relocate
Top-tier engineers aggressively push state into databases, caches, and queues so services can stay stateless.
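A minimal sketch of the idea: the handler keeps no per-user state in process memory, so any instance can serve any request. A plain dict stands in for an external store such as Redis; the session layout is illustrative.

```python
# A plain dict stands in for an external session store (e.g., Redis).
external_store: dict[str, dict] = {}

def handle_request(session_id: str, item: str) -> int:
    """Stateless handler: session state lives in the store, not in
    this process, so any instance can serve any request."""
    session = external_store.setdefault(session_id, {"cart": []})
    session["cart"].append(item)
    # write back — a no-op for a local dict, but required for a real
    # remote store
    external_store[session_id] = session
    return len(session["cart"])
```

Because the process holds nothing between requests, instances can be killed, replaced, or scaled out freely.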
3. Read vs Write Scaling
Reads scale via:
- replicas
- caching
- CDNs
- denormalized views
Writes scale via:
- sharding
- queues
- batching
- careful schema design
4. Common Scaling Breakpoints
You typically need to change architecture when:
- Traffic overloads a single DB → introduce read replicas & caching
- DB writes saturate → shard or queue writes
- Single app instance can’t handle traffic → scale horizontally behind a load balancer
- Cron jobs take too long → move to distributed workers
- A “God Service” knows too much → split by bounded contexts
SECTION 3 — LOAD BALANCING
Load balancers are traffic routers and safety valves.
They:
- distribute traffic across instances
- enable zero-downtime deploys
- support health checks
- provide TLS termination
- can apply rate limits & routing rules
Types of load balancing:
- Round Robin
  Every new request goes to the next server.
- Least Connections
  Send traffic to the server with the fewest active connections.
- IP Hash
  Same client often goes to the same backend → useful for sticky scenarios.
- Weighted
  Some servers get more traffic (e.g., a new version with 10%, the old with 90%).
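The first two strategies can be sketched in a few lines; server names are illustrative, and a real balancer would also track connection completions and health:

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: cycle through the servers in order.
rr = itertools.cycle(servers)
def round_robin() -> str:
    return next(rr)

# Least connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}
def least_connections() -> str:
    target = min(active, key=active.get)
    active[target] += 1  # caller decrements when the request finishes
    return target
```

Round robin assumes requests are roughly equal in cost; least connections adapts when some requests are much slower than others.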
Global vs Local LB
- Global: across regions (e.g., Cloudflare, Route 53 routing)
- Local: inside a region/VPC (e.g., Nginx, ALB, Envoy)
SECTION 4 — OBSERVABILITY (LOGS, METRICS, TRACES)
Observability is:
“Can we understand what the system is doing just by looking from the outside?”
1. Logs
Text-based record of events.
Good logs:
- structured (JSON)
- include correlation/request IDs
- include context (user_id, application_id, etc.)
Bad logs:
- random text
- no structure
- missing context
- too noisy
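A minimal sketch of a structured log line with a correlation ID; the field names (`request_id`, `user_id`) are illustrative, not a fixed schema:

```python
import json
import sys
from datetime import datetime, timezone

def log_event(event: str, request_id: str, **context) -> str:
    """Emit one structured JSON log line per event, carrying a request ID
    so related entries can be correlated across services."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "request_id": request_id,
        **context,
    }
    line = json.dumps(entry)
    print(line, file=sys.stderr)  # collectors usually tail stdout/stderr
    return line

log_event("payment_failed", request_id="req-42", user_id=7, reason="card_declined")
```

Because every line is valid JSON with a shared `request_id`, a log pipeline can filter and join events across services instead of grepping free text.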
2. Metrics
Numeric time-series data:
- request_count
- error_rate
- p95_latency
- CPU, memory, disk, queue depth
Metrics answer:
- “Is the system healthy right now?”
- “Did something change at 10:05?”
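As a sketch of what p95_latency means: the nearest-rank percentile over raw samples. Real metrics systems aggregate into histograms rather than keeping every sample; the latency values here are made up.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p%
    of the samples at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 15]
print(percentile(latencies_ms, 95))  # → 900
```

Note how a couple of slow outliers dominate the p95 while leaving the median untouched; this is why tail latency metrics catch problems that averages hide.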
3. Traces
End-to-end view of a single request across multiple services.
Each “span” represents a segment of the journey:
- API gateway
- service A
- DB query
- call to service B
Traces answer:
- “Where exactly is the request slow?”
- “Which service is faulty in this chain?”
4. The Golden Signals (Google SRE)
For each service, monitor:
- Latency — how long requests take
- Traffic — how many requests
- Errors — how many fail
- Saturation — how close to capacity
These four alone can diagnose most production issues.
SECTION 5 — ALERTING & ON-CALL
Observability without alerting is passive.
Alerting must be:
- actionable
- not too noisy
- tied to user impact, not every metric blip
Bad alert:
- “CPU > 70%” (no context)
Good alerts:
- “p95 latency > 1.5s for 5 minutes on checkout service”
- “error rate > 2% on payment API”
Top engineers design alerts to catch real issues with real impact, not random noise.
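The "sustained over a window" part of a good alert can be sketched like this; the 2% threshold and 5-minute window are illustrative, not a recommendation:

```python
from collections import deque

WINDOW = 5          # number of recent 1-minute buckets to consider
THRESHOLD = 0.02    # fire when error rate exceeds 2% in every bucket

buckets: deque = deque(maxlen=WINDOW)

def record_minute(errors: int, requests: int) -> bool:
    """Append one minute's error rate; return True if the alert
    should fire (rate above threshold for the whole window)."""
    buckets.append(errors / requests if requests else 0.0)
    return len(buckets) == WINDOW and all(r > THRESHOLD for r in buckets)

for minute in range(5):
    firing = record_minute(errors=30, requests=1000)  # 3% errors each minute
print(firing)  # fires only once the condition has held for the full window
```

Requiring the condition to hold across the whole window is what filters out one-off metric blips.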
SECTION 6 — SLO, SLI, ERROR BUDGETS (GOOGLE SRE MODEL)
SLI — Service Level Indicator
A metric that represents user experience
(e.g., “percentage of requests under 500ms and 2xx/3xx”).
SLO — Service Level Objective
The target threshold for the SLI.
Example:
“99.9% of requests succeed in < 500ms over 30 days.”
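An SLI like the one above is just a ratio over raw request records; a minimal sketch with made-up (status, latency) pairs:

```python
# Hypothetical (status_code, latency_ms) records for one window.
requests = [
    (200, 120), (200, 480), (301, 90), (500, 50), (200, 650), (200, 210),
]

# A request is "good" if it succeeded (2xx/3xx) and finished under 500 ms.
good = sum(1 for status, latency_ms in requests
           if 200 <= status < 400 and latency_ms < 500)
sli = good / len(requests)
print(f"SLI: {sli:.1%}")  # → SLI: 66.7% (4 of 6 requests are good)
```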
SLA — Service Level Agreement
The contract with customers (with consequences).
Error Budget
If your SLO is 99.9%, your error budget is the remaining 0.1% of requests (or time) that may fail.
Error budget answers:
“How much failure is acceptable within this period?”
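Turning the 99.9%/30-day SLO above into a concrete budget is one line of arithmetic:

```python
SLO = 0.999
PERIOD_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = PERIOD_MINUTES * (1 - SLO)
print(f"Error budget: {budget_minutes:.1f} minutes of bad time per 30 days")
# → Error budget: 43.2 minutes of bad time per 30 days
```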
If you burn through the error budget:
- slow down deployments
- invest in reliability
- reduce risky changes
This drives responsible engineering.
SECTION 7 — DEPLOYMENT ARCHITECTURE
Modern systems usually follow some combination of:
- Docker containers
- Orchestrator (Kubernetes / ECS / Nomad)
- CI/CD pipeline
- Canary or blue/green deploys
Deployment Patterns:
- Rolling Deploy
  Replace instances gradually.
  Pros: simple.
  Cons: some users may see mixed versions.
- Blue/Green Deploy
  Blue = current version, Green = new version.
  Switch traffic when green is ready.
  Easy rollback: send traffic back to blue.
- Canary Deploy
  Send 1–10% of traffic to the new version.
  If stable → increase.
  If not → roll back quickly.
Top engineers design safe rollout strategies, not YOLO deploys.
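At its core, the canary pattern above is a weighted coin flip per request; the percentage and the policy for stepping it up are illustrative:

```python
import random

def pick_version(canary_pct: float) -> str:
    """Route one request: canary_pct% of traffic goes to the new version."""
    return "canary" if random.random() * 100 < canary_pct else "stable"

random.seed(0)  # deterministic for the demo
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(canary_pct=5)] += 1
print(counts)  # roughly 5% of traffic hits the canary
```

In practice the canary percentage is raised in steps (5% → 25% → 100%) only while error rates and latency on the canary stay within the SLO.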
SECTION 8 — RESILIENCE PATTERNS IN PRODUCTION
You’ve seen many of these already, but as a reliability checklist:
- Retries with exponential backoff
- Circuit breakers
- Timeouts
- Load shedding
- Throttling / rate limiting
- Graceful degradation (partial functionality instead of total failure)
- Fallback responses (cached / approximate)
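The first item on the checklist can be sketched as follows; `flaky` is a hypothetical stand-in for any transient-failing remote call, and the delays are shortened for the demo:

```python
import random
import time

def retry(op, max_attempts: int = 5, base: float = 0.1, cap: float = 2.0):
    """Call op(), retrying on failure with capped exponential backoff
    plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep a random amount up to the exponential bound
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = retry(flaky, base=0.001)
```

The jitter matters: without it, many clients retrying in lockstep can hammer a recovering service at the same instant (a "thundering herd").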
Examples of graceful degradation:
- If the recommendation service fails → show popular items instead of an error
- If the analytics backend is down → buffer events instead of blocking user actions
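The recommendation fallback can be sketched like this; the recommender and item names are hypothetical:

```python
POPULAR_ITEMS = ["widget-1", "widget-2", "widget-3"]  # static fallback list

def recommendations_for(user_id: int, recommender) -> list:
    """Try personalized recommendations; degrade to popular items on failure."""
    try:
        return recommender(user_id)
    except Exception:
        return POPULAR_ITEMS  # degraded, but the page still renders

def broken_recommender(user_id: int) -> list:
    raise TimeoutError("recommendation service unavailable")

print(recommendations_for(7, broken_recommender))
```

The user sees a slightly worse page instead of an error, which is the whole point of graceful degradation.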
SECTION 9 — CHAOS ENGINEERING (THE HARDCORE STUFF)
Chaos engineering is:
“Intentionally breaking things in production-like environments to ensure your system is resilient.”
Examples:
- kill random instances
- inject latency
- drop packets
- simulate region failures
Used by:
- Netflix (Chaos Monkey, Chaos Gorilla)
- Amazon
- many large-scale players
Purpose:
- discover weaknesses before real incidents
- build muscle memory for incident response
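Latency injection, for instance, can be as simple as a wrapper around a call, the kind of fault a chaos tool would introduce in a test environment; the probabilities and delays here are arbitrary:

```python
import random
import time

def with_chaos(fn, latency_prob: float = 0.2, max_delay: float = 0.05):
    """Wrap fn so that a fraction of calls get a random extra delay."""
    def wrapped(*args, **kwargs):
        if random.random() < latency_prob:
            time.sleep(random.uniform(0, max_delay))  # injected latency
        return fn(*args, **kwargs)
    return wrapped

slow_sometimes = with_chaos(lambda x: x * 2)
```

Running load tests against chaos-wrapped dependencies reveals whether timeouts, retries, and fallbacks actually hold up.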
SECTION 10 — INCIDENT RESPONSE & POSTMORTEMS
High-performing orgs treat incidents as learning opportunities, not blame games.
During incident:
- declare incident
- assign Incident Commander
- assign communication owner
- keep timeline/logs
- focus on mitigation first, root cause later
After incident:
- write postmortem (blameless)
- capture:
  - what happened
  - impact
  - detection
  - timeline
  - root causes
  - contributing factors
  - what worked / didn’t
  - action items
This is how engineering organizations get stronger over time.
SECTION 11 — PUTTING IT ALL TOGETHER (YOUR MENTAL MODEL)
When you design or evaluate a system now, your brain should think like this:
- What are the requirements & constraints?
- What are the hot paths?
- What is the data model & invariants?
- What is the simplest architecture that satisfies scale?
- Where are the boundaries? (services, domains, APIs)
- How do we handle failures?
- How do we observe behavior? (logs, metrics, traces)
- What are the SLOs and error budgets?
- How will it evolve in 1–3 years?
- How can we simplify while keeping power?
That is a Staff+ systems mindset.
SECTION 12 — PART III SUMMARY
You now have a full mental toolbox for system design:
- Data modeling, indexing, sharding, consistency
- Caching, CDNs, queues, streams, backpressure
- API contracts, idempotency, auth, versioning
- Distributed systems: consensus, replication, sagas, CRDTs
- High availability, scaling, load balancing
- Observability, SLO/SLI, reliability, incident response
This is the core of being a top-1% systems engineer.