High Availability, Scaling & Reliability Engineering

SECTION 1 — HIGH AVAILABILITY (HA) MINDSET

High availability doesn’t mean “system never goes down.”

It means:

“The system continues to operate well enough even when parts of it are broken.”

Availability is usually expressed as “nines”:

  • 99% → ~3.65 days/year downtime

  • 99.9% → ~8.76 hours/year

  • 99.99% → ~52.6 minutes/year

  • 99.999% → ~5.26 minutes/year

Higher nines = exponentially more cost + complexity.
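The arithmetic behind the "nines" table is simple to verify. A minimal sketch (the function name is illustrative):

```python
# Downtime allowed per year for a given availability target.
def downtime_per_year(availability_pct):
    minutes_in_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_in_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines):.1f} minutes/year of downtime")
```

Note how each extra nine divides the allowed downtime by ten, which is why each one costs so much more to achieve.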

Key HA principles:

  • Remove single points of failure (SPOFs)

  • Use redundancy (instances, zones, regions)

  • Design for failure (degraded modes)

  • Detect problems fast (observability)

  • Recover automatically (self-healing patterns)


SECTION 2 — SCALING PATTERNS

1. Vertical vs Horizontal Scaling

Vertical (scale up)

  • More CPU, more RAM on single machine

  • Simple, but limited ceiling

Horizontal (scale out)

  • More machines

  • Needs load balancing, statelessness, coordination

  • Harder, but scales much further

Real-world systems at scale almost always rely on horizontal scaling.


2. Stateless vs Stateful

Stateless services:

  • no user-specific data stored in memory between requests

  • easy to scale horizontally

  • can be killed/rescheduled freely

Stateful services:

  • keep in-memory state (sessions, in-flight jobs, etc.)

  • harder to scale and relocate

Top-tier engineers aggressively push state out to databases, caches, and queues so that services can stay stateless.
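A minimal sketch of that idea, with an in-process dict standing in for a shared store such as Redis (all names are illustrative):

```python
# Sketch: externalizing session state so any instance can serve any request.
# SESSION_STORE stands in for a shared store (e.g., Redis); this dict is a stand-in.
SESSION_STORE = {}

def handle_request(session_id, action):
    # Load state from the shared store, not from process memory,
    # so this handler can run on any instance (or be killed/rescheduled freely).
    session = SESSION_STORE.get(session_id, {"cart": []})
    if action == "add_item":
        session["cart"].append("item")
    SESSION_STORE[session_id] = session  # write back after each request
    return session

handle_request("u1", "add_item")
handle_request("u1", "add_item")
print(SESSION_STORE["u1"]["cart"])  # two items, regardless of which instance ran
```

Because the handler holds no state between requests, any copy of it behind a load balancer is interchangeable.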


3. Read vs Write Scaling

  • Reads scale via:

    • replicas

    • caching

    • CDNs

    • denormalized views

  • Writes scale via:

    • sharding

    • queues

    • batching

    • careful schema design


4. Common Scaling Breakpoints

You typically need to change architecture when:

  1. Traffic overloads a single DB → introduce read replicas & caching

  2. DB writes saturate → shard or queue writes

  3. Single app instance can’t handle traffic → scale horizontally behind a load balancer

  4. Cron jobs take too long → move to distributed workers

  5. A “God Service” knows too much → split by bounded contexts


SECTION 3 — LOAD BALANCING

Load balancers are traffic routers and safety valves.

They:

  • distribute traffic across instances

  • enable zero-downtime deploys

  • support health checks

  • provide TLS termination

  • can apply rate limits & routing rules

Types of load balancing:

  1. Round Robin

    Each new request goes to the next server in rotation.

  2. Least Connections

    Send traffic to server with fewest active connections.

  3. IP Hash

    Same client often goes to same backend → useful for sticky scenarios.

  4. Weighted

    Some backends receive a larger share of traffic (e.g., new version gets 10%, old version gets 90%).
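The three selection strategies above can each be sketched in a few lines (server names and connection counts are made up for illustration):

```python
import itertools
import random

servers = ["a", "b", "c"]

# 1. Round robin: cycle through backends in order.
rr = itertools.cycle(servers)
def round_robin():
    return next(rr)

# 2. Least connections: pick the backend with the fewest active connections.
active = {"a": 4, "b": 1, "c": 3}
def least_connections():
    return min(active, key=active.get)

# 4. Weighted: 90% of traffic to the old version, 10% to the new one.
def weighted():
    return random.choices(["old", "new"], weights=[90, 10])[0]

print([round_robin() for _ in range(4)])  # ['a', 'b', 'c', 'a']
print(least_connections())                # 'b'
```

IP hash is the same idea as weighted/round-robin selection, except the choice is a deterministic function of the client address, which is what makes it sticky.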

Global vs Local LB

  • Global: across regions (e.g., Cloudflare, Route 53 routing)

  • Local: inside a region/VPC (e.g., Nginx, ALB, Envoy)


SECTION 4 — OBSERVABILITY (LOGS, METRICS, TRACES)

Observability is:

“Can we understand what the system is doing just by looking from the outside?”

1. Logs

Text-based record of events.

Good logs:

  • structured (JSON)

  • include correlation/request IDs

  • include context (user_id, application_id, etc.)

Bad logs:

  • random text

  • no structure

  • missing context

  • too noisy
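The contrast between the two can be shown side by side; a sketch using the standard `logging` and `json` modules (field names like `user_id` and `order_id` are illustrative):

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

# Bad: unstructured, no IDs, no context -- impossible to search or correlate.
log.info("payment failed!!")

# Good: structured JSON with a correlation ID and context fields.
def log_event(event, **context):
    record = {"event": event, "request_id": str(uuid.uuid4()), **context}
    log.info(json.dumps(record))
    return record

log_event("payment_failed", user_id=42, order_id="ord_123", reason="card_declined")
```

The structured version can be indexed by a log pipeline and joined on `request_id` across services; the unstructured one cannot.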


2. Metrics

Numeric time-series data:

  • request_count

  • error_rate

  • p95_latency

  • CPU, memory, disk, queue depth

Metrics answer:

  • “Is the system healthy right now?”

  • “Did something change at 10:05?”


3. Traces

End-to-end view of a single request across multiple services.

Each “span” represents a segment of the journey:

  • API gateway

  • service A

  • DB query

  • call to service B

Traces answer:

  • “Where exactly is the request slow?”

  • “Which service is faulty in this chain?”


4. The Golden Signals (Google SRE)

For each service, monitor:

  1. Latency — how long requests take

  2. Traffic — how many requests

  3. Errors — how many fail

  4. Saturation — how close to capacity

These 4 alone can diagnose most production issues.


SECTION 5 — ALERTING & ON-CALL

Observability without alerting is passive.

Alerting must be:

  • actionable

  • not too noisy

  • tied to user impact, not every metric blip

Bad alert:

  • “CPU > 70%” (no context)

Good alert:

  • “p95 latency > 1.5s for 5 minutes on checkout service”

  • “error rate > 2% on payment API”

Top engineers design alerts to catch real issues with real impact, not random noise.
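The difference between the bad and good alerts above is the sustained window: fire only when the condition holds for the whole period, not on a single blip. A minimal sketch (threshold and window values match the example; the service name is illustrative):

```python
from collections import deque

WINDOW = 5          # consecutive 1-minute buckets
THRESHOLD_S = 1.5   # p95 latency threshold in seconds

recent_p95 = deque(maxlen=WINDOW)

def observe(p95_latency_s):
    recent_p95.append(p95_latency_s)
    # Fire only if the entire window is above threshold -- one blip is ignored.
    if len(recent_p95) == WINDOW and all(v > THRESHOLD_S for v in recent_p95):
        return "ALERT: p95 latency > 1.5s for 5 minutes on checkout service"
    return None

for minute, p95 in enumerate([1.6, 1.7, 1.2, 1.8, 1.9, 2.0, 1.7, 1.6]):
    print(minute, observe(p95))
```

The single dip at minute 2 keeps the alert quiet until five consecutive bad minutes have actually occurred.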


SECTION 6 — SLO, SLI, ERROR BUDGETS (GOOGLE SRE MODEL)

SLI — Service Level Indicator

A metric that represents user experience

(e.g., “percentage of requests that complete in under 500ms with a 2xx/3xx status”).

SLO — Service Level Objective

The target threshold for the SLI.

Example:

“99.9% of requests succeed in < 500ms over 30 days.”

SLA — Service Level Agreement

The contract with customers (with consequences).


Error Budget

If your SLO is 99.9%, your error budget is the 0.1% of requests (or time) that is allowed to fail.

Error budget answers:

“How much failure is acceptable within this period?”

If you burn through error budget:

  • slow down deployments

  • invest in reliability

  • reduce risky changes

This drives responsible engineering.
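The error-budget arithmetic for the 99.9% example is worth making concrete (the request volume is an illustrative assumption):

```python
# Error budget for a 99.9% SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in the window
budget_minutes = window_minutes * (1 - slo)
print(f"Allowed downtime: {budget_minutes:.1f} minutes")   # ~43.2 minutes

# Same budget expressed in requests, assuming 10M requests in the window.
requests = 10_000_000
budget_requests = requests * (1 - slo)
print(f"Allowed failed requests: {budget_requests:.0f}")   # ~10,000
```

Once those ~43 minutes (or ~10,000 failures) are spent, the SLO is breached for the window, which is the signal to slow deployments and invest in reliability.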


SECTION 7 — DEPLOYMENT ARCHITECTURE

Modern systems usually follow some combination of:

  • Docker containers

  • Orchestrator (Kubernetes / ECS / Nomad)

  • CI/CD pipeline

  • Canary or blue/green deploys

Deployment Patterns:

  1. Rolling Deploy

    Replace instances gradually.

    Pros: simple.

    Cons: some users may see mixed versions.

  2. Blue/Green Deploy

    Blue = current version

    Green = new version

    Switch traffic when green is ready.

    Easy rollback: send traffic back to blue.

  3. Canary Deploy

    Send 1–10% traffic to new version.

    If stable → increase.

    If not → rollback quickly.

Top engineers design safe rollout strategies, not YOLO deploys.
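A canary rollout can be sketched as a simple control loop: route a small slice of traffic to the new version, and grow the slice only while its error rate stays acceptable. The percentages, error threshold, and doubling policy below are illustrative assumptions, not a standard:

```python
import random

# Route a request to "canary" with probability canary_pct, else "stable".
def route(canary_pct):
    return "canary" if random.random() < canary_pct / 100 else "stable"

# Decide the next traffic share from the canary's observed error rate.
def next_step(canary_pct, error_rate, max_error=0.02):
    if error_rate > max_error:
        return 0                      # rollback: all traffic back to stable
    return min(100, canary_pct * 2)   # healthy: double the canary's share

pct = 5
for observed_error in [0.004, 0.006, 0.005, 0.01]:
    pct = next_step(pct, observed_error)
    print(f"canary now receives {pct}% of traffic")
```

The key property is that rollback is a routing change, not a redeploy, so a bad version is withdrawn in seconds.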


SECTION 8 — RESILIENCE PATTERNS IN PRODUCTION

You’ve seen many of these already, but as a reliability checklist:

  • Retries with exponential backoff

  • Circuit breakers

  • Timeouts

  • Load shedding

  • Throttling / rate limiting

  • Graceful degradation (partial functionality instead of total failure)

  • Fallback responses (cached / approximate)

Example of graceful degradation:

  • If recommendation service fails → show popular items instead of error

  • If analytics backend is down → buffer instead of blocking user actions
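The first item on the checklist, retries with exponential backoff, is worth one concrete sketch, here with full jitter so retrying clients don't all hit the dependency at once (`flaky_call` is a made-up stand-in for any remote call that fails transiently):

```python
import random
import time

# Retry with exponential backoff and full jitter.
def retry(fn, max_attempts=5, base_delay=0.01, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep a random amount up to the exponential backoff.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)

calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky_call))  # succeeds on the third attempt
```

Note that retries must always be paired with timeouts and a cap: unbounded retries against a struggling dependency are themselves a source of overload.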


SECTION 9 — CHAOS ENGINEERING (THE HARDCORE STUFF)

Chaos engineering is:

“Intentionally breaking things in production-like environments to ensure your system is resilient.”

Examples:

  • kill random instances

  • inject latency

  • drop packets

  • simulate region failures

Used by:

  • Netflix (Chaos Monkey, Chaos Gorilla)

  • Amazon

  • many large-scale players

Purpose:

  • discover weaknesses before real incidents

  • build muscle memory for incident response


SECTION 10 — INCIDENT RESPONSE & POSTMORTEMS

High-performing orgs treat incidents as learning opportunities, not blame games.

During incident:

  • declare incident

  • assign Incident Commander

  • assign communication owner

  • keep timeline/logs

  • focus on mitigation first, root cause later

After incident:

  • write postmortem (blameless)

  • capture:

    • what happened

    • impact

    • detection

    • timeline

    • root causes

    • contributing factors

    • what worked / didn’t

    • action items

This is how engineering organizations get stronger over time.


SECTION 11 — PUTTING IT ALL TOGETHER (YOUR MENTAL MODEL)

When you design or evaluate a system now, your brain should think like this:

  1. What are the requirements & constraints?

  2. What are the hot paths?

  3. What is the data model & invariants?

  4. What is the simplest architecture that satisfies scale?

  5. Where are the boundaries? (services, domains, APIs)

  6. How do we handle failures?

  7. How do we observe behavior? (logs, metrics, traces)

  8. What are SLOs and error budgets?

  9. How will it evolve in 1–3 years?

  10. How can we simplify while keeping power?

That is a Staff+ systems mindset.


SECTION 12 — PART III SUMMARY

You now have a full mental toolbox for system design:

  • Data modeling, indexing, sharding, consistency

  • Caching, CDNs, queues, streams, backpressure

  • API contracts, idempotency, auth, versioning

  • Distributed systems: consensus, replication, sagas, CRDTs

  • High availability, scaling, load balancing

  • Observability, SLO/SLI, reliability, incident response

This is the core of being a top-1% systems engineer.