RELIABILITY IS DESIGNING FOR NORMAL FAILURE
Production is partial failure:
- timeouts
- dependency brownouts
- retries
- backpressure
Senior fullstack means the UI and backend cooperate to fail safely.
The cost of unreliability:
- User trust — One bad outage erases months of goodwill. Users remember.
- SLA breaches — Downtime credits, contractual penalties, enterprise churn.
- Revenue — Every second of outage costs money. E‑commerce, ads, SaaS all bleed.
- On-call burnout — Unreliable systems mean 3 a.m. pages. Engineers quit.
Quantify it:
- Track: MTTR (mean time to recover), error budget consumption, incident frequency.
- Goal: reduce surprise. Predictable failure modes beat rare catastrophic ones.
- Error budget: if the SLA is 99.9%, you have a 0.1% failure budget per month (roughly 43 minutes). Spend it on deployments and experiments.
Senior rule:
Reliability is not a feature. It is table stakes. Design for normal failure, or failure will design you.
TIMEOUTS + RETRIES + BACKOFF (THE TRINITY)
Rules:
- always set timeouts
- retry only idempotent operations (or make them idempotent)
- exponential backoff + jitter
Avoid:
- synchronized retries (retry storm)
Exponential backoff with jitter (Python sketch):
```python
import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY = 0.1      # 100 ms
MAX_DELAY = 30.0      # 30 s

def call_with_retry():
    # call_dependency() and RetryableError are placeholders from the text above.
    attempt = 0
    while True:
        try:
            return call_dependency()
        except RetryableError:
            attempt += 1
            if attempt >= MAX_ATTEMPTS:
                raise
            delay = min(BASE_DELAY * 2 ** attempt, MAX_DELAY)
            jitter = random.uniform(0, delay * 0.2)   # 20% jitter
            time.sleep(delay + jitter)
```
Retry policies by operation type:
| Operation Type | Retry? | Max Attempts | Backoff |
|---|---|---|---|
| Read (GET) | Yes | 3–5 | Exponential + jitter |
| Write (POST) | Only if idempotent | 2–3 | Exponential + jitter |
| Payment/Charge | No (or with idempotency key) | 1 | N/A |
| Search/Query | Yes | 3–5 | Exponential + jitter |
| Webhook outbound | Yes | 3–5 | Exponential + jitter |
Retryable vs non-retryable errors:
- Retry: 5xx, timeout, connection refused, 429 (with backoff).
- Do not retry: 4xx (except 429), validation errors, auth failures.
Make writes idempotent for retries:
- Idempotency key: client sends `Idempotency-Key: <uuid>` with the POST.
- Server: first request with key → process and cache result; duplicate → return cached result.
- Required for: payments, order creation, webhook delivery.
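A minimal server-side sketch of that flow, assuming a single process and an in-memory store; a real service would use a shared store (Redis or a DB table) with a TTL so retries that land on another instance still deduplicate. `process_payment` is a stand-in for the actual write:
```python
# In-memory sketch only; production needs a shared store with a TTL.
# process_payment() is a hypothetical handler for the underlying write.
_idempotency_cache = {}  # idempotency key -> cached response

def handle_post(idempotency_key: str, payload: dict) -> dict:
    if idempotency_key in _idempotency_cache:
        # Duplicate (e.g., a client retry): return the original result, don't re-process.
        return _idempotency_cache[idempotency_key]
    result = process_payment(payload)
    _idempotency_cache[idempotency_key] = result
    return result
```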
Retry budgets:
- Cap retries as a fraction of normal traffic. Example: at most 20% of requests may be retries.
- Prevents a failing dependency from amplifying load via retries.
- When budget exhausted: fail fast, surface error, stop hammering.
- Implementation: track retry count per time window; when budget exceeded, reject new retries until window resets.
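One way to enforce it, sketched for a single-process client: count requests and retries in a rolling window and refuse new retries once they exceed a configured fraction of traffic (20% here, matching the example above):
```python
import time
from collections import deque

class RetryBudget:
    """Allow retries only while they stay under `ratio` of recent requests."""
    def __init__(self, ratio=0.2, window_seconds=60):
        self.ratio = ratio
        self.window = window_seconds
        self.requests = deque()  # timestamps of all requests
        self.retries = deque()   # timestamps of retries

    def _trim(self, q, now):
        while q and now - q[0] > self.window:
            q.popleft()

    def record_request(self):
        self.requests.append(time.monotonic())

    def can_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) >= self.ratio * max(len(self.requests), 1):
            return False          # budget exhausted: fail fast instead of retrying
        self.retries.append(now)
        return True
```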
Timeout guidelines:
- Set timeouts at every boundary: HTTP client, DB connection, queue receive.
- Chain rule: caller timeout > callee timeout. Otherwise the callee keeps working on a request the caller has already abandoned.
- Typical: 2–5s for fast APIs, 30s for heavy operations, 60s+ for batch jobs.
- Never use infinite timeouts. A hung dependency should eventually fail, not block forever.
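For example, with the Python `requests` library; the (connect, read) values are illustrative and should come from your latency budget, not from habit:
```python
import requests

# Never rely on the default: with no timeout, a hung dependency blocks forever.
# (connect timeout, read timeout) in seconds; illustrative values.
resp = requests.get("https://api.example.com/orders/42", timeout=(2, 5))
resp.raise_for_status()
```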
Senior rule:
Retries without backoff and jitter turn a single failure into a DDoS on your own dependency.
CIRCUIT BREAKERS + BULKHEADS
- circuit breaker stops calling a failing dependency
- bulkheads isolate resources so one failure doesn't sink the system
UI equivalent:
- degrade features instead of blocking the whole app
Circuit breaker state diagram:
┌─────────┐      failure threshold      ┌─────────┐
│ CLOSED  │ ──────────────────────────► │  OPEN   │
│ (normal)│                             │ (fail   │
└─────────┘                             │  fast)  │
     ▲                                  └────┬────┘
     │ test call succeeds                    │ open duration elapses
     │                                       │
┌────┴─────┐                                 │
│ HALF-OPEN│ ◄───────────────────────────────┘
│ (probe)  │     allow 1 test call
└──────────┘     (if the test call fails, go back to OPEN)
Concrete example — payment service circuit breaker:
- Closed: All payment requests go through.
- Open: After 5 failures in 10s, stop calling payment service. Return "Payment temporarily unavailable" immediately.
- Half-open: After 30s, allow 1 test request. Success → closed. Failure → open again.
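A minimal in-process sketch of that breaker; the thresholds mirror the example, and production code would add locking and limit concurrent half-open probes. `charge_card` in the usage comment is a placeholder for the real payment call:
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=10.0, open_duration=30.0):
        self.failure_threshold = failure_threshold   # failures within `window` to open
        self.window = window                         # seconds
        self.open_duration = open_duration           # seconds before half-open probe
        self.failures = []                           # timestamps of recent failures
        self.opened_at = None                        # None means closed

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.open_duration:
                raise RuntimeError("circuit open: failing fast")
            # Open duration elapsed: half-open, allow this call as the probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
                self.opened_at = now                 # open (or re-open after a failed probe)
            raise
        self.failures.clear()
        self.opened_at = None                        # success closes the breaker
        return result

# Usage (charge_card is a placeholder for the real payment client call):
# breaker = CircuitBreaker()
# breaker.call(charge_card, order_id)
```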
UI circuit breaker equivalent:
- If recommendations API fails 3 times, hide the recommendations block. Show "Suggestions temporarily unavailable."
- Don't block the whole page. Isolate the failing component.
Bulkhead pattern — pool isolation:
┌─────────────────────────────────────────────────────────┐
│                       APPLICATION                        │
├─────────────┬─────────────┬─────────────┬───────────────┤
│   Pool A    │   Pool B    │   Pool C    │    Pool D     │
│ (payments)  │  (search)   │  (notify)   │  (analytics)  │
│ max 10 conn │ max 20 conn │ max 5 conn  │  max 15 conn  │
└─────────────┴─────────────┴─────────────┴───────────────┘
- One saturated dependency (e.g., payments) cannot exhaust connections for others.
- Search, notify, analytics keep working when payments is slow or down.
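In-process, the same isolation can be approximated with a bounded semaphore per dependency, so one slow dependency cannot consume every worker. Pool sizes mirror the diagram and are illustrative:
```python
import threading

# One bounded pool of call slots per dependency (sizes from the diagram above).
bulkheads = {
    "payments":  threading.BoundedSemaphore(10),
    "search":    threading.BoundedSemaphore(20),
    "notify":    threading.BoundedSemaphore(5),
    "analytics": threading.BoundedSemaphore(15),
}

def call_with_bulkhead(dependency: str, fn, *args, **kwargs):
    sem = bulkheads[dependency]
    # Fail fast instead of queueing when this dependency's pool is full.
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{dependency} bulkhead full: shedding load")
    try:
        return fn(*args, **kwargs)
    finally:
        sem.release()
```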
Circuit breaker configuration (typical):
- Failure threshold: 5 failures in 10s
- Open duration: 30s before half-open
- Half-open: allow 1–3 test requests; success → closed, failure → open
Senior rule:
Circuit breakers protect the caller. Bulkheads protect the system. Use both.
RATE LIMITING + BACKPRESSURE
- apply limits per user/tenant
- protect expensive endpoints
- apply queue depth limits
Signal backpressure clearly:
- `429` with `Retry-After`
- typed error response
Rate limiting algorithms:
| Algorithm | How it works | Use case |
|---|---|---|
| Token bucket | Refill tokens at a fixed rate | API throttling; allows bursts |
| Sliding window | Count requests in rolling window | Per-minute limits |
| Fixed window | Reset at interval boundary | Simple quotas |
| Leaky bucket | Process at fixed rate | Smooth output rate |
Token bucket (conceptual):
Bucket capacity: 100 tokens
Refill rate: 10 tokens/sec
Request arrives → if tokens > 0: consume 1 token and allow
If tokens = 0: reject with 429, Retry-After: X
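A single-process sketch of that bucket (capacity 100, refill 10 tokens/sec); rate limiting across instances needs a shared backend such as Redis:
```python
import time

class TokenBucket:
    def __init__(self, capacity=100, refill_rate=10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate        # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller responds 429 with a Retry-After hint
```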
Sliding window:
- Track request timestamps. Count requests in last 60 seconds (rolling).
- If count >= limit, reject. More accurate than fixed window (no burst at boundary).
- Trade-off: more memory (store timestamps) vs. fixed window (simpler, bursty).
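The sliding window in the same style, one instance per key (user, API key, tenant); the deque of timestamps is the memory cost noted above:
```python
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit=60, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()   # request times inside the rolling window

    def allow(self) -> bool:
        now = time.monotonic()
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False            # reject with 429
        self.timestamps.append(now)
        return True
```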
Where to rate limit:
- API gateway: per-user, per-IP, per-API-key.
- Service mesh: per-service, per-endpoint.
- Application: per-tenant for multi-tenant SaaS.
Backpressure propagation:
- Queue depth limit: When the worker queue exceeds N (e.g., 1000), reject new work. Return 503 or 429 so clients back off.
- Client behavior: Honor Retry-After, implement exponential backoff.
- Cascading: Each layer propagates "I'm overloaded" upward.
- Load shedding: When overloaded, reject low-priority requests first. Preserve the critical path.
Response format for backpressure:
```json
{
  "error": "rate_limited",
  "retry_after_seconds": 60,
  "message": "Too many requests. Retry after 60 seconds."
}
```
- Include a `Retry-After` header. Clients can parse it and sleep.
- Use consistent error codes: 429 for rate limit, 503 for overload.
Senior rule:
Backpressure without clear signals (429, Retry-After) forces clients to guess. Guessing causes thundering herds.
POISON MESSAGES + DLQs
Async systems must assume:
- malformed payloads
- unprocessable messages
- repeated failures
Pattern:
- limited retries
- then DLQ
- then replay tooling
Worker rule:
Consumers must be idempotent because delivery is at-least-once.
Async flow: queue → consumer → DLQ → replay
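A consumer-loop sketch of that flow. The queue client is hypothetical (`ack`, `nack`, `receive_count`, and `dlq.send` stand in for your broker's API; SQS, for example, can route to a DLQ automatically via a redrive policy). The max receive count matches the operational tips further down:
```python
MAX_RECEIVE_COUNT = 3   # attempts before a message is treated as poison

def consume(queue, dlq, handle):
    """At-least-once consumer loop: handle() must be idempotent."""
    for message in queue:                        # hypothetical queue client
        try:
            handle(message.body)
            message.ack()
        except Exception as exc:
            if message.receive_count >= MAX_RECEIVE_COUNT:
                # Poison: preserve payload and error context for replay tooling.
                dlq.send(message.body,
                         metadata={"error": repr(exc),
                                   "receive_count": message.receive_count})
                message.ack()                    # remove it from the main queue
            else:
                message.nack()                   # redeliver; broker increments receive count
```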
Visibility tooling requirements:
- DLQ depth dashboard — alert when depth > 0
- Message inspection — view payload, headers, error reason
- Replay by ID or batch — reprocess after code/schema fix
- Metrics — messages per second to DLQ, by error type
Why messages become poison:
- Schema change: producer upgraded, consumer expects old format.
- Bad data: null where non-null expected, invalid enum, malformed JSON.
- External failure: downstream API rejects; retries exhausted.
- Ordering: message processed out of order; depends on prior message not yet available.
Alerting on DLQ depth:
- Critical: DLQ depth > 10 for 5 minutes — page on-call
- Warning: DLQ depth > 0 for 15 minutes — Slack alert
- Info: DLQ receives any message — log for trending
Operational tips:
- Set max receive count (e.g., 3) before DLQ. Don't retry forever.
- Preserve original message metadata on DLQ for debugging.
- Document replay runbook: when to replay, who approves, how to verify.
- Consider separate DLQs by error type (schema vs. downstream vs. timeout) for targeted replay.
Senior rule:
A DLQ without replay tooling is a graveyard. Build replay before you need it.
ROLLOUTS THAT DON'T HURT
- feature flags
- canary deployments
- progressive delivery
- fast rollback
Senior rule:
Rollback is part of the design, not an emergency improvisation.
Feature flag types:
| Type | Purpose | Example |
|---|---|---|
| Release | Ship code behind flag, flip when ready | New checkout flow |
| Experiment | A/B test | Pricing page variant |
| Ops | Kill switch for risky feature | Disable external API |
| Permission | Role-based access | Beta feature for admins |
Flag hygiene:
- Remove flags after rollout. Stale flags create technical debt.
- Default to off for new flags. Explicitly enable for safe rollout.
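In code the check is small; what matters is that flags default to off and that unknown (removed) flags fail closed. The flag name and admin allowlist below are illustrative; real systems use a flag service or config store:
```python
# Minimal flag-check sketch; not a real flag service.
FLAGS = {
    "new-checkout-flow": {"enabled": False, "allow_roles": {"admin"}},  # default off
}

def is_enabled(flag: str, role: str = "user") -> bool:
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False                 # unknown or removed flags are off
    return cfg["enabled"] or role in cfg["allow_roles"]

print(is_enabled("new-checkout-flow"))                 # False: off by default
print(is_enabled("new-checkout-flow", role="admin"))   # True: permission flag for admins
```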
Canary deployments — traffic percentages:
- Start: 1% → new version
- Observe: error rate, latency, business metrics
- Ramp: 5% → 10% → 25% → 50% → 100%
- Rollback: any metric degradation → revert to 0%
- Hold at each step for 5–15 minutes. Latent bugs surface over time.
Blue-green deployments:
┌─────────┐                        ┌─────────┐
│  Blue   │ ◄── traffic switch ──► │  Green  │
│ (live)  │                        │  (new)  │
└─────────┘                        └─────────┘
- Two identical environments. Deploy to inactive (green). Switch traffic in one cut.
- Rollback: switch traffic back to blue. Instant.
- Caveat: DB migrations must be backward-compatible. Both versions run against same DB during switch.
Progressive delivery with metrics gates:
- Auto-promote only if: error rate < 0.1%, p99 latency < 500ms, no critical alerts.
- Manual gates: pause at 10%, 50% for human review.
- Automated rollback: if gate fails, revert automatically.
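A sketch of an automated gate over canary metrics; the thresholds mirror the bullets above, and in practice the values would be scraped from the metrics backend for the canary slice only:
```python
def canary_gate(metrics: dict) -> bool:
    """Return True if the canary may be promoted to the next traffic step."""
    return (
        metrics["error_rate"] < 0.001            # < 0.1%
        and metrics["p99_latency_ms"] < 500
        and metrics["critical_alerts"] == 0
    )

# Illustrative values; a real pipeline pulls these from the metrics backend.
sample = {"error_rate": 0.0004, "p99_latency_ms": 420, "critical_alerts": 0}
print("promote" if canary_gate(sample) else "rollback")
```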
Rollout strategy comparison:
| Strategy | Rollback speed | Resource cost | Complexity |
|---|---|---|---|
| Big bang | Slow | Low | Low |
| Blue-green | Instant | 2x | Medium |
| Canary | Fast | 1x + routing | Medium |
| Progressive | Fast | 1x + gates | High |
Rollback checklist:
- One-command rollback. No manual steps.
- Database migrations: support both old and new code during rollback window.
- Feature flags: instant kill switch for risky features.
Senior rule:
Canary without metrics gates is a hope-based deployment. Gate on real signals.
GRACEFUL DEGRADATION
When dependencies fail, degrade features instead of failing the whole system.
Fallback strategies:
| Strategy | When to use | Example |
|---|---|---|
| Cached response | Read-heavy, stale OK | Product catalog from cache |
| Simplified UI | Non-critical feature | Hide recommendations block |
| Read-only mode | Writes failing | Disable checkout, show catalog |
| Default value | Optional enrichment | Show "—" instead of score |
| Stale-while-revalidate | Can show old data | Serve cached, refresh in background |
Dependency classification:
- Critical: No fallback. Payment, auth, core order flow. Fail loudly if down.
- Optional: Can degrade. Recommendations, search suggestions, analytics.
- Enrichment: Nice-to-have. Avatars, badges, social counts. Default or hide.
Document the classification:
- Maintain a dependency map: service → criticality → fallback.
- Review quarterly. Dependencies and criticality change.
┌─────────────────────────────────────────────────────────┐
│                         REQUEST                          │
├─────────────────────────────────────────────────────────┤
│ Critical:   Auth, Payment  → Must succeed or fail fast   │
│ Optional:   Search, Recs   → Fallback to cached/simple   │
│ Enrichment: Avatars        → Hide or show placeholder    │
└─────────────────────────────────────────────────────────┘
Implementation pattern:
- Wrap optional calls in try/catch or fallback chain.
- Return degraded result (cached, default, empty) on failure.
- Log degradation for observability. Don't silently hide failures forever.
- Consider circuit breaker for optional deps: after N failures, skip call and use fallback for a cooldown period.
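A small wrapper capturing that pattern; the recommendations call in the usage comment is a placeholder, and the per-dependency circuit breaker from the last bullet is left out for brevity:
```python
import logging

logger = logging.getLogger("degradation")

def with_fallback(primary, fallback_value, name: str):
    """Call an optional dependency; on failure, log and return a degraded result."""
    try:
        return primary()
    except Exception as exc:
        # Degrade instead of failing the request, but keep the failure observable.
        logger.warning("degraded %s: %s", name, exc)
        return fallback_value

# Usage (fetch_recommendations is a placeholder for the real client call):
# recs = with_fallback(lambda: fetch_recommendations(user_id), [], "recommendations")
```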
Senior rule:
Classify dependencies before you need to. Deciding during an outage is too late.
HEALTH CHECKS & READINESS
Liveness vs readiness probes:
| Probe | Purpose | Failure action |
|---|---|---|
| Liveness | Is the process alive? | Restart container |
| Readiness | Can it accept traffic? | Remove from load balancer |
- Liveness: Simple check (e.g., `/health` returns 200). Fails → orchestrator kills and restarts.
- Readiness: Check dependencies (DB, cache, downstream APIs). Fails → stop sending traffic until ready.
- Never check heavy dependencies in liveness. A slow DB would kill healthy processes.
- Startup probe (K8s): separate probe for slow-starting containers. Gives more time before liveness kicks in.
- Prefer failing readiness over accepting traffic that will fail. Better to wait than to error 500.
Dependency health checks:
- Readiness should verify: DB connection, cache connection, critical downstream reachable.
- If any critical dependency is down, fail readiness. Do not accept traffic.
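A sketch of both endpoints using Flask (any HTTP framework works the same way); the dependency checks are placeholders for real pings such as `SELECT 1` or a Redis `PING`:
```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_db() -> bool:      # placeholder: e.g., SELECT 1 on a pooled connection
    return True

def check_cache() -> bool:   # placeholder: e.g., Redis PING
    return True

@app.route("/health")
def health():
    # Liveness: the process is up. Deliberately does NOT touch dependencies.
    return jsonify(status="ok"), 200

@app.route("/ready")
def ready():
    checks = {"db": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return jsonify(checks), status
```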
Startup ordering:
1. Process starts
2. Liveness: OK (process is up)
3. Connect to DB, cache, etc.
4. Readiness: OK (dependencies ready)
5. Load balancer adds to pool
6. Traffic flows
- Use slow-start readiness: delay readiness until warm-up (e.g., cache populated, connections open).
Probe configuration (Kubernetes example):
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```
- Liveness: `/health` — process alive. Don't check dependencies here.
- Readiness: `/ready` — DB, cache, critical deps OK. Fail this when overloaded.
- Overloaded? Fail readiness. The orchestrator stops sending traffic. Prevents cascade.
Senior rule:
Readiness probes that don't check dependencies send traffic to a process that will fail. Check what matters.
CHAOS ENGINEERING (LITE)
Controlled failure injection to validate resilience.
Simple chaos experiments:
- Kill one random instance every 5 minutes (see the sketch after this list).
- Add 500ms latency to a dependency.
- Return 5xx from a dependency for 1% of requests.
- Fill disk on a test node.
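A "custom script" version of the first experiment, sketched with the official `kubernetes` Python client. The namespace and label selector are assumptions; run it only in staging, inside an agreed window:
```python
import random
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

# Assumption: the chaos target is the checkout deployment in the staging namespace.
pods = v1.list_namespaced_pod("staging", label_selector="app=checkout").items
if pods:
    victim = random.choice(pods)
    print(f"chaos: deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, "staging")
```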
Tools:
| Tool | Purpose | Complexity |
|---|---|---|
| Gremlin | Hosted chaos platform | Medium |
| LitmusChaos | Kubernetes-native chaos | Medium |
| Custom scripts | Kill pods, add latency | Low |
Start small:
- Begin with process kill (pod delete). Validate restart and recovery.
- Add latency injection. Validate timeouts and circuit breakers.
- Then: network partition, disk fill, CPU stress.
- Run in staging first. Production chaos is for mature teams with solid observability.
Game days:
- Schedule a 2-hour window. Run the chaos in a non-production or staging environment.
- Roles: chaos runner, observer, incident commander.
- Goal: validate runbooks, alerting, and rollback procedures.
- Debrief: what broke, what we learned, what to fix.
Experiment design template:
- Hypothesis: If [dependency X] fails, [system Y] will [expected behavior].
- Blast radius: What is affected? Limit to one service, one region.
- Abort criteria: Rollback if error rate > 5%, latency > 2x baseline.
- Runbook: Step-by-step to stop the experiment and restore.
Never run chaos in production without:
- Explicit approval and scheduled window.
- Rollback plan and on-call ready.
- Metrics dashboards to observe impact.
Senior rule:
Chaos engineering without a hypothesis is vandalism. Start with: "If X fails, we expect Y to happen."
EXERCISES
- Identify endpoints that must be idempotent and how you'll guarantee it.
- Design a retry policy (backoff + jitter) for a flaky dependency.
- Define DLQ rules and replay process.
- Write a rollout plan for a risky change.
- Classify your dependencies: critical, optional, enrichment. For each optional one, define a fallback.
- Design liveness and readiness probes for a service that depends on DB, cache, and one external API.
- Propose a chaos experiment for your system. State the hypothesis and expected outcome.
- For a service you own: list its dependencies, classify each, and define fallbacks for optional ones.
- Design a retry budget for an API that calls 3 downstream services. What percentage would you allow? How would you enforce it?
- Sketch a circuit breaker for a recommendation service. What failure threshold? What open duration? When would you go half-open?
- Draw the async flow for a message that becomes poison. Include: main queue, consumer, retries, DLQ, replay. What happens at each step?