
RELIABILITY IS DESIGNING FOR NORMAL FAILURE

Production is partial failure:

  • timeouts
  • dependency brownouts
  • retries
  • backpressure

Senior fullstack means the UI and backend cooperate to fail safely.

The cost of unreliability:

  • User trust — One bad outage erases months of goodwill. Users remember.
  • SLA breaches — Downtime credits, contractual penalties, enterprise churn.
  • Revenue — Every second of outage costs money. E‑commerce, ads, SaaS all bleed.
  • On-call burnout — Unreliable systems mean 3 a.m. pages. Engineers quit.

Quantify it:

  • Track: MTTR (mean time to recover), error budget consumption, incident frequency.
  • Goal: reduce surprise. Predictable failure modes beat rare catastrophic ones.
  • Error budget: if SLA is 99.9%, you have a 0.1% failure budget per month (≈ 43 minutes in a 30-day month). Spend it on deployments, experiments.

Senior rule:

Reliability is not a feature. It is table stakes. Design for normal failure, or failure will design you.


TIMEOUTS + RETRIES + BACKOFF (THE TRINITY)

Rules:

  • always set timeouts
  • retry only idempotent operations (or make them idempotent)
  • exponential backoff + jitter

Avoid:

  • synchronized retries (retry storm)

Exponential backoff with jitter (Python sketch; call_dependency is a placeholder for your client call):

import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY = 0.1    # 100ms
MAX_DELAY = 30.0    # 30s

class RetryableError(Exception):
    """Placeholder for whatever your client raises on retryable failures."""

def call_with_backoff():
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_dependency()
        except RetryableError:
            if attempt == MAX_ATTEMPTS:
                raise
            # Exponential growth, capped at MAX_DELAY.
            delay = min(BASE_DELAY * 2 ** attempt, MAX_DELAY)
            # 20% jitter desynchronizes clients and prevents retry storms.
            time.sleep(delay + random.uniform(0, delay * 0.2))

Retry policies by operation type:

Operation Type   | Retry?                       | Max Attempts | Backoff
-----------------|------------------------------|--------------|---------------------
Read (GET)       | Yes                          | 3–5          | Exponential + jitter
Write (POST)     | Only if idempotent           | 2–3          | Exponential + jitter
Payment/Charge   | No (or with idempotency key) | 1            | N/A
Search/Query     | Yes                          | 3–5          | Exponential + jitter
Webhook outbound | Yes                          | 3–5          | Exponential + jitter

Retryable vs non-retryable errors:

  • Retry: 5xx, timeout, connection refused, 429 (with backoff).
  • Do not retry: 4xx (except 429), validation errors, auth failures.

Make writes idempotent for retries:

  • Idempotency key: client sends Idempotency-Key: <uuid> with POST.
  • Server: first request with key → process and cache result; duplicate → return cached result.
  • Required for: payments, order creation, webhook delivery.
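
A minimal server-side sketch (the in-memory dict is a stand-in for a shared store like Redis with a TTL; production code needs an atomic set-if-absent to avoid races between concurrent duplicates):

_results = {}   # idempotency key -> cached response

def process_order(payload):
    # Placeholder for the real business logic.
    return {"status": "created", "order": payload}

def handle_post(idempotency_key, payload):
    # First request with this key: process and cache the result.
    # Any duplicate (a client retry) returns the cached result unchanged.
    if idempotency_key not in _results:
        _results[idempotency_key] = process_order(payload)
    return _results[idempotency_key]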

Retry budgets:

  • Cap total retries across all clients. Example: 20% of requests may retry.
  • Prevents a failing dependency from amplifying load via retries.
  • When budget exhausted: fail fast, surface error, stop hammering.
  • Implementation: track retry count per time window; when budget exceeded, reject new retries until window resets.
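
A sketch of per-window enforcement (20% ratio from above; call record_request() on every request and can_retry() before each retry):

import time

class RetryBudget:
    # Allow retries only while retries stay under a fraction of recent traffic.
    def __init__(self, ratio=0.2, window_seconds=10):
        self.ratio = ratio
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.requests = 0
        self.retries = 0

    def _roll_window(self):
        if time.monotonic() - self.window_start >= self.window:
            self.window_start = time.monotonic()
            self.requests = 0
            self.retries = 0

    def record_request(self):
        self._roll_window()
        self.requests += 1

    def can_retry(self):
        self._roll_window()
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False   # budget exhausted: fail fast, surface the error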

Timeout guidelines:

  • Set timeouts at every boundary: HTTP client, DB connection, queue receive.
  • Chain rule: caller timeout > callee timeout. Otherwise the caller times out first and the callee's eventual answer arrives to no one.
  • Typical: 2–5s for fast APIs, 30s for heavy operations, 60s+ for batch jobs.
  • Never use infinite timeouts. A hung dependency should eventually fail, not block forever.
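
A sketch of the chain rule with an explicit deadline (uses the requests library; the 5s/2s numbers are illustrative):

import time

import requests   # assumed HTTP client; any client with a timeout parameter works

def handle_request(url):
    deadline = time.monotonic() + 5.0   # total budget for this request: 5s
    # Give the downstream call only what is left of our budget, capped at 2s,
    # so the callee always times out before the caller does.
    budget = max(0.1, min(2.0, deadline - time.monotonic()))
    resp = requests.get(url, timeout=budget)   # never infinite
    return resp.status_code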

Senior rule:

Retries without backoff and jitter turn a single failure into a DDoS on your own dependency.


CIRCUIT BREAKERS + BULKHEADS

  • circuit breaker stops calling a failing dependency
  • bulkheads isolate resources so one failure doesn't sink the system

UI equivalent:

  • degrade features instead of blocking the whole app

Circuit breaker state diagram:

┌─────────┐   failure threshold   ┌─────────────┐
│ CLOSED  │ ────────────────────► │    OPEN     │
│ (normal)│                       │ (fail fast) │
└─────────┘                       └──────┬──────┘
     ▲                                   │
     │ success                           │ timeout
     │                                   │
┌────┴──────┐   allow 1 test call        │
│ HALF-OPEN │ ◄──────────────────────────┘
│  (probe)  │
└───────────┘

Concrete example — payment service circuit breaker:

  • Closed: All payment requests go through.
  • Open: After 5 failures in 10s, stop calling payment service. Return "Payment temporarily unavailable" immediately.
  • Half-open: After 30s, allow 1 test request. Success → closed. Failure → open again.
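
A single-threaded sketch matching those numbers (5 failures in 10s trips it, 30s open, one probe in half-open):

import time

class CircuitBreaker:
    # States: CLOSED (normal), OPEN (fail fast), HALF-OPEN (one probe call).
    def __init__(self, failure_threshold=5, window=10.0, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.window = window
        self.open_seconds = open_seconds
        self.failures = []       # timestamps of recent failures
        self.opened_at = None    # None = CLOSED

    def call(self, fn):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.open_seconds:
            raise RuntimeError("circuit open: failing fast")
        # Either CLOSED, or OPEN long enough to go HALF-OPEN: attempt the call.
        try:
            result = fn()
        except Exception:
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
                self.opened_at = now    # trip, or re-trip after a failed probe
            raise
        self.failures.clear()
        self.opened_at = None           # success (including the probe): back to CLOSED
        return result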

UI circuit breaker equivalent:

  • If recommendations API fails 3 times, hide the recommendations block. Show "Suggestions temporarily unavailable."
  • Don't block the whole page. Isolate the failing component.

Bulkhead pattern — pool isolation:

┌─────────────────────────────────────────────────────────┐
│                       APPLICATION                        │
├─────────────┬─────────────┬─────────────┬───────────────┤
│   Pool A    │   Pool B    │   Pool C    │    Pool D     │
│ (payments)  │  (search)   │  (notify)   │  (analytics)  │
│ max 10 conn │ max 20 conn │ max 5 conn  │  max 15 conn  │
└─────────────┴─────────────┴─────────────┴───────────────┘

  • One saturated dependency (e.g., payments) cannot exhaust connections for others.
  • Search, notify, analytics keep working when payments is slow or down.
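
A sketch of pool isolation with per-dependency semaphores (pool sizes from the diagram; real systems would use separate connection or thread pools):

import threading

# One bounded semaphore per dependency caps its concurrent calls (the "pool").
POOLS = {
    "payments":  threading.BoundedSemaphore(10),
    "search":    threading.BoundedSemaphore(20),
    "notify":    threading.BoundedSemaphore(5),
    "analytics": threading.BoundedSemaphore(15),
}

def call_with_bulkhead(dependency, fn):
    sem = POOLS[dependency]
    # Non-blocking acquire: if the pool is saturated, fail fast instead of
    # letting one slow dependency absorb every thread in the application.
    if not sem.acquire(blocking=False):
        raise RuntimeError(dependency + " pool exhausted")
    try:
        return fn()
    finally:
        sem.release()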

Circuit breaker configuration (typical):

  • Failure threshold: 5 failures in 10s
  • Open duration: 30s before half-open
  • Half-open: allow 1–3 test requests; success → closed, failure → open

Senior rule:

Circuit breakers protect the caller. Bulkheads protect the system. Use both.


RATE LIMITING + BACKPRESSURE

  • apply limits per user/tenant
  • protect expensive endpoints
  • apply queue depth limits

Signal backpressure clearly:

  • 429 with Retry-After
  • typed error response

Rate limiting algorithms:

Algorithm      | How it works                     | Use case
---------------|----------------------------------|-----------------------------
Token bucket   | Refill tokens at fixed rate      | API throttling, burst allow
Sliding window | Count requests in rolling window | Per-minute limits
Fixed window   | Reset at interval boundary       | Simple quotas
Leaky bucket   | Process at fixed rate            | Smooth output rate

Token bucket (conceptual):

Bucket capacity: 100 tokens
Refill rate: 10 tokens/sec

Request arrives:
  tokens ≥ 1 → consume 1, allow
  tokens = 0 → reject with 429, Retry-After: X
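
A runnable sketch of the same bucket (capacity 100, refill 10/sec, as above):

import time

class TokenBucket:
    def __init__(self, capacity=100, refill_per_sec=10):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill continuously based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 with Retry-After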

Sliding window:

  • Track request timestamps. Count requests in last 60 seconds (rolling).
  • If count >= limit, reject. More accurate than fixed window (no burst at boundary).
  • Trade-off: more memory (store timestamps) vs. fixed window (simpler, bursty).
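
A sketch using a deque of timestamps (in practice you keep one window per user or API key):

import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self):
        now = time.monotonic()
        # Drop timestamps that fell out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False   # reject: over the per-window limit
        self.timestamps.append(now)
        return True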

Where to rate limit:

  • API gateway: per-user, per-IP, per-API-key.
  • Service mesh: per-service, per-endpoint.
  • Application: per-tenant for multi-tenant SaaS.

Backpressure propagation:

  • Queue depth limit: When the worker queue exceeds N (e.g., 1000), reject new work with 503 or 429. The client backs off (sketch below).
  • Client behavior: Honor Retry-After, implement exponential backoff.
  • Cascading: Each layer propagates "I'm overloaded" upward.
  • Load shedding: When overloaded, reject low-priority requests first. Preserve critical path.
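
A sketch of the queue-depth rule (hypothetical handler; plain ints stand in for HTTP status codes):

import queue

WORK_QUEUE = queue.Queue(maxsize=1000)   # depth limit from the example above

def submit(job):
    try:
        WORK_QUEUE.put_nowait(job)
        return 202   # accepted for async processing
    except queue.Full:
        # Overloaded: shed load with an explicit signal (plus a Retry-After
        # header in a real handler) instead of queuing without bound.
        return 503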

Response format for backpressure:

{
  "error": "rate_limited",
  "retry_after_seconds": 60,
  "message": "Too many requests. Retry after 60 seconds."
}

  • Include Retry-After header. Clients can parse and sleep.
  • Use consistent error codes: 429 for rate limit, 503 for overload.

Senior rule:

Backpressure without clear signals (429, Retry-After) forces clients to guess. Guessing causes thundering herds.


POISON MESSAGES + DLQs

Async systems must assume:

  • malformed payloads
  • unprocessable messages
  • repeated failures

Pattern:

  • limited retries
  • then DLQ
  • then replay tooling

Worker rule:

Consumers must be idempotent because delivery is at-least-once.

Async flow: queue → consumer → DLQ → replay
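
A consumer-loop sketch of that flow (the message/queue/dlq interface and process() are hypothetical stand-ins for your broker client; max receive count of 3 matches the operational tips below):

MAX_RECEIVE_COUNT = 3   # deliveries before a message is parked in the DLQ

def handle_delivery(message, main_queue, dlq):
    try:
        process(message.body)          # must be idempotent: delivery is at-least-once
        main_queue.ack(message)
    except Exception as exc:
        if message.receive_count >= MAX_RECEIVE_COUNT:
            # Park it with the original payload and error preserved for replay.
            dlq.publish(message.body, metadata={
                "error": str(exc),
                "receive_count": message.receive_count,
            })
            main_queue.ack(message)    # remove the poison message from the main queue
        else:
            main_queue.nack(message)   # redeliver; the broker increments receive_count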

Visibility tooling requirements:

  • DLQ depth dashboard — alert when depth > 0
  • Message inspection — view payload, headers, error reason
  • Replay by ID or batch — reprocess after code/schema fix
  • Metrics — messages per second to DLQ, by error type

Why messages become poison:

  • Schema change: producer upgraded, consumer expects old format.
  • Bad data: null where non-null expected, invalid enum, malformed JSON.
  • External failure: downstream API rejects; retries exhausted.
  • Ordering: message processed out of order; depends on prior message not yet available.

Alerting on DLQ depth:

  • Critical: DLQ depth > 10 for 5 minutes — page on-call
  • Warning: DLQ depth > 0 for 15 minutes — Slack alert
  • Info: DLQ receives any message — log for trending

Operational tips:

  • Set max receive count (e.g., 3) before DLQ. Don't retry forever.
  • Preserve original message metadata on DLQ for debugging.
  • Document replay runbook: when to replay, who approves, how to verify.
  • Consider separate DLQs by error type (schema vs. downstream vs. timeout) for targeted replay.

Senior rule:

A DLQ without replay tooling is a graveyard. Build replay before you need it.


ROLLOUTS THAT DON'T HURT

  • feature flags
  • canary deployments
  • progressive delivery
  • fast rollback

Senior rule:

Rollback is part of the design, not an emergency improvisation.

Feature flag types:

Type       | Purpose                                 | Example
-----------|-----------------------------------------|-------------------------
Release    | Ship code behind flag, flip when ready  | New checkout flow
Experiment | A/B test                                | Pricing page variant
Ops        | Kill switch for risky feature           | Disable external API
Permission | Role-based access                       | Beta feature for admins

Flag hygiene:

  • Remove flags after rollout. Stale flags create technical debt.
  • Default to off for new flags. Explicitly enable for safe rollout.
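
A minimal default-off check (the FLAGS dict is a stand-in for your flag service or config source):

FLAGS = {"new_checkout_flow": False}   # hypothetical flag; default off, flip when ready

def is_enabled(flag_name, default=False):
    # Unknown or unreadable flag -> default (off). Fail closed for new features.
    return FLAGS.get(flag_name, default)

if is_enabled("new_checkout_flow"):
    ...   # new code path
else:
    ...   # existing code path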

Canary deployments — traffic percentages:

  • Start: 1% → new version
  • Observe: error rate, latency, business metrics
  • Ramp: 5% → 10% → 25% → 50% → 100%
  • Rollback: any metric degradation → revert to 0%
  • Hold at each step for 5–15 minutes. Latent bugs surface over time.

Blue-green deployments:

┌─────────┐                     ┌─────────┐
│  Blue   │ ◄─── traffic ─────► │  Green  │
│ (live)  │       switch        │  (new)  │
└─────────┘                     └─────────┘

  • Two identical environments. Deploy to inactive (green). Switch traffic in one cut.
  • Rollback: switch traffic back to blue. Instant.
  • Caveat: DB migrations must be backward-compatible. Both versions run against same DB during switch.

Progressive delivery with metrics gates:

  • Auto-promote only if: error rate < 0.1%, p99 latency < 500ms, no critical alerts.
  • Manual gates: pause at 10%, 50% for human review.
  • Automated rollback: if gate fails, revert automatically.
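
A sketch of the gate logic (thresholds from the bullets above; the metrics dict is a hypothetical snapshot from your monitoring system):

def gate_passes(metrics):
    # Auto-promote only when every signal is healthy.
    return (metrics["error_rate"] < 0.001          # < 0.1%
            and metrics["p99_latency_ms"] < 500
            and metrics["critical_alerts"] == 0)

def next_traffic_percent(metrics, current, ramp=(1, 5, 10, 25, 50, 100)):
    if not gate_passes(metrics):
        return 0                                   # gate failed: automatic rollback
    higher = [p for p in ramp if p > current]
    return higher[0] if higher else 100            # promote one step up the ramp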

Rollout strategy comparison:

Strategy    | Rollback speed | Resource cost | Complexity
------------|----------------|---------------|-----------
Big bang    | Slow           | Low           | Low
Blue-green  | Instant        | 2x            | Medium
Canary      | Fast           | 1x + routing  | Medium
Progressive | Fast           | 1x + gates    | High

Rollback checklist:

  • One-command rollback. No manual steps.
  • Database migrations: support both old and new code during rollback window.
  • Feature flags: instant kill switch for risky features.

Senior rule:

Canary without metrics gates is a hope-based deployment. Gate on real signals.


GRACEFUL DEGRADATION

When dependencies fail, degrade features instead of failing the whole system.

Fallback strategies:

Strategy               | When to use          | Example
-----------------------|----------------------|-------------------------------------
Cached response        | Read-heavy, stale OK | Product catalog from cache
Simplified UI          | Non-critical feature | Hide recommendations block
Read-only mode         | Writes failing       | Disable checkout, show catalog
Default value          | Optional enrichment  | Show "—" instead of score
Stale-while-revalidate | Can show old data    | Serve cached, refresh in background

Dependency classification:

  • Critical: No fallback. Payment, auth, core order flow. Fail loudly if down.
  • Optional: Can degrade. Recommendations, search suggestions, analytics.
  • Enrichment: Nice-to-have. Avatars, badges, social counts. Default or hide.

Document the classification:

  • Maintain a dependency map: service → criticality → fallback.
  • Review quarterly. Dependencies and criticality change.

┌─────────────────────────────────────────────────────────┐
│                         REQUEST                          │
├─────────────────────────────────────────────────────────┤
│ Critical:   Auth, Payment → Must succeed or fail fast    │
│ Optional:   Search, Recs  → Fallback to cached/simple    │
│ Enrichment: Avatars       → Hide or show placeholder     │
└─────────────────────────────────────────────────────────┘

Implementation pattern:

  • Wrap optional calls in try/catch or fallback chain.
  • Return degraded result (cached, default, empty) on failure.
  • Log degradation for observability. Don't silently hide failures forever.
  • Consider circuit breaker for optional deps: after N failures, skip call and use fallback for a cooldown period.
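
A sketch of the wrapper (fetch_recommendations is a hypothetical optional call; logging keeps the degradation visible):

import logging

log = logging.getLogger("degradation")

def with_fallback(fn, fallback, name):
    # Call an optional dependency; on failure, log the degradation and
    # return the fallback instead of propagating the error.
    try:
        return fn()
    except Exception as exc:
        log.warning("degraded: %s failed (%s); using fallback", name, exc)
        return fallback

# Usage: recommendations are optional, so an empty list degrades gracefully.
# recs = with_fallback(lambda: fetch_recommendations(user_id), [], "recommendations")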

Senior rule:

Classify dependencies before you need to. Deciding during an outage is too late.


HEALTH CHECKS & READINESS

Liveness vs readiness probes:

Probe     | Purpose                | Failure action
----------|------------------------|---------------------------
Liveness  | Is the process alive?  | Restart container
Readiness | Can it accept traffic? | Remove from load balancer

  • Liveness: Simple check (e.g., /health returns 200). Fails → orchestrator kills and restarts.
  • Readiness: Check dependencies (DB, cache, downstream APIs). Fails → stop sending traffic until ready.
  • Never check heavy dependencies in liveness. A slow DB would kill healthy processes.
  • Startup probe (K8s): separate probe for slow-starting containers. Gives more time before liveness kicks in.
  • Prefer failing readiness over accepting traffic that will fail. Better to wait than to error 500.

Dependency health checks:

  • Readiness should verify: DB connection, cache connection, critical downstream reachable.
  • If any critical dependency is down, fail readiness. Do not accept traffic.
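
A sketch of a /ready handler (the ping functions are placeholders for real dependency checks, each with a short timeout):

def readiness(checks):
    # checks: mapping of dependency name -> zero-arg ping function
    # (e.g., a SELECT 1 against the DB, a Redis PING).
    failures = {}
    for name, ping in checks.items():
        try:
            ping()
        except Exception as exc:
            failures[name] = str(exc)
    if failures:
        return 503, {"ready": False, "failures": failures}   # LB stops routing here
    return 200, {"ready": True}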

Startup ordering:

1. Process starts
2. Liveness: OK (process is up)
3. Connect to DB, cache, etc.
4. Readiness: OK (dependencies ready)
5. Load balancer adds to pool
6. Traffic flows
  • Use slow-start readiness: delay readiness until warm-up (e.g., cache populated, connections open).

Probe configuration (Kubernetes example):

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

  • Liveness: /health — process alive. Don't check dependencies here.
  • Readiness: /ready — DB, cache, critical deps OK. Fail this when overloaded.
  • Overloaded? Fail readiness. Orchestrator stops sending traffic. Prevents cascade.

Senior rule:

Readiness probes that don't check dependencies send traffic to a process that will fail. Check what matters.


CHAOS ENGINEERING (LITE)

Controlled failure injection to validate resilience.

Simple chaos experiments:

  • Kill one random instance every 5 minutes.
  • Add 500ms latency to a dependency.
  • Return 5xx from a dependency for 1% of requests.
  • Fill disk on a test node.

Tools:

Tool           | Purpose                 | Complexity
---------------|-------------------------|-----------
Gremlin        | Hosted chaos platform   | Medium
LitmusChaos    | Kubernetes-native chaos | Medium
Custom scripts | Kill pods, add latency  | Low

Start small:

  • Begin with process kill (pod delete). Validate restart and recovery.
  • Add latency injection. Validate timeouts and circuit breakers.
  • Then: network partition, disk fill, CPU stress.
  • Run in staging first. Production chaos is for mature teams with solid observability.
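
A latency-injection sketch in the "custom scripts" spirit (probability and delay are assumed values; wrap the client at a choke point such as your HTTP session so the blast radius stays one dependency):

import random
import time

LATENCY_PROB = 0.01      # inject into 1% of calls
LATENCY_MS = 500         # simulated dependency brownout

def chaos_wrap(fn):
    # Wrap a dependency call; a small fraction of requests get extra latency,
    # which exercises timeouts, retries, and circuit breakers downstream.
    def wrapped(*args, **kwargs):
        if random.random() < LATENCY_PROB:
            time.sleep(LATENCY_MS / 1000.0)
        return fn(*args, **kwargs)
    return wrapped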

Game days:

  • Schedule a 2-hour window. Run chaos in a non-prod or staging environment.
  • Roles: chaos runner, observer, incident commander.
  • Goal: validate runbooks, alerting, and rollback procedures.
  • Debrief: what broke, what we learned, what to fix.

Experiment design template:

  1. Hypothesis: If [dependency X] fails, [system Y] will [expected behavior].
  2. Blast radius: What is affected? Limit to one service, one region.
  3. Abort criteria: Rollback if error rate > 5%, latency > 2x baseline.
  4. Runbook: Step-by-step to stop the experiment and restore.

Never run chaos in production without:

  • Explicit approval and scheduled window.
  • Rollback plan and on-call ready.
  • Metrics dashboards to observe impact.

Senior rule:

Chaos engineering without a hypothesis is vandalism. Start with: "If X fails, we expect Y to happen."


EXERCISES

  1. Identify endpoints that must be idempotent and how you'll guarantee it.

  2. Design a retry policy (backoff + jitter) for a flaky dependency.

  3. Define DLQ rules and replay process.

  4. Write a rollout plan for a risky change.

  5. Classify your dependencies: critical, optional, enrichment. For each optional one, define a fallback.

  6. Design liveness and readiness probes for a service that depends on DB, cache, and one external API.

  7. Propose a chaos experiment for your system. State the hypothesis and expected outcome.

  8. For a service you own: list its dependencies, classify each, and define fallbacks for optional ones.

  9. Design a retry budget for an API that calls 3 downstream services. What percentage would you allow? How would you enforce it?

  10. Sketch a circuit breaker for a recommendation service. What failure threshold? What open duration? When would you go half-open?

  11. Draw the async flow for a message that becomes poison. Include: main queue, consumer, retries, DLQ, replay. What happens at each step?


🏁 END — RELIABILITY PATTERNS