RELIABILITY IS DESIGNING FOR NORMAL FAILURE
Production is partial failure:
- timeouts
- dependency brownouts
- retries
- backpressure
Senior fullstack means the UI and backend cooperate to fail safely.
The cost of unreliability:
- User trust — One bad outage erases months of goodwill. Users remember.
- SLA breaches — Downtime credits, contractual penalties, enterprise churn.
- Revenue — Every second of outage costs money. E‑commerce, ads, SaaS all bleed.
- On-call burnout — Unreliable systems mean 3 a.m. pages. Engineers quit.
Quantify it:
- Track: MTTR (mean time to recover), error budget consumption, incident frequency.
- Goal: reduce surprise. Predictable failure modes beat rare catastrophic ones.
- Error budget: if the SLA is 99.9%, you have a 0.1% failure budget per month (roughly 43 minutes). Spend it on deployments and experiments.
Senior rule:
Reliability is not a feature. It is table stakes. Design for normal failure, or failure will design you.
TIMEOUTS + RETRIES + BACKOFF (THE TRINITY)
Rules:
- always set timeouts
- retry only idempotent operations (or make them idempotent)
- exponential backoff + jitter
Avoid:
- synchronized retries (retry storm)
Exponential backoff with jitter (Python sketch):
```python
import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY = 0.1      # 100 ms
MAX_DELAY = 30.0      # 30 s

def call_with_retry():
    # call_dependency() and RetryableError are placeholders from the text above.
    attempt = 0
    while True:
        try:
            return call_dependency()
        except RetryableError:
            attempt += 1
            if attempt >= MAX_ATTEMPTS:
                raise
            delay = min(BASE_DELAY * 2 ** attempt, MAX_DELAY)
            jitter = random.uniform(0, delay * 0.2)   # 20% jitter
            time.sleep(delay + jitter)
```
Retry policies by operation type:
| Operation Type | Retry? | Max Attempts | Backoff |
|---|---|---|---|
| Read (GET) | Yes | 3–5 | Exponential + jitter |
| Write (POST) | Only if idempotent | 2–3 | Exponential + jitter |
| Payment/Charge | No (or with idempotency key) | 1 | N/A |
| Search/Query | Yes | 3–5 | Exponential + jitter |
| Webhook outbound | Yes | 3–5 | Exponential + jitter |
Retryable vs non-retryable errors:
- Retry: 5xx, timeout, connection refused, 429 (with backoff).
- Do not retry: 4xx (except 429), validation errors, auth failures.
Make writes idempotent for retries:
- Idempotency key: client sends `Idempotency-Key: <uuid>` with the POST.
- Server: first request with key → process and cache result; duplicate → return cached result.
- Required for: payments, order creation, webhook delivery.
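A minimal server-side sketch of that flow, assuming a single process and an in-memory store; a real service would use a shared store (Redis or a DB table) with a TTL so retries that land on another instance still deduplicate. `process_payment` is a stand-in for the actual write:
```python
# In-memory sketch only; production needs a shared store with a TTL.
# process_payment() is a hypothetical handler for the underlying write.
_idempotency_cache = {}  # idempotency key -> cached response

def handle_post(idempotency_key: str, payload: dict) -> dict:
    if idempotency_key in _idempotency_cache:
        # Duplicate (e.g., a client retry): return the original result, don't re-process.
        return _idempotency_cache[idempotency_key]
    result = process_payment(payload)
    _idempotency_cache[idempotency_key] = result
    return result
```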
Retry budgets:
- Cap retries as a fraction of normal traffic. Example: at most 20% of requests may be retries.
- Prevents a failing dependency from amplifying load via retries.
- When budget exhausted: fail fast, surface error, stop hammering.
- Implementation: track retry count per time window; when budget exceeded, reject new retries until window resets.
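One way to enforce it, sketched for a single-process client: count requests and retries in a rolling window and refuse new retries once they exceed a configured fraction of traffic (20% here, matching the example above):
```python
import time
from collections import deque

class RetryBudget:
    """Allow retries only while they stay under `ratio` of recent requests."""
    def __init__(self, ratio=0.2, window_seconds=60):
        self.ratio = ratio
        self.window = window_seconds
        self.requests = deque()  # timestamps of all requests
        self.retries = deque()   # timestamps of retries

    def _trim(self, q, now):
        while q and now - q[0] > self.window:
            q.popleft()

    def record_request(self):
        self.requests.append(time.monotonic())

    def can_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) >= self.ratio * max(len(self.requests), 1):
            return False          # budget exhausted: fail fast instead of retrying
        self.retries.append(now)
        return True
```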
Timeout guidelines:
- Set timeouts at every boundary: HTTP client, DB connection, queue receive.
- Chain rule: caller timeout > callee timeout. Otherwise the callee keeps working on a request the caller has already abandoned.
- Typical: 2–5s for fast APIs, 30s for heavy operations, 60s+ for batch jobs.
- Never use infinite timeouts. A hung dependency should eventually fail, not block forever.
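For example, with the Python `requests` library; the (connect, read) values are illustrative and should come from your latency budget, not from habit:
```python
import requests

# Never rely on the default: with no timeout, a hung dependency blocks forever.
# (connect timeout, read timeout) in seconds; illustrative values.
resp = requests.get("https://api.example.com/orders/42", timeout=(2, 5))
resp.raise_for_status()
```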
Senior rule:
Retries without backoff and jitter turn a single failure into a DDoS on your own dependency.
CIRCUIT BREAKERS + BULKHEADS
- circuit breaker stops calling a failing dependency
- bulkheads isolate resources so one failure doesn't sink the system
UI equivalent:
- degrade features instead of blocking the whole app
Circuit breaker state diagram:
┌─────────┐      failure threshold      ┌─────────┐
│ CLOSED  │ ──────────────────────────► │  OPEN   │
│ (normal)│                             │ (fail   │
└─────────┘                             │  fast)  │
     ▲                                  └────┬────┘
     │ test call succeeds                    │ open duration elapses
     │                                       │
┌────┴─────┐                                 │
│ HALF-OPEN│ ◄───────────────────────────────┘
│ (probe)  │     allow 1 test call
└──────────┘     (if the test call fails, go back to OPEN)
Concrete example — payment service circuit breaker:
- Closed: All payment requests go through.
- Open: After 5 failures in 10s, stop calling payment service. Return "Payment temporarily unavailable" immediately.
- Half-open: After 30s, allow 1 test request. Success → closed. Failure → open again.
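A minimal in-process sketch of that breaker; the thresholds mirror the example, and production code would add locking and limit concurrent half-open probes. `charge_card` in the usage comment is a placeholder for the real payment call:
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=10.0, open_duration=30.0):
        self.failure_threshold = failure_threshold   # failures within `window` to open
        self.window = window                         # seconds
        self.open_duration = open_duration           # seconds before half-open probe
        self.failures = []                           # timestamps of recent failures
        self.opened_at = None                        # None means closed

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.open_duration:
                raise RuntimeError("circuit open: failing fast")
            # Open duration elapsed: half-open, allow this call as the probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
                self.opened_at = now                 # open (or re-open after a failed probe)
            raise
        self.failures.clear()
        self.opened_at = None                        # success closes the breaker
        return result

# Usage (charge_card is a placeholder for the real payment client call):
# breaker = CircuitBreaker()
# breaker.call(charge_card, order_id)
```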
UI circuit breaker equivalent:
- If recommendations API fails 3 times, hide the recommendations block. Show "Suggestions temporarily unavailable."
- Don't block the whole page. Isolate the failing component.
Bulkhead pattern — pool isolation:
┌─────────────────────────────────────────────────────────┐
│                       APPLICATION                        │
├─────────────┬─────────────┬─────────────┬───────────────┤
│   Pool A    │   Pool B    │   Pool C    │    Pool D     │
│ (payments)  │  (search)   │  (notify)   │  (analytics)  │
│ max 10 conn │ max 20 conn │ max 5 conn  │  max 15 conn  │
└─────────────┴─────────────┴─────────────┴───────────────┘
- One saturated dependency (e.g., payments) cannot exhaust connections for others.
- Search, notify, analytics keep working when payments is slow or down.
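In-process, the same isolation can be approximated with a bounded semaphore per dependency, so one slow dependency cannot consume every worker. Pool sizes mirror the diagram and are illustrative:
```python
import threading

# One bounded pool of call slots per dependency (sizes from the diagram above).
bulkheads = {
    "payments":  threading.BoundedSemaphore(10),
    "search":    threading.BoundedSemaphore(20),
    "notify":    threading.BoundedSemaphore(5),
    "analytics": threading.BoundedSemaphore(15),
}

def call_with_bulkhead(dependency: str, fn, *args, **kwargs):
    sem = bulkheads[dependency]
    # Fail fast instead of queueing when this dependency's pool is full.
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{dependency} bulkhead full: shedding load")
    try:
        return fn(*args, **kwargs)
    finally:
        sem.release()
```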
Circuit breaker configuration (typical):
- Failure threshold: 5 failures in 10s
- Open duration: 30s before half-open
- Half-open: allow 1–3 test requests; success → closed, failure → open
Senior rule:
Circuit breakers protect the caller. Bulkheads protect the system. Use both.
RATE LIMITING + BACKPRESSURE
- apply limits per user/tenant
- protect expensive endpoints
- apply queue depth limits
Signal backpressure clearly:
- `429` with `Retry-After`
- typed error response
Rate limiting algorithms:
| Algorithm | How it works | Use case |
|---|---|---|
| Token bucket | Refill tokens at a fixed rate | API throttling; allows bursts |
| Sliding window | Count requests in rolling window | Per-minute limits |
| Fixed window | Reset at interval boundary | Simple quotas |
| Leaky bucket | Process at fixed rate | Smooth output rate |
Token bucket (conceptual):
Bucket capacity: 100 tokens
Refill rate: 10 tokens/sec
Request arrives → if tokens > 0: consume 1 token and allow
If tokens = 0: reject with 429, Retry-After: X
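A single-process sketch of that bucket (capacity 100, refill 10 tokens/sec); rate limiting across instances needs a shared backend such as Redis:
```python
import time

class TokenBucket:
    def __init__(self, capacity=100, refill_rate=10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate        # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller responds 429 with a Retry-After hint
```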
Sliding window:
- Track request timestamps. Count requests in last 60 seconds (rolling).
- If count >= limit, reject. More accurate than fixed window (no burst at boundary).
- Trade-off: more memory (store timestamps) vs. fixed window (simpler, bursty).
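The sliding window in the same style, one instance per key (user, API key, tenant); the deque of timestamps is the memory cost noted above:
```python
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit=60, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()   # request times inside the rolling window

    def allow(self) -> bool:
        now = time.monotonic()
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False            # reject with 429
        self.timestamps.append(now)
        return True
```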
Where to rate limit:
- API gateway: per-user, per-IP, per-API-key.
- Service mesh: per-service, per-endpoint.
- Application: per-tenant for multi-tenant SaaS.
Backpressure propagation:
- Queue depth limit: When the worker queue exceeds N (e.g., 1000), reject new work. Return 503 or 429 so clients back off.
- Client behavior: Honor Retry-After, implement exponential backoff.
- Cascading: Each layer propagates "I'm overloaded" upward.
- Load shedding: When overloaded, reject low-priority requests first. Preserve the critical path.
Response format for backpressure:
```json
{
  "error": "rate_limited",
  "retry_after_seconds": 60,
  "message": "Too many requests. Retry after 60 seconds."
}
```
- Include a `Retry-After` header. Clients can parse it and sleep.
- Use consistent error codes: 429 for rate limit, 503 for overload.
Senior rule:
Backpressure without clear signals (429, Retry-After) forces clients to guess. Guessing causes thundering herds.
POISON MESSAGES + DLQs
Async systems must assume:
- malformed payloads
- unprocessable messages
- repeated failures
Pattern:
- limited retries
- then DLQ
- then replay tooling
Worker rule:
Consumers must be idempotent because delivery is at-least-once.
Async flow: queue → consumer → DLQ → replay
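A consumer-loop sketch of that flow. The queue client is hypothetical (`ack`, `nack`, `receive_count`, and `dlq.send` stand in for your broker's API; SQS, for example, can route to a DLQ automatically via a redrive policy). The max receive count matches the operational tips further down:
```python
MAX_RECEIVE_COUNT = 3   # attempts before a message is treated as poison

def consume(queue, dlq, handle):
    """At-least-once consumer loop: handle() must be idempotent."""
    for message in queue:                        # hypothetical queue client
        try:
            handle(message.body)
            message.ack()
        except Exception as exc:
            if message.receive_count >= MAX_RECEIVE_COUNT:
                # Poison: preserve payload and error context for replay tooling.
                dlq.send(message.body,
                         metadata={"error": repr(exc),
                                   "receive_count": message.receive_count})
                message.ack()                    # remove it from the main queue
            else:
                message.nack()                   # redeliver; broker increments receive count
```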
Visibility tooling requirements:
- DLQ depth dashboard — alert when depth > 0
- Message inspection — view payload, headers, error reason
- Replay by ID or batch — reprocess after code/schema fix
- Metrics — messages per second to DLQ, by error type
Why messages become poison:
- Schema change: producer upgraded, consumer expects old format.
- Bad data: null where non-null expected, invalid enum, malformed JSON.
- External failure: downstream API rejects; retries exhausted.
- Ordering: message processed out of order; depends on prior message not yet available.
Alerting on DLQ depth:
- Critical: DLQ depth > 10 for 5 minutes — page on-call
- Warning: DLQ depth > 0 for 15 minutes — Slack alert
- Info: DLQ receives any message — log for trending
Operational tips:
- Set max receive count (e.g., 3) before DLQ. Don't retry forever.
- Preserve original message metadata on DLQ for debugging.
- Document replay runbook: when to replay, who approves, how to verify.
- Consider separate DLQs by error type (schema vs. downstream vs. timeout) for targeted replay.
Senior rule:
A DLQ without replay tooling is a graveyard. Build replay before you need it.
ROLLOUTS THAT DON'T HURT
- feature flags
- canary deployments
- progressive delivery
- fast rollback
Senior rule:
Rollback is part of the design, not an emergency improvisation.
Feature flag types:
| Type | Purpose | Example |
|---|---|---|
| Release | Ship code behind flag, flip when ready | New checkout flow |
| Experiment | A/B test | Pricing page variant |
| Ops | Kill switch for risky feature | Disable external API |
| Permission | Role-based access | Beta feature for admins |
Flag hygiene:
- Remove flags after rollout. Stale flags create technical debt.
- Default to off for new flags. Explicitly enable for safe rollout.
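In code the check is small; what matters is that flags default to off and that unknown (removed) flags fail closed. The flag name and admin allowlist below are illustrative; real systems use a flag service or config store:
```python
# Minimal flag-check sketch; not a real flag service.
FLAGS = {
    "new-checkout-flow": {"enabled": False, "allow_roles": {"admin"}},  # default off
}

def is_enabled(flag: str, role: str = "user") -> bool:
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False                 # unknown or removed flags are off
    return cfg["enabled"] or role in cfg["allow_roles"]

print(is_enabled("new-checkout-flow"))                 # False: off by default
print(is_enabled("new-checkout-flow", role="admin"))   # True: permission flag for admins
```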
Canary deployments — traffic percentages:
- Start: 1% → new version
- Observe: error rate, latency, business metrics
- Ramp: 5% → 10% → 25% → 50% → 100%
- Rollback: any metric degradation → revert to 0%
- Hold at each step for 5–15 minutes. Latent bugs surface over time.
Blue-green deployments:
┌─────────┐                        ┌─────────┐
│  Blue   │ ◄── traffic switch ──► │  Green  │
│ (live)  │                        │  (new)  │
└─────────┘                        └─────────┘
- Two identical environments. Deploy to inactive (green). Switch traffic in one cut.
- Rollback: switch traffic back to blue. Instant.
- Caveat: DB migrations must be backward-compatible. Both versions run against same DB during switch.
Progressive delivery with metrics gates:
- Auto-promote only if: error rate < 0.1%, p99 latency < 500ms, no critical alerts.
- Manual gates: pause at 10%, 50% for human review.
- Automated rollback: if gate fails, revert automatically.
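A sketch of an automated gate over canary metrics; the thresholds mirror the bullets above, and in practice the values would be scraped from the metrics backend for the canary slice only:
```python
def canary_gate(metrics: dict) -> bool:
    """Return True if the canary may be promoted to the next traffic step."""
    return (
        metrics["error_rate"] < 0.001            # < 0.1%
        and metrics["p99_latency_ms"] < 500
        and metrics["critical_alerts"] == 0
    )

# Illustrative values; a real pipeline pulls these from the metrics backend.
sample = {"error_rate": 0.0004, "p99_latency_ms": 420, "critical_alerts": 0}
print("promote" if canary_gate(sample) else "rollback")
```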
Rollout strategy comparison:
| Strategy | Rollback speed | Resource cost | Complexity |
|---|---|---|---|
| Big bang | Slow | Low | Low |
| Blue-green | Instant | 2x | Medium |
| Canary | Fast | 1x + routing | Medium |
| Progressive | Fast | 1x + gates | High |
Rollback checklist:
- One-command rollback. No manual steps.
- Database migrations: support both old and new code during rollback window.
- Feature flags: instant kill switch for risky features.
Senior rule:
Canary without metrics gates is a hope-based deployment. Gate on real signals.
GRACEFUL DEGRADATION
When dependencies fail, degrade features instead of failing the whole system.
Fallback strategies:
| Strategy | When to use | Example |
|---|---|---|
| Cached response | Read-heavy, stale OK | Product catalog from cache |
| Simplified UI | Non-critical feature | Hide recommendations block |
| Read-only mode | Writes failing | Disable checkout, show catalog |
| Default value | Optional enrichment | Show "—" instead of score |
| Stale-while-revalidate | Can show old data | Serve cached, refresh in background |
Dependency classification:
- Critical: No fallback. Payment, auth, core order flow. Fail loudly if down.
- Optional: Can degrade. Recommendations, search suggestions, analytics.
- Enrichment: Nice-to-have. Avatars, badges, social counts. Default or hide.
Document the classification:
- Maintain a dependency map: service → criticality → fallback.
- Review quarterly. Dependencies and criticality change.
┌─────────────────────────────────────────────────────────┐
│                         REQUEST                          │
├─────────────────────────────────────────────────────────┤
│ Critical:   Auth, Payment  → Must succeed or fail fast   │
│ Optional:   Search, Recs   → Fallback to cached/simple   │
│ Enrichment: Avatars        → Hide or show placeholder    │
└─────────────────────────────────────────────────────────┘
Implementation pattern:
- Wrap optional calls in try/catch or fallback chain.
- Return degraded result (cached, default, empty) on failure.
- Log degradation for observability. Don't silently hide failures forever.
- Consider circuit breaker for optional deps: after N failures, skip call and use fallback for a cooldown period.
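A small wrapper capturing that pattern; the recommendations call in the usage comment is a placeholder, and the per-dependency circuit breaker from the last bullet is left out for brevity:
```python
import logging

logger = logging.getLogger("degradation")

def with_fallback(primary, fallback_value, name: str):
    """Call an optional dependency; on failure, log and return a degraded result."""
    try:
        return primary()
    except Exception as exc:
        # Degrade instead of failing the request, but keep the failure observable.
        logger.warning("degraded %s: %s", name, exc)
        return fallback_value

# Usage (fetch_recommendations is a placeholder for the real client call):
# recs = with_fallback(lambda: fetch_recommendations(user_id), [], "recommendations")
```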
Senior rule:
Classify dependencies before you need to. Deciding during an outage is too late.
HEALTH CHECKS & READINESS
Liveness vs readiness probes:
| Probe | Purpose | Failure action |
|---|---|---|
| Liveness | Is the process alive? | Restart container |
| Readiness | Can it accept traffic? | Remove from load balancer |
- Liveness: Simple check (e.g., `/health` returns 200). Fails → orchestrator kills and restarts.
- Readiness: Check dependencies (DB, cache, downstream APIs). Fails → stop sending traffic until ready.
- Never check heavy dependencies in liveness. A slow DB would kill healthy processes.
- Startup probe (K8s): separate probe for slow-starting containers. Gives more time before liveness kicks in.
- Prefer failing readiness over accepting traffic that will fail. Better to wait than to error 500.
Dependency health checks:
- Readiness should verify: DB connection, cache connection, critical downstream reachable.
- If any critical dependency is down, fail readiness. Do not accept traffic.
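A sketch of both endpoints using Flask (any HTTP framework works the same way); the dependency checks are placeholders for real pings such as `SELECT 1` or a Redis `PING`:
```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_db() -> bool:      # placeholder: e.g., SELECT 1 on a pooled connection
    return True

def check_cache() -> bool:   # placeholder: e.g., Redis PING
    return True

@app.route("/health")
def health():
    # Liveness: the process is up. Deliberately does NOT touch dependencies.
    return jsonify(status="ok"), 200

@app.route("/ready")
def ready():
    checks = {"db": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return jsonify(checks), status
```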
Startup ordering:
1. Process starts
2. Liveness: OK (process is up)
3. Connect to DB, cache, etc.
4. Readiness: OK (dependencies ready)
5. Load balancer adds to pool
6. Traffic flows
- Use slow-start readiness: delay readiness until warm-up (e.g., cache populated, connections open).
Probe configuration (Kubernetes example):
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```
- Liveness: `/health` — process alive. Don't check dependencies here.
- Readiness: `/ready` — DB, cache, critical deps OK. Fail this when overloaded.
- Overloaded? Fail readiness. The orchestrator stops sending traffic. Prevents cascade.
Senior rule:
Readiness probes that don't check dependencies send traffic to a process that will fail. Check what matters.
CHAOS ENGINEERING (LITE)
Controlled failure injection to validate resilience.
Simple chaos experiments:
- Kill one random instance every 5 minutes (see the sketch after this list).
- Add 500ms latency to a dependency.
- Return 5xx from a dependency for 1% of requests.
- Fill disk on a test node.
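A "custom script" version of the first experiment, sketched with the official `kubernetes` Python client. The namespace and label selector are assumptions; run it only in staging, inside an agreed window:
```python
import random
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

# Assumption: the chaos target is the checkout deployment in the staging namespace.
pods = v1.list_namespaced_pod("staging", label_selector="app=checkout").items
if pods:
    victim = random.choice(pods)
    print(f"chaos: deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, "staging")
```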
Tools:
| Tool | Purpose | Complexity |
|---|---|---|
| Gremlin | Hosted chaos platform | Medium |
| LitmusChaos | Kubernetes-native chaos | Medium |
| Custom scripts | Kill pods, add latency | Low |
Start small:
- Begin with process kill (pod delete). Validate restart and recovery.
- Add latency injection. Validate timeouts and circuit breakers.
- Then: network partition, disk fill, CPU stress.
- Run in staging first. Production chaos is for mature teams with solid observability.
Game days:
- Schedule a 2-hour window. Run the chaos in a non-production or staging environment.
- Roles: chaos runner, observer, incident commander.
- Goal: validate runbooks, alerting, and rollback procedures.
- Debrief: what broke, what we learned, what to fix.
Experiment design template:
- Hypothesis: If [dependency X] fails, [system Y] will [expected behavior].
- Blast radius: What is affected? Limit to one service, one region.
- Abort criteria: Rollback if error rate > 5%, latency > 2x baseline.
- Runbook: Step-by-step to stop the experiment and restore.
Never run chaos in production without:
- Explicit approval and scheduled window.
- Rollback plan and on-call ready.
- Metrics dashboards to observe impact.
Senior rule:
Chaos engineering without a hypothesis is vandalism. Start with: "If X fails, we expect Y to happen."
EXERCISES
- Identify endpoints that must be idempotent and how you'll guarantee it.
- Design a retry policy (backoff + jitter) for a flaky dependency.
- Define DLQ rules and replay process.
- Write a rollout plan for a risky change.
- Classify your dependencies: critical, optional, enrichment. For each optional one, define a fallback.
- Design liveness and readiness probes for a service that depends on DB, cache, and one external API.
- Propose a chaos experiment for your system. State the hypothesis and expected outcome.
- For a service you own: list its dependencies, classify each, and define fallbacks for optional ones.
- Design a retry budget for an API that calls 3 downstream services. What percentage would you allow? How would you enforce it?
- Sketch a circuit breaker for a recommendation service. What failure threshold? What open duration? When would you go half-open?
- Draw the async flow for a message that becomes poison. Include: main queue, consumer, retries, DLQ, replay. What happens at each step?