CASE STUDY — Part IV: Idempotent Write API (Outbox + Worker)
SCENARIO
A user clicks "Submit" and sees a spinner. The network flakes.
They retry. Now you have:
- duplicate records
- duplicated side effects (emails/webhooks)
You need a backend design that treats retries as normal.
When does this happen in the real world?
- Mobile apps: Flaky cellular networks, app backgrounding, and timeouts cause users to tap "Submit" multiple times. Without idempotency, each tap creates a new order or subscription.
- Microservices: Service A calls Service B; B times out before responding. A retries. B may have already processed the first request. You get double-charges, double-fulfillment, or duplicate notifications.
- Webhooks: Your system sends a webhook to a partner. The partner's endpoint returns 5xx or times out. You retry. The partner may have processed the first delivery. They receive the same event twice and create duplicate records on their side.
- Browser back/refresh: User hits back, then forward, or refreshes during a slow POST. The browser may resend the request. Same payload, same intent—but your API sees it as a new request.
Senior rule:
Retries are not exceptional. They are the default. Design for them from day one.
INVARIANTS
- Create request is idempotent for the same logical action. If the client sends the same idempotency key twice (with the same payload), the server returns the same response and performs no duplicate work.
- Side effects happen at most once. Emails, webhooks, downstream API calls, and file processing must not be duplicated for a single logical action.
What happens if violated?
| Violation | Consequence |
|---|---|
| Idempotency not enforced | Duplicate orders, double charges, duplicate user accounts. Support tickets, refunds, data cleanup. |
| Side effects duplicated | Users receive 3 welcome emails. Partners get the same webhook 5 times. Downstream systems create duplicate records. |
| Inconsistent response on retry | Client gets different responses for same key → client can't safely retry; they don't know which result is "real." |
API CONTRACT
- POST /resource
  - header: Idempotency-Key (e.g., UUID v4 from client)
  - body: request payload
- Rules:
  - Store (key, userId, requestHash, response) in a durable store (DB or cache with persistence).
  - Same key + same payload → return stored response (no re-execution).
  - Same key + different payload → return deterministic error (e.g., 409 Conflict with message "Idempotency key reused with different request body").
Pseudocode for idempotency check:
on POST /resource:
    key = request.headers["Idempotency-Key"]
    if key is missing or invalid:
        return 400 Bad Request
    stored = idempotency_store.get(key, userId)
    if stored:
        if stored.requestHash == hash(request.body):
            return stored.response   // cached, no side effects
        else:
            return 409 Conflict      // key collision: same key, different payload
    // First time: execute
    // (production: a unique constraint on (key, userId) closes the race
    //  where two concurrent first requests both miss the store)
    response = execute_create(request.body)
    idempotency_store.set(key, userId, hash(request.body), response, ttl=24h)
    return response
Key collision handling: When the same idempotency key is used with a different payload, you must fail deterministically. Never apply the new payload—the client made an error (e.g., reused a key across different requests). Return 409 Conflict so they can generate a new key and retry with the correct payload.
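The check above can be sketched as runnable Python. This is a minimal in-memory stand-in for the durable store; the names (`IdempotencyStore`, `handle_create`, `execute_create`) are illustrative, not from any real framework, and the payload hash is canonicalized so that JSON key order doesn't defeat the comparison.

```python
import hashlib
import json


class IdempotencyStore:
    """In-memory stand-in; production would use a DB table or persistent cache."""

    def __init__(self):
        self._records = {}  # (user_id, key) -> (request_hash, response)

    def get(self, user_id, key):
        return self._records.get((user_id, key))

    def set(self, user_id, key, request_hash, response):
        self._records[(user_id, key)] = (request_hash, response)


def payload_hash(body: dict) -> str:
    # Canonical JSON: sorted keys, fixed separators, so equal payloads
    # always hash equal regardless of key order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def handle_create(store, user_id, key, body, execute_create):
    if not key:
        return (400, {"error": "missing Idempotency-Key"})
    stored = store.get(user_id, key)
    if stored is not None:
        stored_hash, stored_response = stored
        if stored_hash == payload_hash(body):
            return stored_response  # cached response; no side effects re-run
        return (409, {"error": "idempotency key reused with different body"})
    response = (201, execute_create(body))  # first time: actually execute
    store.set(user_id, key, payload_hash(body), response)
    return response
```

A retry with the same key and payload returns the stored response without calling `execute_create` again; the same key with a different payload gets the deterministic 409.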
OUTBOX PATTERN
Goal: DB write and "event to process side effects" commit atomically. If the transaction commits, the side effect will be processed. If it rolls back, nothing is written.
Mechanism:
- Write resource row (e.g., orders).
- Write outbox row in the same transaction (e.g., outbox_events with order_id, event_type, payload, status=pending).
- Worker polls outbox, processes pending rows, emits email/webhook, marks status=processed.
Transaction boundary: Both writes must be in one transaction. If you write the order first and the outbox insert fails, you have an order with no corresponding event. If you write the outbox first and the order insert fails, you have an orphan event. Atomicity is critical.
ASCII diagram:
┌─────────────────────────────────────────────────────────────┐
│ API Request │
│ POST /orders (Idempotency-Key: abc-123) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SINGLE TRANSACTION │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ INSERT │ │ INSERT │ │
│ │ orders │ + │ outbox_events │ → COMMIT │
│ │ (id=1,...) │ │ (order_id=1, │ │
│ └─────────────┘ │ status=pending) │ │
│ └─────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Outbox Worker (separate process) │
│ SELECT * FROM outbox_events WHERE status='pending' │
│ FOR EACH: send webhook/email → UPDATE status='processed' │
└─────────────────────────────────────────────────────────────┘
Worker idempotency: The worker dedupes by outbox row ID. Downstream systems (e.g., webhook receiver) should also dedupe by a stable delivery ID (e.g., outbox_id or event_id) so that retries don't cause duplicate processing.
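One polling pass of the worker, following the diagram's SELECT/send/UPDATE loop, might look like this. `send_event` stands in for the webhook/email sender and `delivered_ids` models dedupe by outbox row ID; both are assumptions for illustration. Marking processed after sending gives at-least-once delivery, which downstream dedupe absorbs.

```python
import sqlite3


def process_pending(conn, send_event, delivered_ids):
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        if event_id not in delivered_ids:  # dedupe by outbox row id
            send_event(event_id, event_type, payload)
            delivered_ids.add(event_id)
        # Mark processed only after delivery. A crash between send and
        # update causes a redelivery on restart -- at-least-once, so the
        # receiver must dedupe by event_id.
        conn.execute(
            "UPDATE outbox_events SET status = 'processed' WHERE id = ?",
            (event_id,),
        )
    conn.commit()
```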
FAILURE MODES
| Failure | What goes wrong | Mitigation |
|---|---|---|
| Worker crash after emitting webhook | Webhook sent, but outbox row not marked processed. Worker restarts, sends again. | Downstream dedupes by event_id. Or: store delivery records; before sending, check if already delivered. |
| DB crash mid-transaction | Transaction rolls back. No order, no outbox row; if the idempotency record was written in the same transaction, it rolls back too. Client retries with same key → treated as a first attempt; create re-executes. | Expected behavior. Nothing was committed, so the retry is safe. |
| Outbox reader falls behind | High write volume; worker can't keep up. Backlog grows. Side effects delayed (emails late, webhooks stale). | Scale workers. Alert on outbox_lag_seconds. Consider partitioning outbox by tenant or event type. |
| Duplicate webhook delivery | Partner receives same event twice (e.g., their 5xx caused your retry). They create duplicate records. | Partner must implement idempotency (accept event_id, dedupe). Document this in your webhook contract. |
| Idempotency store unavailable | Can't check/store keys. Two options: fail open (allow duplicates) or fail closed (reject all writes). | Fail closed for financial/critical writes. Fail open only for non-critical, with alerting. |
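The "duplicate webhook delivery" row puts the burden on the receiver. A minimal sketch of receiver-side dedupe by event_id (class and method names are illustrative; a real receiver would persist seen IDs, e.g., via a unique index, rather than hold them in memory):

```python
class WebhookReceiver:
    """Processes each event_id at most once, ignoring redeliveries."""

    def __init__(self, handler):
        self._seen = set()  # durable storage in a real system
        self._handler = handler

    def receive(self, event_id, payload):
        if event_id in self._seen:
            return "duplicate-ignored"  # already processed; ack without re-running
        self._handler(payload)
        self._seen.add(event_id)
        return "processed"
```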
OBSERVABILITY
| Metric | Why it matters |
|---|---|
| Idempotency reuse rate | % of requests that hit cached response. High = many retries; confirms design is working. Low = either few retries or key not being sent. |
| Outbox lag | Time from outbox insert to processed. Alert if > N minutes. |
| Side-effect delivery failures | Webhook/email send failures. Track by destination, error type. |
| Idempotency key collisions (409) | Same key, different payload. Indicates client bug; monitor for spikes. |
Alerting patterns:
- outbox_lag_seconds > 300 → page on-call.
- idempotency_409_rate > 0.01 → investigate client implementations.
- side_effect_failure_rate > 0.05 → check downstream health, credentials.
EXERCISE
Apply this pattern to:
- Signup welcome email: User signs up; send welcome email. Retries must not send multiple emails. Design the API contract (idempotency key source?), outbox schema, and worker behavior.
- File upload processing: User uploads a file; async job resizes/thumbnails it. Retries must not create duplicate jobs. Where does idempotency live—upload API or job queue? How do you dedupe the worker?
- Checkout fulfillment: User completes checkout; system charges card and creates shipment. Both must happen at most once. Sketch the transaction boundary, outbox events, and how you'd handle partial failure (card charged but shipment creation failed).
Architecture-level questions:
- Should idempotency keys be client-generated or server-generated? What are the tradeoffs? (Client-generated: client controls retries, must generate UUIDs. Server-generated: simpler client, but server must return key on first response for client to retry.)
- How long should you retain idempotency records? What if a client retries after 48 hours? (Typical: 24–72 hours. Beyond that, treat as new request or return 410 Gone.)
- For a multi-region deployment, where does the idempotency store live? What are the consistency requirements? (Must be strongly consistent within region. Cross-region: consider global store or accept that retries from different regions may not dedupe.)
- How do you handle partial failures in the outbox worker? (E.g., webhook sent, DB update to mark processed fails. Use at-least-once delivery + downstream dedupe.)