CASE STUDY — Part IV: Idempotent Write API (Outbox + Worker)
SCENARIO
A user clicks "Submit" and sees a spinner. The network flakes.
They retry. Now you have:
- duplicate records
- duplicated side effects (emails/webhooks)
You need a backend design that treats retries as normal.
When does this happen in the real world?
- Mobile apps: Flaky cellular networks, app backgrounding, and timeouts cause users to tap "Submit" multiple times. Without idempotency, each tap creates a new order or subscription.
- Microservices: Service A calls Service B; B times out before responding. A retries. B may have already processed the first request. You get double-charges, double-fulfillment, or duplicate notifications.
- Webhooks: Your system sends a webhook to a partner. The partner's endpoint returns 5xx or times out. You retry. The partner may have processed the first delivery. They receive the same event twice and create duplicate records on their side.
- Browser back/refresh: User hits back, then forward, or refreshes during a slow POST. The browser may resend the request. Same payload, same intent—but your API sees it as a new request.
Senior rule:
Retries are not exceptional. They are the default. Design for them from day one.
INVARIANTS
- Create request is idempotent for the same logical action. If the client sends the same idempotency key twice (with the same payload), the server returns the same response and performs no duplicate work.
- Side effects happen at most once. Emails, webhooks, downstream API calls, and file processing must not be duplicated for a single logical action.
What happens if violated?
| Violation | Consequence |
|---|---|
| Idempotency not enforced | Duplicate orders, double charges, duplicate user accounts. Support tickets, refunds, data cleanup. |
| Side effects duplicated | Users receive 3 welcome emails. Partners get the same webhook 5 times. Downstream systems create duplicate records. |
| Inconsistent response on retry | Client gets different responses for same key → client can't safely retry; they don't know which result is "real." |
API CONTRACT
- POST /resource
  - header: Idempotency-Key (e.g., UUID v4 from client)
  - body: request payload
- Rules:
  - Store (key, userId, requestHash, response) in a durable store (DB or cache with persistence).
  - Same key + same payload → return stored response (no re-execution).
  - Same key + different payload → return deterministic error (e.g., 409 Conflict with message "Idempotency key reused with different request body").
Pseudocode for idempotency check:
on POST /resource:
    key = request.headers["Idempotency-Key"]
    if key is missing or invalid:
        return 400 Bad Request
    stored = idempotency_store.get(key, userId)
    if stored:
        if stored.requestHash == hash(request.body):
            return stored.response   // cached, no side effects
        else:
            return 409 Conflict      // key collision: same key, different payload
    // First time: execute
    // (production: a unique constraint on (key, userId) closes the race
    //  where two concurrent first requests both miss the store)
    response = execute_create(request.body)
    idempotency_store.set(key, userId, hash(request.body), response, ttl=24h)
    return response
Key collision handling: When the same idempotency key is used with a different payload, you must fail deterministically. Never apply the new payload—the client made an error (e.g., reused a key across different requests). Return 409 Conflict so they can generate a new key and retry with the correct payload.
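The check above can be sketched as runnable Python. This is a minimal in-memory stand-in for the durable store; the names (`IdempotencyStore`, `handle_create`, `execute_create`) are illustrative, not from any real framework, and the payload hash is canonicalized so that JSON key order doesn't defeat the comparison.

```python
import hashlib
import json


class IdempotencyStore:
    """In-memory stand-in; production would use a DB table or persistent cache."""

    def __init__(self):
        self._records = {}  # (user_id, key) -> (request_hash, response)

    def get(self, user_id, key):
        return self._records.get((user_id, key))

    def set(self, user_id, key, request_hash, response):
        self._records[(user_id, key)] = (request_hash, response)


def payload_hash(body: dict) -> str:
    # Canonical JSON: sorted keys, fixed separators, so equal payloads
    # always hash equal regardless of key order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def handle_create(store, user_id, key, body, execute_create):
    if not key:
        return (400, {"error": "missing Idempotency-Key"})
    stored = store.get(user_id, key)
    if stored is not None:
        stored_hash, stored_response = stored
        if stored_hash == payload_hash(body):
            return stored_response  # cached response; no side effects re-run
        return (409, {"error": "idempotency key reused with different body"})
    response = (201, execute_create(body))  # first time: actually execute
    store.set(user_id, key, payload_hash(body), response)
    return response
```

A retry with the same key and payload returns the stored response without calling `execute_create` again; the same key with a different payload gets the deterministic 409.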
OUTBOX PATTERN
Goal: DB write and "event to process side effects" commit atomically. If the transaction commits, the side effect will be processed. If it rolls back, nothing is written.
Mechanism:
- Write resource row (e.g., orders).
- Write outbox row in the same transaction (e.g., outbox_events with order_id, event_type, payload, status=pending).
- Worker polls outbox, processes pending rows, emits email/webhook, marks status=processed.
Transaction boundary: Both writes must be in one transaction. If you write the order first and the outbox insert fails, you have an order with no corresponding event. If you write the outbox first and the order insert fails, you have an orphan event. Atomicity is critical.
ASCII diagram:
┌─────────────────────────────────────────────────────────────┐
│ API Request │
│ POST /orders (Idempotency-Key: abc-123) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SINGLE TRANSACTION │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ INSERT │ │ INSERT │ │
│ │ orders │ + │ outbox_events │ → COMMIT │
│ │ (id=1,...) │ │ (order_id=1, │ │
│ └─────────────┘ │ status=pending) │ │
│ └─────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Outbox Worker (separate process) │
│ SELECT * FROM outbox_events WHERE status='pending' │
│ FOR EACH: send webhook/email → UPDATE status='processed' │
└─────────────────────────────────────────────────────────────┘
Worker idempotency: The worker dedupes by outbox row ID. Downstream systems (e.g., webhook receiver) should also dedupe by a stable delivery ID (e.g., outbox_id or event_id) so that retries don't cause duplicate processing.
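One polling pass of the worker, following the diagram's SELECT/send/UPDATE loop, might look like this. `send_event` stands in for the webhook/email sender and `delivered_ids` models dedupe by outbox row ID; both are assumptions for illustration. Marking processed after sending gives at-least-once delivery, which downstream dedupe absorbs.

```python
import sqlite3


def process_pending(conn, send_event, delivered_ids):
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        if event_id not in delivered_ids:  # dedupe by outbox row id
            send_event(event_id, event_type, payload)
            delivered_ids.add(event_id)
        # Mark processed only after delivery. A crash between send and
        # update causes a redelivery on restart -- at-least-once, so the
        # receiver must dedupe by event_id.
        conn.execute(
            "UPDATE outbox_events SET status = 'processed' WHERE id = ?",
            (event_id,),
        )
    conn.commit()
```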
FAILURE MODES
| Failure | What goes wrong | Mitigation |
|---|---|---|
| Worker crash after emitting webhook | Webhook sent, but outbox row not marked processed. Worker restarts, sends again. | Downstream dedupes by event_id. Or: store delivery records; before sending, check if already delivered. |
| DB crash mid-transaction | Transaction rolls back. No order, no outbox row; if the idempotency record was written in the same transaction, it rolls back too. Client retries with same key → treated as a first attempt; create re-executes. | Expected behavior. Nothing was committed, so the retry is safe. |
| Outbox reader falls behind | High write volume; worker can't keep up. Backlog grows. Side effects delayed (emails late, webhooks stale). | Scale workers. Alert on outbox_lag_seconds. Consider partitioning outbox by tenant or event type. |
| Duplicate webhook delivery | Partner receives same event twice (e.g., their 5xx caused your retry). They create duplicate records. | Partner must implement idempotency (accept event_id, dedupe). Document this in your webhook contract. |
| Idempotency store unavailable | Can't check/store keys. Two options: fail open (allow duplicates) or fail closed (reject all writes). | Fail closed for financial/critical writes. Fail open only for non-critical, with alerting. |
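The "duplicate webhook delivery" row puts the burden on the receiver. A minimal sketch of receiver-side dedupe by event_id (class and method names are illustrative; a real receiver would persist seen IDs, e.g., via a unique index, rather than hold them in memory):

```python
class WebhookReceiver:
    """Processes each event_id at most once, ignoring redeliveries."""

    def __init__(self, handler):
        self._seen = set()  # durable storage in a real system
        self._handler = handler

    def receive(self, event_id, payload):
        if event_id in self._seen:
            return "duplicate-ignored"  # already processed; ack without re-running
        self._handler(payload)
        self._seen.add(event_id)
        return "processed"
```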
OBSERVABILITY
| Metric | Why it matters |
|---|---|
| Idempotency reuse rate | % of requests that hit cached response. High = many retries; confirms design is working. Low = either few retries or key not being sent. |
| Outbox lag | Time from outbox insert to processed. Alert if > N minutes. |
| Side-effect delivery failures | Webhook/email send failures. Track by destination, error type. |
| Idempotency key collisions (409) | Same key, different payload. Indicates client bug; monitor for spikes. |
Alerting patterns:
- outbox_lag_seconds > 300 → page on-call.
- idempotency_409_rate > 0.01 → investigate client implementations.
- side_effect_failure_rate > 0.05 → check downstream health, credentials.
EXERCISE
Apply this pattern to:
- Signup welcome email: User signs up; send welcome email. Retries must not send multiple emails. Design the API contract (idempotency key source?), outbox schema, and worker behavior.
- File upload processing: User uploads a file; async job resizes/thumbnails it. Retries must not create duplicate jobs. Where does idempotency live—upload API or job queue? How do you dedupe the worker?
- Checkout fulfillment: User completes checkout; system charges card and creates shipment. Both must happen at most once. Sketch the transaction boundary, outbox events, and how you'd handle partial failure (card charged but shipment creation failed).
Architecture-level questions:
- Should idempotency keys be client-generated or server-generated? What are the tradeoffs? (Client-generated: client controls retries, must generate UUIDs. Server-generated: simpler client, but server must return key on first response for client to retry.)
- How long should you retain idempotency records? What if a client retries after 48 hours? (Typical: 24–72 hours. Beyond that, treat as new request or return 410 Gone.)
- For a multi-region deployment, where does the idempotency store live? What are the consistency requirements? (Must be strongly consistent within region. Cross-region: consider global store or accept that retries from different regions may not dedupe.)
- How do you handle partial failures in the outbox worker? (E.g., webhook sent, DB update to mark processed fails. Use at-least-once delivery + downstream dedupe.)