Async Processing, Queues & Sagas
SECTION 1 — WHY ASYNCHRONY IS UNAVOIDABLE
Real systems cannot be fully synchronous because:
- users are slow
- networks are unreliable
- external services fail
- workloads spike unpredictably
- some work is expensive
Therefore:
Any serious backend system is partially asynchronous.
Trying to keep everything synchronous:
- increases latency
- reduces availability
- couples systems tightly
- causes cascading failures
SECTION 2 — SYNCHRONOUS VS ASYNCHRONOUS BOUNDARIES
Elite engineers are intentional about where async boundaries exist.
Synchronous (User-Facing)
- validation
- authorization
- lightweight reads
- state checks
Asynchronous (Background)
- emails / notifications
- payment confirmation
- analytics
- document processing
- integrations
- retries
Rule:
If the user doesn’t need the result immediately, it should be async.
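As a minimal sketch of this split, the handler below validates synchronously and defers the email to a background job. The names `handle_signup` and `background_jobs` are hypothetical, and the in-memory `deque` only stands in for a real queue.

```python
from collections import deque

background_jobs = deque()   # assumption: stand-in for a real, durable queue

def handle_signup(request):
    """Do only what the user must wait for; defer everything else."""
    email = request.get("email", "")
    if "@" not in email:                       # synchronous: validation
        return {"status": 400, "error": "invalid email"}
    # asynchronous: the user does not need the email to be sent before we reply
    background_jobs.append({"type": "send_welcome_email", "to": email})
    return {"status": 202, "message": "signup accepted"}
```

The response returns as soon as the job is enqueued; a separate worker would pick up `send_welcome_email` later.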
SECTION 3 — QUEUES AS LOAD-LEVELERS
Queues exist to:
- absorb spikes
- decouple producers from consumers
- smooth load
- enable retries
- provide durability
They turn:
“Do this now”
into
“Do this reliably.”
Queue Mental Model
Producer → Queue → Worker(s)
Key properties:
- producers don’t care who processes
- workers don’t care who produced
- failure is isolated
This separation is critical for scale.
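The Producer → Queue → Worker(s) model can be sketched with Python's standard `queue` and `threading` modules. This is an in-process illustration only: `run_pipeline` is a hypothetical name, doubling the job is a stand-in for real work, and an in-memory `queue.Queue` offers none of the durability a real broker provides.

```python
import queue
import threading

def run_pipeline(jobs, worker_count=3):
    """Push jobs onto a queue and let a pool of workers drain it."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            if job is None:              # sentinel: no more work for this worker
                q.task_done()
                return
            with lock:
                results.append(job * 2)  # stand-in for real processing
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:                     # the producer only knows about the queue
        q.put(job)
    for _ in threads:                    # one sentinel per worker
        q.put(None)
    q.join()                             # wait until every item is marked done
    for t in threads:
        t.join()
    return sorted(results)
```

The producer never references a worker, and each worker only sees the queue, so the number of workers can change without touching the producing code.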
SECTION 4 — DELIVERY SEMANTICS (THE HARD PART)
Every queue system has delivery guarantees:
At-Most-Once
- fast
- can lose messages
- no retries
At-Least-Once
- messages may repeat
- must handle duplicates
- most common in production
Exactly-Once
- extremely difficult
- usually simulated via idempotency
Elite Rule
Assume at-least-once delivery and design idempotently.
Never assume “exactly once” unless you deeply understand the system.
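A minimal sketch of an idempotent consumer under at-least-once delivery follows. `IdempotentConsumer` and the message shape (`id`, `amount`) are assumptions for illustration; in a real system the set of processed IDs would live in a durable store and be updated in the same transaction as the side effect.

```python
class IdempotentConsumer:
    def __init__(self):
        # assumption: in production this is a durable store, not process memory
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message):
        """Apply the message once; return False if it was a duplicate."""
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False                     # duplicate delivery: skip side effects
        self.balance += message["amount"]    # the side effect
        self.processed_ids.add(msg_id)
        return True
```

Delivering the same message twice leaves the balance unchanged the second time, which is the effect people usually mean when they say "exactly once".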
SECTION 5 — IDENTITY IN ASYNC SYSTEMS
Every message must have:
- unique ID
- correlation ID
- causation ID (optional but powerful)
Why?
- tracing
- deduplication
- debugging
- audits
If you can’t trace a message across services, you don’t control the system.
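One possible shape for these three IDs is sketched below. The `Envelope` class and the helpers `start_flow` and `follow_up` are hypothetical names, not a standard API; the convention assumed here is that the correlation ID is shared by the whole flow while the causation ID points at the direct parent message.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Envelope:
    type: str
    payload: dict
    correlation_id: str                  # shared by every message in one flow
    causation_id: Optional[str] = None   # ID of the message that caused this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def start_flow(type_, payload):
    """First message of a flow: it starts a new correlation ID."""
    return Envelope(type_, payload, correlation_id=str(uuid.uuid4()))

def follow_up(parent, type_, payload):
    """A message caused by `parent`: same correlation, parent as cause."""
    return Envelope(type_, payload,
                    correlation_id=parent.correlation_id,
                    causation_id=parent.message_id)
```

Filtering logs by `correlation_id` then returns the whole flow, and following `causation_id` links reconstructs the order in which messages triggered one another.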
SECTION 6 — EVENT-DRIVEN ARCHITECTURE
Event-driven systems communicate by facts, not commands.
Example:
- ❌ “Charge this payment”
- ✅ “PaymentRequested”
- ✅ “PaymentCompleted”
- ✅ “PaymentFailed”
Events represent things that happened, not things you want to happen.
This distinction matters enormously.
Benefits
- loose coupling
- scalability
- extensibility
- auditability
Costs
- complexity
- eventual consistency
- harder debugging
Elite engineers accept these tradeoffs intentionally.
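To illustrate the loose coupling, here is a toy in-process event bus. `EventBus` is a hypothetical class, and a real system would publish through a broker rather than call handlers directly; the point is only that the publisher states a fact and does not know who reacts to it.

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, data):
        # the publisher announces what happened; subscribers decide what to do
        for handler in self._handlers[event_type]:
            handler(data)

bus = EventBus()
emails, metrics = [], []
bus.subscribe("PaymentCompleted", lambda e: emails.append(f"receipt for {e['order_id']}"))
bus.subscribe("PaymentCompleted", lambda e: metrics.append(e["amount"]))
bus.publish("PaymentCompleted", {"order_id": "o-1", "amount": 25})
```

Adding a third reaction, such as a loyalty-points update, means adding one more `subscribe` call; the code that publishes `PaymentCompleted` stays untouched.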
SECTION 7 — WORKFLOWS VS EVENTS
Two styles exist:
Choreography (Event-Driven)
- services react to events
- no central coordinator
- highly decoupled
- harder to reason about
Orchestration (Workflow-Driven)
- central orchestrator
- explicit state machine
- easier correctness
- more coupling
Elite Rule
Use orchestration for business-critical workflows.
Use events for side effects and extensions.
Payments, bookings, onboarding → orchestration
Analytics, notifications → events
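The "explicit state machine" behind orchestration can be as small as a transition table. The states and event names below are assumptions chosen to match a booking flow; the sketch shows how an orchestrator can reject anything that is not a declared transition.

```python
# (current state, event) -> next state; anything not listed is illegal
TRANSITIONS = {
    ("PENDING", "seat_reserved"): "RESERVED",
    ("RESERVED", "payment_completed"): "PAID",
    ("RESERVED", "payment_failed"): "CANCELLED",
    ("PAID", "booking_confirmed"): "CONFIRMED",
}

def advance(state, event):
    """Return the next state, or raise if the transition is not declared."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}") from None
```

Because every legal path is written down in one place, a late or duplicated `payment_completed` arriving in state `CONFIRMED` fails loudly instead of silently corrupting the workflow.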
SECTION 8 — SAGAS (DISTRIBUTED TRANSACTIONS)
In distributed systems:
A single ACID transaction spanning several services is rarely practical.
Sagas replace global transactions with a chain of local ones.
Saga Definition
A saga is:
- a sequence of steps
- each step has a compensating action
- system is eventually consistent
Example: Booking Saga
1. Reserve seat
2. Charge payment
3. Confirm booking
If step 2 fails:
- compensate step 1 → release seat
Key Properties
- each step is idempotent
- compensations are explicit
- failures are expected
- retries are normal
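The booking saga can be sketched as a list of (step, compensation) pairs. `run_saga` and the step functions are hypothetical, the payment is hard-coded to fail to show the compensation path, and a production saga would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # undo already-completed steps, most recent first
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log

seats = {"12A": "free"}

def reserve_seat(): seats["12A"] = "held"
def release_seat(): seats["12A"] = "free"
def charge_payment(): raise RuntimeError("card declined")  # simulated failure
def refund_payment(): pass
def confirm_booking(): seats["12A"] = "booked"
def cancel_booking(): pass

ok, log = run_saga([
    ("reserve_seat", reserve_seat, release_seat),
    ("charge_payment", charge_payment, refund_payment),
    ("confirm_booking", confirm_booking, cancel_booking),
])
```

This sketch assumes a step that raised had no effect, so only the steps that completed are compensated; here that means the seat is released and the third step never runs.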
SECTION 9 — FAILURE IS THE DEFAULT STATE
Elite backend engineers assume:
- workers crash
- messages duplicate
- services time out
- databases slow down
- networks partition
Therefore systems must:
- retry safely
- back off exponentially
- isolate failures
- recover automatically
SECTION 10 — RETRIES DONE RIGHT
Retries are dangerous if misused.
Rules:
- retries must be bounded
- retries must be idempotent
- retries must have backoff
- retries must use jitter
Never retry blindly.
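A minimal sketch of a retry helper that is bounded, backs off exponentially, and applies jitter is shown below. The function name `retry` and its default values are assumptions; the `sleep` parameter is injectable only so the helper can be exercised without real waiting. Idempotency remains the responsibility of `operation`.

```python
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(); on failure wait a random, growing delay and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                          # bounded: give up and surface the error
            # exponential backoff, capped at max_delay
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # jitter: a random delay in [0, cap] so clients do not retry in lockstep
            sleep(random.uniform(0, cap))
```

In practice the `except Exception` clause would be narrowed to errors known to be transient, since retrying a validation error only wastes capacity.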
Retry Storms
If many clients retry at the same moment, the extra load can collapse a service that was already struggling.
Elite engineers:
- cap retries
- use circuit breakers
- shed load intentionally
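A circuit breaker can be sketched in a few lines. The class below is a simplified illustration with hypothetical names and thresholds: after a number of consecutive failures it fails fast for a cooling-off period, then lets a trial call through. The `clock` parameter is injectable so the behavior can be checked without waiting.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0          # consecutive failures seen so far
        self.opened_at = None      # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")   # shed load on purpose
            # timeout elapsed: half-open, allow this call through as a trial
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()            # open (or re-open)
            raise
        self.failures = 0          # success closes the circuit again
        self.opened_at = None
        return result
```

While the circuit is open the downstream service receives no traffic from this caller, which gives it room to recover instead of being hammered by retries.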
SECTION 11 — DEAD LETTER QUEUES (DLQ)
Any queue-based system without a DLQ is incomplete.
DLQs are for:
- poison messages (messages that fail every time they are processed)
- repeated failures
- manual inspection
- remediation
Elite engineers:
- monitor DLQ volume
- alert on growth
- replay messages safely
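The routing rule behind a DLQ can be sketched as follows. `drain`, the message fields (`body`, `attempts`, `last_error`), and the limit of three attempts are assumptions for illustration; managed brokers typically implement the same idea through a maximum-delivery-count setting.

```python
from collections import deque

def drain(main_queue, handler, max_attempts=3):
    """Process a deque of messages; return those that exhausted their attempts."""
    dead_letters = []
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = repr(exc)    # keep context for inspection
            if message["attempts"] >= max_attempts:
                dead_letters.append(message)     # stop retrying, park it
            else:
                main_queue.append(message)       # requeue for another attempt
    return dead_letters
```

The poison message stops blocking healthy traffic once it is parked, and the recorded `last_error` gives whoever inspects the DLQ a starting point before replaying it.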
SECTION 12 — TIME AS A FIRST-CLASS CONCERN
Workflows span time:
- minutes
- hours
- days
Systems must handle:
- delayed jobs
- scheduled retries
- timeouts
- expiration
Time introduces complexity; ignoring it leads to stuck workflows and corrupted state.
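Delayed jobs and expiration can be sketched with a priority queue ordered by run time. `DelayedQueue` is a hypothetical class, times are plain numbers passed in by the caller rather than read from a clock, and a real scheduler would persist its jobs.

```python
import heapq
import itertools

class DelayedQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker so jobs are never compared

    def schedule(self, job, run_at, expires_at=None):
        heapq.heappush(self._heap, (run_at, next(self._seq), job, expires_at))

    def pop_due(self, now):
        """Return (due, expired): jobs ready to run, and jobs that are too late."""
        due, expired = [], []
        while self._heap and self._heap[0][0] <= now:
            _, _, job, expires_at = heapq.heappop(self._heap)
            if expires_at is not None and now >= expires_at:
                expired.append(job)     # running it now would be wrong
            else:
                due.append(job)
        return due, expired
```

Passing `now` explicitly keeps the time dependency visible and makes the behavior easy to check, for example that a job scheduled for time 5 with an expiry of 8 is reported as expired when first polled at time 12.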
SECTION 13 — OBSERVABILITY IN ASYNC SYSTEMS
You must observe:
- queue depth
- processing latency
- retry counts
- DLQ size
- success vs failure rate
Without this:
Your system is operating blind.
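As a rough sketch, the signals listed above can be collected by a small in-process recorder. `QueueMetrics` and its field names are hypothetical; a real deployment would export these values to a metrics system rather than hold them in memory.

```python
from collections import Counter

class QueueMetrics:
    def __init__(self):
        self.counts = Counter()
        self.latencies = []

    def record(self, enqueued_at, finished_at, ok, retries=0):
        """Record one processed message."""
        self.counts["success" if ok else "failure"] += 1
        self.counts["retries"] += retries
        self.latencies.append(finished_at - enqueued_at)   # queue wait + processing

    def snapshot(self, queue_depth, dlq_size):
        total = self.counts["success"] + self.counts["failure"]
        return {
            "queue_depth": queue_depth,
            "dlq_size": dlq_size,
            "retries": self.counts["retries"],
            "failure_rate": self.counts["failure"] / total if total else 0.0,
            "max_latency": max(self.latencies, default=0.0),
        }
```

Latency here is measured from enqueue to completion, so a growing backlog shows up in the numbers even when each individual job is fast.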
SECTION 14 — COMMON ASYNC TRAPS
❌ Fire-and-forget messages
❌ No idempotency
❌ No DLQ
❌ No retries
❌ Long-running workers with no checkpoints
❌ Hidden coupling through shared DBs
These traps cause catastrophic outages.
SECTION 15 — SIGNALS YOU’VE MASTERED ASYNC BACKENDS
You know you’re there when:
- you naturally split sync vs async paths
- you design workflows explicitly
- you assume retries & duplicates
- you can reason about partial failure
- you can explain eventual consistency calmly