
Async Processing, Queues & Sagas

SECTION 1 — WHY ASYNCHRONY IS UNAVOIDABLE

Real systems cannot be fully synchronous because:

  • users are slow

  • networks are unreliable

  • external services fail

  • workloads spike unpredictably

  • some work is expensive

Therefore:

Any serious backend system is partially asynchronous.

Trying to keep everything synchronous:

  • increases latency

  • reduces availability

  • couples systems tightly

  • causes cascading failures


SECTION 2 — SYNCHRONOUS VS ASYNCHRONOUS BOUNDARIES

Elite engineers are intentional about where async boundaries exist.

Synchronous (User-Facing)

  • validation

  • authorization

  • lightweight reads

  • state checks

Asynchronous (Background)

  • emails / notifications

  • payment confirmation

  • analytics

  • document processing

  • integrations

  • retries

Rule:

If the user doesn’t need the result immediately, it should be async.


SECTION 3 — QUEUES AS LOAD-LEVELERS

Queues exist to:

  • absorb spikes

  • decouple producers from consumers

  • smooth load

  • enable retries

  • provide durability

They turn:

“Do this now”

into

“Do this reliably.”


Queue Mental Model

Producer → Queue → Worker(s)

Key properties:

  • producers don’t care who processes

  • workers don’t care who produced

  • failure is isolated

This separation is critical for scale.
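The mental model above can be sketched with Python's standard-library queue. The `handle_job` function and the single worker thread are illustrative stand-ins, not a production design:

```python
import queue
import threading

def handle_job(job):
    # Hypothetical work: the worker neither knows nor cares who produced the job.
    return job.upper()

jobs = queue.Queue()          # the queue decouples producer from worker
results = []

def worker():
    while True:
        job = jobs.get()
        if job is None:       # sentinel: shut down cleanly
            break
        results.append(handle_job(job))
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Producer: just enqueues and moves on.
for job in ["send-email", "resize-image"]:
    jobs.put(job)

jobs.put(None)                # signal shutdown
t.join()
print(results)                # ['SEND-EMAIL', 'RESIZE-IMAGE']
```

The producer returns the moment `put` succeeds; the worker's failures never propagate back to it.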


SECTION 4 — DELIVERY SEMANTICS (THE HARD PART)

Every queue system has delivery guarantees:

At-Most-Once

  • fast

  • can lose messages

  • no retries

At-Least-Once

  • messages may repeat

  • must handle duplicates

  • most common in production

Exactly-Once

  • extremely difficult

  • usually simulated via idempotency


Elite Rule

Assume at-least-once delivery and design idempotently.

Never assume “exactly once” unless you deeply understand the system.
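A minimal sketch of an idempotent consumer under at-least-once delivery. The `(message_id, payload)` shape and the in-memory dedup set are assumptions for illustration; production systems persist the seen-IDs store:

```python
# At-least-once delivery: the broker may redeliver, so the consumer
# deduplicates by message ID before doing the work.
processed_ids = set()   # in production: a persistent store, not a set
ledger = []

def process(message):
    message_id, payload = message   # hypothetical message shape
    if message_id in processed_ids:
        return "duplicate-skipped"  # idempotent: a second delivery is a no-op
    processed_ids.add(message_id)
    ledger.append(payload)
    return "processed"

# The broker redelivers message "m1":
deliveries = [("m1", "charge $10"), ("m1", "charge $10"), ("m2", "charge $5")]
outcomes = [process(m) for m in deliveries]
print(ledger)   # ['charge $10', 'charge $5'] -- charged once despite redelivery
```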


SECTION 5 — IDENTITY IN ASYNC SYSTEMS

Every message must have:

  • unique ID

  • correlation ID

  • causation ID (optional but powerful)

Why?

  • tracing

  • deduplication

  • debugging

  • audits

If you can’t trace a message across services, you don’t control the system.
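The three IDs above fit naturally into a message envelope. The field names here are illustrative, not a wire standard:

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Envelope:
    """Minimal message envelope; field names are illustrative."""
    payload: dict
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # dedup key
    correlation_id: str = ""   # ties all messages of one user action together
    causation_id: str = ""     # the message_id that directly caused this one

# A request fans out into a follow-up message that shares its correlation ID:
request = Envelope(payload={"action": "book-seat"}, correlation_id="req-42")
follow_up = Envelope(
    payload={"action": "charge-card"},
    correlation_id=request.correlation_id,   # same trace
    causation_id=request.message_id,         # caused by the request
)
print(follow_up.correlation_id)   # req-42
```

Following correlation IDs reconstructs the whole trace; following causation IDs reconstructs the exact chain of cause and effect.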


SECTION 6 — EVENT-DRIVEN ARCHITECTURE

Event-driven systems communicate by facts, not commands.

Example:

  • ❌ “Charge this payment”

  • ✅ “PaymentRequested”

  • ✅ “PaymentCompleted”

  • ✅ “PaymentFailed”

Events represent things that happened, not things you want to happen.

This distinction matters enormously.


Benefits

  • loose coupling

  • scalability

  • extensibility

  • auditability

Costs

  • complexity

  • eventual consistency

  • harder debugging

Elite engineers accept these tradeoffs intentionally.
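The loose coupling above can be sketched as a tiny in-process event bus (subscriber behaviors are hypothetical): publishers emit facts and never know who listens.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # event name -> list of handlers

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, data):
    # The publisher states a fact; zero, one, or many handlers may react.
    for handler in subscribers[event_name]:
        handler(data)

log = []
subscribe("PaymentCompleted", lambda e: log.append(f"receipt emailed to {e['user']}"))
subscribe("PaymentCompleted", lambda e: log.append(f"analytics recorded for {e['user']}"))

publish("PaymentCompleted", {"user": "alice"})
print(log)   # both subscribers reacted; the publisher knows neither of them
```

Adding a third reaction to "PaymentCompleted" requires no change to the publisher, which is the extensibility benefit listed above.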


SECTION 7 — WORKFLOWS VS EVENTS

Two styles exist:

Choreography (Event-Driven)

  • services react to events

  • no central coordinator

  • highly decoupled

  • harder to reason about

Orchestration (Workflow-Driven)

  • central orchestrator

  • explicit state machine

  • easier correctness

  • more coupling


Elite Rule

Use orchestration for business-critical workflows.

Use events for side effects and extensions.

Payments, bookings, onboarding → orchestration

Analytics, notifications → events


SECTION 8 — SAGAS (DISTRIBUTED TRANSACTIONS)

In distributed systems:

You cannot wrap multiple services in one global ACID transaction.

Sagas replace them.


Saga Definition

A saga is:

  • a sequence of steps

  • each step has a compensating action

  • system is eventually consistent


Example: Booking Saga

  1. Reserve seat

  2. Charge payment

  3. Confirm booking

If step 2 fails:

  • compensate step 1 → release seat

Key Properties

  • each step is idempotent

  • compensations are explicit

  • failures are expected

  • retries are normal


SECTION 9 — FAILURE IS THE DEFAULT STATE

Elite backend engineers assume:

  • workers crash

  • messages duplicate

  • services time out

  • databases slow down

  • networks partition

Therefore systems must:

  • retry safely

  • back off exponentially

  • isolate failures

  • recover automatically


SECTION 10 — RETRIES DONE RIGHT

Retries are dangerous if misused.

Rules:

  • retries must be bounded

  • retries must be idempotent

  • retries must have backoff

  • retries must use jitter

Never retry blindly.
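The four rules combine into one small function. This sketch uses "full jitter", one common scheme; the parameter values are arbitrary defaults, not recommendations:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Bounded, exponential, capped, jittered retry delays (in seconds)."""
    for attempt in range(max_retries):             # bounded: never retry forever
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield random.uniform(0, ceiling)           # jitter desynchronizes retriers

delays = list(backoff_delays())
print(len(delays))   # 5 -- retries stop after the bound
```

The jitter is what prevents a fleet of clients that failed at the same instant from all retrying at the same instant again.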


Retry Storms

If many services retry simultaneously → system collapse.

Elite engineers:

  • cap retries

  • use circuit breakers

  • shed load intentionally


SECTION 11 — DEAD LETTER QUEUES (DLQ)

Any system without a DLQ is incomplete.

DLQs are for:

  • poisoned messages

  • repeated failures

  • manual inspection

  • remediation

Elite engineers:

  • monitor DLQ volume

  • alert on growth

  • replay messages safely


SECTION 12 — TIME AS A FIRST-CLASS CONCERN

Workflows span time:

  • minutes

  • hours

  • days

Systems must handle:

  • delayed jobs

  • scheduled retries

  • timeouts

  • expiration

Time introduces complexity; ignoring it leaves workflows stuck and state corrupted.
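Delayed jobs can be sketched as a min-heap keyed by due time. Times are plain numbers here so the example stays deterministic; real systems use wall-clock timestamps and persistent storage:

```python
import heapq

# Jobs carry a due time and only become visible once "now" passes it.
scheduled = []   # min-heap of (due_time, job)

def schedule(job, due_time):
    heapq.heappush(scheduled, (due_time, job))

def due_jobs(now):
    """Pop every job whose due time has passed."""
    ready = []
    while scheduled and scheduled[0][0] <= now:
        ready.append(heapq.heappop(scheduled)[1])
    return ready

schedule("send-reminder", due_time=60)
schedule("expire-seat-hold", due_time=10)

early = due_jobs(now=5)     # [] -- nothing due yet
later = due_jobs(now=30)    # ['expire-seat-hold']
last = due_jobs(now=120)    # ['send-reminder']
print(early, later, last)
```

Scheduled retries, timeouts, and expirations are all the same mechanism: a job whose due time is in the future.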


SECTION 13 — OBSERVABILITY IN ASYNC SYSTEMS

You must observe:

  • queue depth

  • processing latency

  • retry counts

  • DLQ size

  • success vs failure rate

Without this:

Your system is operating blind.
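A sketch of what instrumenting a worker looks like. The `Counter` stands in for a real metrics client, and the retry accounting is an assumed policy (every failure triggers a retry):

```python
from collections import Counter

metrics = Counter()   # stand-in for a real metrics client

def observed_handle(message, handler, queue_depth):
    metrics["queue_depth"] = queue_depth   # gauge: how far behind are we?
    try:
        handler(message)
        metrics["success"] += 1
    except Exception:
        metrics["failure"] += 1
        metrics["retries"] += 1            # assume each failure triggers a retry

def ok(message):
    pass

def boom(message):
    raise RuntimeError("handler failed")

observed_handle("a", ok, queue_depth=2)
observed_handle("b", boom, queue_depth=1)
print(dict(metrics))   # {'queue_depth': 1, 'success': 1, 'failure': 1, 'retries': 1}
```

Queue depth and DLQ size are gauges (current values); successes, failures, and retries are counters (rates) — alerting needs both kinds.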


SECTION 14 — COMMON ASYNC TRAPS

❌ Fire-and-forget messages

❌ No idempotency

❌ No DLQ

❌ No retries

❌ Long-running workers with no checkpoints

❌ Hidden coupling through shared DBs

These traps cause catastrophic outages.


SECTION 15 — SIGNALS YOU’VE MASTERED ASYNC BACKENDS

You know you’re there when:

  • you naturally split sync vs async paths

  • you design workflows explicitly

  • you assume retries & duplicates

  • you can reason about partial failure

  • you can explain eventual consistency calmly