SECTION 1 — WHAT IS A DISTRIBUTED SYSTEM?
A distributed system is:
A system that runs on multiple machines and behaves as if it were one.
But the real definition for staff-level engineers:
A distributed system is one where things fail constantly, and your job is to make the failure invisible.
SECTION 2 — THE 8 FALLACIES OF DISTRIBUTED SYSTEMS
These are the classic false assumptions; every expert internalizes that none of them hold:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is a single administrator
- Transport cost is zero
- The network is homogeneous
Every distributed system failure is a violation of one of these assumptions.
SECTION 3 — CONSISTENCY MODELS (REAL ENGINEERING VIEW)
Consistency models define visibility guarantees across nodes.
Strong Consistency
A read after a write ALWAYS returns the latest value.
Used for:
- banking
- authentication
- state machines
Examples:
- Spanner
- PostgreSQL (single region)
Sequential Consistency
All nodes observe operations in the same global order, consistent with each process's own program order.
Rare as an explicit guarantee in mainstream systems.
Eventual Consistency
Data becomes consistent over time, not immediately.
Used for:
- feeds
- notifications
- chat
- search
- analytics
Examples:
- DynamoDB
- Cassandra
- Elasticsearch
Causal Consistency
Preserves cause → effect ordering.
Used for:
- real-time collaboration
- chat applications
Monotonic Reads
Every read moves forward in time.
Read-Your-Writes
You can always see your latest write.
SECTION 4 — REPLICATION (LEADER-BASED VS LEADERLESS)
Replication improves availability, durability, and read performance.
Two main categories:
1. Leader-Based Replication
One node = Leader
Others = Followers
Writes → Leader
Reads → Followers (optional)
Examples:
- PostgreSQL
- MySQL
- MongoDB (replica sets)
Pros:
- strong consistency
- simple mental model
Cons:
- leader bottleneck
- failover complexity
2. Leaderless Replication
Clients write to ANY node.
Examples:
- DynamoDB
- Cassandra
- Riak
Uses quorum techniques:
- W = write quorum size
- R = read quorum size
- N = total replicas
Every read overlaps the latest successful write if:
W + R > N
This is how Amazon’s systems achieve massive durability and availability.
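The quorum-overlap rule can be checked mechanically. A minimal sketch in Python (the function name is illustrative, not from any library):

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum.

    n: total replicas, w: write quorum size, r: read quorum size.
    If W + R > N, any R replicas you read include at least one of
    the W replicas that acknowledged the latest write.
    """
    return w + r > n

# Typical Dynamo-style settings: N=3, W=2, R=2 -> overlapping quorums.
assert quorums_overlap(3, 2, 2)
# W=1, R=1 favors latency, but a read can miss the latest write.
assert not quorums_overlap(3, 1, 1)
```

Tuning W and R trades latency against consistency without changing N.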
SECTION 5 — CONSENSUS PROTOCOLS
Consensus = ensuring nodes agree on the same state.
Two major families:
1. Paxos
Complex, academic, hard to implement.
Used by:
- Google Chubby lock service
- Spanner (Paxos plus TrueTime, which relies on GPS and atomic clocks)
2. Raft (Modern Standard)
Much easier to understand and implement.
Used by:
- etcd (Kubernetes)
- Consul
- Nomad
- CockroachDB
- TiDB
Raft provides:
- leader election
- log replication
- safety guarantees
- cluster membership
This is the foundation of modern distributed system control planes.
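One piece of Raft's log replication is small enough to sketch: the leader may mark an entry committed once a majority of servers store it. This is a simplified illustration (the real protocol additionally requires the entry to belong to the leader's current term):

```python
def committed_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of servers.

    match_index[i] is the highest log entry known to be replicated
    on server i (the leader counts itself). Raft's leader may mark
    an index committed once a majority of servers have it.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    return ranked[majority - 1]

# 5-server cluster: indexes 7, 7, 5, 4, 2 are replicated.
# A majority (3 servers) all hold index 5, so 5 is committable.
assert committed_index([7, 7, 5, 4, 2]) == 5
```

Once an index is committed, every server can safely apply the log up to that point.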
SECTION 6 — LEADER ELECTION
Used when you need one authoritative node.
Systems use:
- Raft
- ZooKeeper
- etcd
Leader responsibilities:
- assign work
- manage locks
- maintain metadata
- coordinate tasks
If a leader dies → new election.
SECTION 7 — DISTRIBUTED LOCKS
Used when multiple nodes must not mutate shared state simultaneously.
Redis-based Locking (Redlock)
The Redlock algorithm (proposed by the author of Redis):
- acquire the lock on multiple independent Redis nodes
- require success on a majority of them
- expire locks automatically so a crashed holder cannot block forever
Used by:
- Stripe
- Shopify
- Airbnb
Database Advisory Locks
Example:
SELECT pg_advisory_lock(12345)
Used when you want:
- cross-application mutual exclusion
- safe critical sections
ZooKeeper / etcd Locks
Stronger guarantees than Redis.
Used in:
- Kafka partition leadership
- Kubernetes scheduling
SECTION 8 — FAULT TOLERANCE PATTERNS
Staff-level engineers design systems to fail gracefully, not perfectly.
Pattern 1 — Retries
Every distributed call MUST have a retry policy.
Rules:
- retry with exponential backoff
- add jitter to avoid a thundering herd
- cap retry attempts
- retry only idempotent operations, and detect idempotency violations
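The rules above can be sketched with capped exponential backoff and full jitter (function and parameter names are illustrative):

```python
import random
import time

def retry(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry with capped exponential backoff and full jitter.

    Only safe for idempotent operations: a call that timed out may
    still have succeeded on the remote side.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts capped: surface the failure
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) capped at
            # max_delay, with full jitter to spread out retriers.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Full jitter (a uniform draw up to the backoff ceiling) is what desynchronizes clients that all failed at the same moment.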
Pattern 2 — Circuit Breakers
If a dependency is failing → STOP sending requests.
Circuit States:
Closed → Open → Half-Open → Closed
This prevents:
- cascading failures
- retry storms
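The Closed → Open → Half-Open cycle fits in a small class. A minimal sketch (thresholds and names are illustrative, not from any library):

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures.
    Open -> Half-Open after a cooldown (one trial call allowed).
    Half-Open -> Closed on success, back to Open on failure."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None while Closed (or during a trial)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: Half-Open trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip to Open
            raise
        self.failures = 0  # success: back to (or stay) Closed
        return result
```

Failing fast while Open is the point: the struggling dependency gets breathing room instead of a retry storm.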
Pattern 3 — Bulkheads
Partition system resources so that a failure in one service does NOT affect the others.
Inspired by ship design.
Pattern 4 — Timeouts Everywhere
Never block indefinitely.
Every request MUST have:
- a timeout
- cancellation logic
- a retry policy
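A sketch of a timeout wrapper using only the standard library. Note the caveat in the docstring: bounding your wait is not the same as cancelling the work.

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and give up after timeout_s seconds.

    Caveat: the abandoned worker keeps running in the background, so
    real systems must also propagate cancellation to the remote call
    (e.g. close the connection).
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn).result(timeout=timeout_s)

assert call_with_timeout(lambda: 42, timeout_s=1.0) == 42

try:
    call_with_timeout(lambda: time.sleep(0.3), timeout_s=0.05)
except concurrent.futures.TimeoutError:
    pass  # gave up after 50 ms instead of blocking indefinitely
```

In production code the timeout usually lives in the client library (socket, HTTP, or RPC deadline) rather than a wrapper thread.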
Pattern 5 — Load Shedding
Drop low-priority traffic during overload.
Better to fail some requests than ALL requests.
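One common way to shed by priority is to reserve headroom for high-priority traffic. A minimal sketch (the reserved fraction and class name are illustrative):

```python
class LoadShedder:
    """Admit requests while capacity remains; shed low-priority
    traffic first by reserving headroom for high-priority calls."""
    def __init__(self, capacity, reserved_fraction=0.2):
        self.capacity = capacity
        # Low-priority traffic may only use the unreserved share.
        self.low_priority_limit = int(capacity * (1 - reserved_fraction))
        self.in_flight = 0

    def try_admit(self, priority):
        limit = (self.capacity if priority == "high"
                 else self.low_priority_limit)
        if self.in_flight >= limit:
            return False  # shed: fail fast rather than queue forever
        self.in_flight += 1
        return True

    def done(self):
        self.in_flight -= 1
```

Rejected requests should get an immediate error (e.g. HTTP 503) so clients can back off instead of piling onto a growing queue.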
Pattern 6 — Dead Letter Queues (DLQs)
Any message that fails repeatedly → goes to DLQ.
Used for:
- payment failures
- webhook retries
- email bounce workflows
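The consume-retry-park loop behind a DLQ is simple. A sketch with in-memory lists standing in for real queues (names are illustrative):

```python
def consume(messages, handler, dlq, max_attempts=3):
    """Deliver each message to handler with bounded retries; after
    max_attempts failures, park the message in the dead letter queue
    instead of blocking the rest of the stream."""
    for msg in messages:
        for attempt in range(max_attempts):
            try:
                handler(msg)
                break  # processed successfully
            except Exception:
                if attempt == max_attempts - 1:
                    dlq.append(msg)  # poison message: inspect later
```

The key property: one poison message lands in the DLQ for a human (or a repair job) while healthy messages keep flowing.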
SECTION 9 — QUEUES VS STREAMS VS PUB/SUB
This is a common interview question — and foundational knowledge.
Queues
- point-to-point
- message consumed once
- ordering guarantees optional
- retries + DLQ common
Examples:
- SQS
- RabbitMQ
Use when:
- workers process jobs
- order doesn’t matter much
- throughput is moderate/high
Streams
- append-only logs
- multiple consumers reading independently
- events are replayable
- partitioned for scaling
Examples:
- Kafka
- Kinesis
Use for:
- analytics
- event sourcing
- real-time dashboards
- audit logs
Pub/Sub
- fan-out
- broadcast model
Examples:
- Redis pub/sub
- Google Pub/Sub
Use for:
- notifications
- real-time updates
- UI subscriptions
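The broadcast semantics that distinguish pub/sub from a queue fit in a few lines. An in-memory sketch (class and method names are illustrative):

```python
from collections import defaultdict

class PubSub:
    """Minimal in-memory fan-out: every subscriber of a topic gets
    its own copy of each published message (broadcast), unlike a
    queue where exactly one consumer takes each message."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        inbox = []  # each subscriber gets a private inbox
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self._subscribers[topic]:
            inbox.append(message)  # fan out to every subscriber
```

Real brokers add durability, acknowledgements, and network transport, but the fan-out shape is the same.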
SECTION 10 — CRDTs (Conflict-Free Replicated Data Types)
This is a rarely understood but extremely powerful concept.
CRDTs allow:
- multi-region writes
- offline-first apps
- conflict-free merges
- distributed collaboration
Used in:
- Figma
- Google Docs (which uses the closely related operational-transformation approach)
- Notion
- multiplayer apps
Types:
- G-Counter
- PN-Counter
- OR-Set
- LWW-Register
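The simplest of these, the G-Counter, shows why merges need no coordination. A minimal sketch in Python:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge is an
    element-wise max. Merges are commutative, associative, and
    idempotent, so replicas converge in any merge order."""
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.replica_id] += 1  # only touch our own slot

    def value(self):
        return sum(self.slots)  # total across all replicas

    def merge(self, other):
        self.slots = [max(a, b)
                      for a, b in zip(self.slots, other.slots)]
```

Because each replica only ever increments its own slot, element-wise max never loses an update, no matter how messages are delayed or reordered.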
CRDTs guarantee:
Distributed updates can merge without coordination.
This eliminates the need for:
- locks
- consensus
- manual conflict resolution
SECTION 11 — REAL-TIME SYSTEMS (WEBRTC, SOCKETS, PUB/SUB)
Real-time systems need:
- low latency
- quick state synchronization
- message ordering guarantees
- presence tracking
Technologies:
- WebSockets
- WebRTC
- QUIC
- MQTT
Patterns:
- fan-out messaging
- presence heartbeat
- status synchronization
- TURN servers for NAT traversal
This directly applies to your Veterinary Call Flow work:
- presence system
- reconnect behavior
- heartbeat signals
- call lifecycle state machine
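The presence-heartbeat pattern above can be sketched as a last-seen map with a timeout (class name and timeout values are illustrative):

```python
import time

class PresenceTracker:
    """A peer is online while heartbeats keep arriving; it is
    considered offline once no heartbeat has been seen within the
    timeout."""
    def __init__(self, timeout_s=15.0):
        self.timeout_s = timeout_s
        self._last_seen = {}  # peer_id -> monotonic timestamp

    def heartbeat(self, peer_id):
        self._last_seen[peer_id] = time.monotonic()

    def online_peers(self):
        now = time.monotonic()
        return {peer for peer, seen in self._last_seen.items()
                if now - seen <= self.timeout_s}
```

Pick the timeout as a small multiple of the heartbeat interval so one dropped packet does not flap a peer offline.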
SECTION 12 — EVENTUAL CONSISTENCY & COMPENSATING TRANSACTIONS
Across service boundaries, you CANNOT rely on a single global ACID transaction.
Instead, you use:
- sagas
- compensating transactions
- orchestration vs choreography
Example:
Booking workflow:
1. Reserve seat
2. Charge payment
3. Confirm reservation
If payment fails → cancel seat reservation.
This is a compensating action.
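The booking saga can be sketched as a list of (action, compensation) pairs; on failure, the completed steps are undone in reverse (all names here are illustrative):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If any action
    fails, undo the completed steps in reverse order by running
    their compensations (no global ACID abort is available)."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()  # compensating transaction
            return False
        completed.append(compensate)
    return True

log = []

def charge_payment():
    raise RuntimeError("card declined")  # simulate payment failure

booking = [
    (lambda: log.append("seat reserved"),
     lambda: log.append("seat released")),
    (charge_payment,
     lambda: log.append("refund issued")),
    (lambda: log.append("reservation confirmed"),
     lambda: log.append("reservation cancelled")),
]

assert run_saga(booking) is False
assert log == ["seat reserved", "seat released"]
```

Note the compensation is a new forward action ("release the seat"), not a rollback: the intermediate state was visible to the rest of the system.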
SECTION 13 — HOW DISTRIBUTED SYSTEMS FAIL (THE REALITY)
Everything fails eventually.
Failures include:
- partial outages
- network partitions
- slow queries
- stale replicas
- leader failure
- GC pauses
- out-of-order messages
- duplicate messages
Your job is NOT to prevent failure.
Your job is to make failure harmless.
SECTION 14 — HOW TOP ENGINEERS DEBUG DISTRIBUTED SYSTEMS
They don’t guess — they follow signals.
They look at:
- traces
- logs across services
- message IDs across queues
- correlation IDs
- metric spikes
- replica lag
- retry storms
- DLQ volume
They ask:
- Where is state inconsistent?
- Which service dropped messages?
- Which dependency is slowing everything down?
- Is a retry storm happening?
- Is a single shard overloaded?
This is expert-level debugging.