SECTION 1 — WHAT IS A DISTRIBUTED SYSTEM?
A distributed system is:
A system that runs on multiple machines and behaves as if it were one.
But the real definition for staff-level engineers:
A distributed system is one where things fail constantly, and your job is to make the failure invisible.
SECTION 2 — THE 8 FALLACIES OF DISTRIBUTED SYSTEMS
These are the classic false assumptions; every expert internalizes that none of them hold:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is a single administrator
- Transport cost is zero
- The network is homogeneous
Every distributed system failure is a violation of one of these assumptions.
SECTION 3 — CONSISTENCY MODELS (REAL ENGINEERING VIEW)
Consistency models define visibility guarantees across nodes.
Strong Consistency
A read after a write ALWAYS returns the latest value.
Used for:
- banking
- authentication
- state machines
Examples:
- Spanner
- PostgreSQL (single region)
Sequential Consistency
All nodes observe operations in the same global order, consistent with each process's own program order.
Rare as an explicit guarantee in mainstream systems.
Eventual Consistency
Data becomes consistent over time, not immediately.
Used for:
- feeds
- notifications
- chat
- search
- analytics
Examples:
- DynamoDB
- Cassandra
- Elasticsearch
Causal Consistency
Preserves cause → effect ordering.
Used for:
- real-time collaboration
- chat applications
Monotonic Reads
Every read moves forward in time.
Read-Your-Writes
You can always see your latest write.
SECTION 4 — REPLICATION (LEADER-BASED VS LEADERLESS)
Replication improves availability, durability, and read performance.
Two main categories:
1. Leader-Based Replication
One node = Leader
Others = Followers
Writes → Leader
Reads → Followers (optional)
Examples:
- PostgreSQL
- MySQL
- MongoDB (replica sets)
Pros:
- strong consistency
- simple mental model
Cons:
- leader bottleneck
- failover complexity
2. Leaderless Replication
Clients write to ANY node.
Examples:
- DynamoDB
- Cassandra
- Riak
Uses quorum techniques:
- W = write quorum size
- R = read quorum size
- N = total replicas
Every read overlaps the latest successful write if:
W + R > N
This is how Amazon’s systems achieve massive durability and availability.
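The quorum-overlap rule can be checked mechanically. A minimal sketch in Python (the function name is illustrative, not from any library):

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum.

    n: total replicas, w: write quorum size, r: read quorum size.
    If W + R > N, any R replicas you read include at least one of
    the W replicas that acknowledged the latest write.
    """
    return w + r > n

# Typical Dynamo-style settings: N=3, W=2, R=2 -> overlapping quorums.
assert quorums_overlap(3, 2, 2)
# W=1, R=1 favors latency, but a read can miss the latest write.
assert not quorums_overlap(3, 1, 1)
```

Tuning W and R trades latency against consistency without changing N.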
SECTION 5 — CONSENSUS PROTOCOLS
Consensus = ensuring nodes agree on the same state.
Two major families:
1. Paxos
Complex, academic, hard to implement.
Used by:
- Google Chubby lock service
- Spanner (Paxos plus TrueTime, which relies on GPS and atomic clocks)
2. Raft (Modern Standard)
Much easier to understand and implement.
Used by:
- etcd (Kubernetes)
- Consul
- Nomad
- CockroachDB
- TiDB
Raft provides:
- leader election
- log replication
- safety guarantees
- cluster membership
This is the foundation of modern distributed system control planes.
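One piece of Raft's log replication is small enough to sketch: the leader may mark an entry committed once a majority of servers store it. This is a simplified illustration (the real protocol additionally requires the entry to belong to the leader's current term):

```python
def committed_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of servers.

    match_index[i] is the highest log entry known to be replicated
    on server i (the leader counts itself). Raft's leader may mark
    an index committed once a majority of servers have it.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    return ranked[majority - 1]

# 5-server cluster: indexes 7, 7, 5, 4, 2 are replicated.
# A majority (3 servers) all hold index 5, so 5 is committable.
assert committed_index([7, 7, 5, 4, 2]) == 5
```

Once an index is committed, every server can safely apply the log up to that point.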
SECTION 6 — LEADER ELECTION
Used when you need one authoritative node.
Systems use:
- Raft
- ZooKeeper
- etcd
Leader responsibilities:
- assign work
- manage locks
- maintain metadata
- coordinate tasks
If a leader dies → new election.
SECTION 7 — DISTRIBUTED LOCKS
Used when multiple nodes must not mutate shared state simultaneously.
Redis-based Locking (Redlock)
The Redlock algorithm (proposed by the author of Redis):
- acquire the lock on multiple independent Redis nodes
- require success on a majority of them
- expire locks automatically so a crashed holder cannot block forever
Used by:
- Stripe
- Shopify
- Airbnb
Database Advisory Locks
Example:
SELECT pg_advisory_lock(12345)
Used when you want:
- cross-application mutual exclusion
- safe critical sections
ZooKeeper / etcd Locks
Stronger guarantees than Redis.
Used in:
- Kafka partition leadership
- Kubernetes scheduling
SECTION 8 — FAULT TOLERANCE PATTERNS
Staff-level engineers design systems to fail gracefully, not perfectly.
Pattern 1 — Retries
Every distributed call MUST have a retry policy.
Rules:
- retry with exponential backoff
- add jitter to avoid a thundering herd
- cap retry attempts
- retry only idempotent operations, and detect idempotency violations
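The rules above can be sketched with capped exponential backoff and full jitter (function and parameter names are illustrative):

```python
import random
import time

def retry(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry with capped exponential backoff and full jitter.

    Only safe for idempotent operations: a call that timed out may
    still have succeeded on the remote side.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts capped: surface the failure
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) capped at
            # max_delay, with full jitter to spread out retriers.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Full jitter (a uniform draw up to the backoff ceiling) is what desynchronizes clients that all failed at the same moment.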
Pattern 2 — Circuit Breakers
If a dependency is failing → STOP sending requests.
Circuit States:
Closed → Open → Half-Open → Closed
This prevents:
- cascading failures
- retry storms
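The Closed → Open → Half-Open cycle fits in a small class. A minimal sketch (thresholds and names are illustrative, not from any library):

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures.
    Open -> Half-Open after a cooldown (one trial call allowed).
    Half-Open -> Closed on success, back to Open on failure."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None while Closed (or during a trial)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: Half-Open trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip to Open
            raise
        self.failures = 0  # success: back to (or stay) Closed
        return result
```

Failing fast while Open is the point: the struggling dependency gets breathing room instead of a retry storm.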
Pattern 3 — Bulkheads
Partition system resources so that a failure in one service does NOT affect the others.
Inspired by ship design.
Pattern 4 — Timeouts Everywhere
Never block indefinitely.
Every request MUST have:
- a timeout
- cancellation logic
- a retry policy
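A sketch of a timeout wrapper using only the standard library. Note the caveat in the docstring: bounding your wait is not the same as cancelling the work.

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and give up after timeout_s seconds.

    Caveat: the abandoned worker keeps running in the background, so
    real systems must also propagate cancellation to the remote call
    (e.g. close the connection).
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn).result(timeout=timeout_s)

assert call_with_timeout(lambda: 42, timeout_s=1.0) == 42

try:
    call_with_timeout(lambda: time.sleep(0.3), timeout_s=0.05)
except concurrent.futures.TimeoutError:
    pass  # gave up after 50 ms instead of blocking indefinitely
```

In production code the timeout usually lives in the client library (socket, HTTP, or RPC deadline) rather than a wrapper thread.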
Pattern 5 — Load Shedding
Drop low-priority traffic during overload.
Better to fail some requests than ALL requests.
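One common way to shed by priority is to reserve headroom for high-priority traffic. A minimal sketch (the reserved fraction and class name are illustrative):

```python
class LoadShedder:
    """Admit requests while capacity remains; shed low-priority
    traffic first by reserving headroom for high-priority calls."""
    def __init__(self, capacity, reserved_fraction=0.2):
        self.capacity = capacity
        # Low-priority traffic may only use the unreserved share.
        self.low_priority_limit = int(capacity * (1 - reserved_fraction))
        self.in_flight = 0

    def try_admit(self, priority):
        limit = (self.capacity if priority == "high"
                 else self.low_priority_limit)
        if self.in_flight >= limit:
            return False  # shed: fail fast rather than queue forever
        self.in_flight += 1
        return True

    def done(self):
        self.in_flight -= 1
```

Rejected requests should get an immediate error (e.g. HTTP 503) so clients can back off instead of piling onto a growing queue.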
Pattern 6 — Dead Letter Queues (DLQs)
Any message that fails repeatedly → goes to DLQ.
Used for:
- payment failures
- webhook retries
- email bounce workflows
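The consume-retry-park loop behind a DLQ is simple. A sketch with in-memory lists standing in for real queues (names are illustrative):

```python
def consume(messages, handler, dlq, max_attempts=3):
    """Deliver each message to handler with bounded retries; after
    max_attempts failures, park the message in the dead letter queue
    instead of blocking the rest of the stream."""
    for msg in messages:
        for attempt in range(max_attempts):
            try:
                handler(msg)
                break  # processed successfully
            except Exception:
                if attempt == max_attempts - 1:
                    dlq.append(msg)  # poison message: inspect later
```

The key property: one poison message lands in the DLQ for a human (or a repair job) while healthy messages keep flowing.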
SECTION 9 — QUEUES VS STREAMS VS PUB/SUB
This is a common interview question — and foundational knowledge.
Queues
- point-to-point
- message consumed once
- ordering guarantees optional
- retries + DLQ common
Examples:
- SQS
- RabbitMQ
Use when:
- workers process jobs
- order doesn’t matter much
- throughput is moderate/high
Streams
- append-only logs
- multiple consumers reading independently
- events are replayable
- partitioned for scaling
Examples:
- Kafka
- Kinesis
Use for:
- analytics
- event sourcing
- real-time dashboards
- audit logs
Pub/Sub
- fan-out
- broadcast model
Examples:
- Redis pub/sub
- Google Pub/Sub
Use for:
- notifications
- real-time updates
- UI subscriptions
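The broadcast semantics that distinguish pub/sub from a queue fit in a few lines. An in-memory sketch (class and method names are illustrative):

```python
from collections import defaultdict

class PubSub:
    """Minimal in-memory fan-out: every subscriber of a topic gets
    its own copy of each published message (broadcast), unlike a
    queue where exactly one consumer takes each message."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        inbox = []  # each subscriber gets a private inbox
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self._subscribers[topic]:
            inbox.append(message)  # fan out to every subscriber
```

Real brokers add durability, acknowledgements, and network transport, but the fan-out shape is the same.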
SECTION 10 — CRDTs (Conflict-Free Replicated Data Types)
This is a rarely understood but extremely powerful concept.
CRDTs allow:
- multi-region writes
- offline-first apps
- conflict-free merges
- distributed collaboration
Used in:
- Figma
- Google Docs (which uses the closely related operational-transformation approach)
- Notion
- multiplayer apps
Types:
- G-Counter
- PN-Counter
- OR-Set
- LWW-Register
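The simplest of these, the G-Counter, shows why merges need no coordination. A minimal sketch in Python:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge is an
    element-wise max. Merges are commutative, associative, and
    idempotent, so replicas converge in any merge order."""
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.replica_id] += 1  # only touch our own slot

    def value(self):
        return sum(self.slots)  # total across all replicas

    def merge(self, other):
        self.slots = [max(a, b)
                      for a, b in zip(self.slots, other.slots)]
```

Because each replica only ever increments its own slot, element-wise max never loses an update, no matter how messages are delayed or reordered.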
CRDTs guarantee:
Distributed updates can merge without coordination.
This eliminates the need for:
- locks
- consensus
- manual conflict resolution
SECTION 11 — REAL-TIME SYSTEMS (WEBRTC, SOCKETS, PUB/SUB)
Real-time systems need:
- low latency
- quick state synchronization
- message ordering guarantees
- presence tracking
Technologies:
- WebSockets
- WebRTC
- QUIC
- MQTT
Patterns:
- fan-out messaging
- presence heartbeat
- status synchronization
- TURN servers for NAT traversal
This directly applies to your Veterinary Call Flow work:
- presence system
- reconnect behavior
- heartbeat signals
- call lifecycle state machine
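The presence-heartbeat pattern above can be sketched as a last-seen map with a timeout (class name and timeout values are illustrative):

```python
import time

class PresenceTracker:
    """A peer is online while heartbeats keep arriving; it is
    considered offline once no heartbeat has been seen within the
    timeout."""
    def __init__(self, timeout_s=15.0):
        self.timeout_s = timeout_s
        self._last_seen = {}  # peer_id -> monotonic timestamp

    def heartbeat(self, peer_id):
        self._last_seen[peer_id] = time.monotonic()

    def online_peers(self):
        now = time.monotonic()
        return {peer for peer, seen in self._last_seen.items()
                if now - seen <= self.timeout_s}
```

Pick the timeout as a small multiple of the heartbeat interval so one dropped packet does not flap a peer offline.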
SECTION 12 — EVENTUAL CONSISTENCY & COMPENSATING TRANSACTIONS
Across service boundaries, you CANNOT rely on a single global ACID transaction.
Instead, you use:
- sagas
- compensating transactions
- orchestration vs choreography
Example:
Booking workflow:
1. Reserve seat
2. Charge payment
3. Confirm reservation
If payment fails → cancel seat reservation.
This is a compensating action.
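The booking saga can be sketched as a list of (action, compensation) pairs; on failure, the completed steps are undone in reverse (all names here are illustrative):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If any action
    fails, undo the completed steps in reverse order by running
    their compensations (no global ACID abort is available)."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()  # compensating transaction
            return False
        completed.append(compensate)
    return True

log = []

def charge_payment():
    raise RuntimeError("card declined")  # simulate payment failure

booking = [
    (lambda: log.append("seat reserved"),
     lambda: log.append("seat released")),
    (charge_payment,
     lambda: log.append("refund issued")),
    (lambda: log.append("reservation confirmed"),
     lambda: log.append("reservation cancelled")),
]

assert run_saga(booking) is False
assert log == ["seat reserved", "seat released"]
```

Note the compensation is a new forward action ("release the seat"), not a rollback: the intermediate state was visible to the rest of the system.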
SECTION 13 — HOW DISTRIBUTED SYSTEMS FAIL (THE REALITY)
Everything fails eventually.
Failures include:
- partial outages
- network partitions
- slow queries
- stale replicas
- leader failure
- GC pauses
- out-of-order messages
- duplicate messages
Your job is NOT to prevent failure.
Your job is to make failure harmless.
SECTION 14 — HOW TOP ENGINEERS DEBUG DISTRIBUTED SYSTEMS
They don’t guess — they follow signals.
They look at:
- traces
- logs across services
- message IDs across queues
- correlation IDs
- metric spikes
- replica lag
- retry storms
- DLQ volume
They ask:
- Where is state inconsistent?
- Which service dropped messages?
- Which dependency is slowing everything down?
- Is a retry storm happening?
- Is a single shard overloaded?
This is expert-level debugging.