
SECTION 1 — WHAT IS A DISTRIBUTED SYSTEM?

A distributed system is:

A system that runs on multiple machines and behaves as if it were one.

But the real definition for staff-level engineers:

A distributed system is one where things fail constantly, and your job is to make the failure invisible.


SECTION 2 — THE 8 FALLACIES OF DISTRIBUTED SYSTEMS

These are the classic false assumptions (the "fallacies of distributed computing") that every expert learns to distrust:

  1. The network is reliable

  2. Latency is zero

  3. Bandwidth is infinite

  4. The network is secure

  5. Topology doesn’t change

  6. There is a single administrator

  7. Transport cost is zero

  8. The network is homogeneous

Almost every distributed system failure traces back to code that silently relied on one of these assumptions.


SECTION 3 — CONSISTENCY MODELS (REAL ENGINEERING VIEW)

Consistency models define visibility guarantees across nodes.


Strong Consistency

A read after a write ALWAYS returns the latest value (linearizability is the formal version of this guarantee).

Used for:

  • banking

  • authentication

  • state machines

Examples:

  • Spanner

  • PostgreSQL (single region)


Sequential Consistency

All operations appear in some global order.

Rare in mainstream implementations.


Eventual Consistency

Data becomes consistent over time, not immediately.

Used for:

  • feeds

  • notifications

  • chat

  • search

  • analytics

Examples:

  • DynamoDB

  • Cassandra

  • Elasticsearch


Causal Consistency

Preserves cause → effect ordering.

Used for:

  • real-time collaboration

  • chat applications


Monotonic Reads

Successive reads never go backwards in time: once you have seen a value, you never see an older one.


Read-Your-Writes

You can always see your latest write.
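
These session guarantees can be sketched with a simple version check: a replica may serve a client's read only once it has applied the client's last write. This is an illustrative in-memory model, not any particular database's mechanism; all names are made up.

```python
# Sketch: read-your-writes via client-tracked versions (illustrative only).
# After a write, the client remembers the version it wrote; a replica may
# serve the read only if it has caught up to at least that version.

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        self.version, self.value = version, value

def read_your_writes(client_version, replicas):
    """Return a value from any replica that has seen the client's write."""
    for r in replicas:
        if r.version >= client_version:
            return r.value
    return None  # a real system would fall back to the leader here

leader, follower = Replica(), Replica()
leader.apply(1, "hello")          # write lands on the leader first
assert read_your_writes(1, [follower, leader]) == "hello"
assert read_your_writes(1, [follower]) is None  # stale follower is skipped
```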


SECTION 4 — REPLICATION (LEADER-BASED VS LEADERLESS)

Replication improves availability, durability, and read performance.

Two main categories:


1. Leader-Based Replication

One node = Leader

Others = Followers

Writes → Leader

Reads → Followers (optional)

Examples:

  • PostgreSQL

  • MySQL

  • MongoDB (replica sets)

Pros:

  • strong consistency

  • simple mental model

Cons:

  • leader bottleneck

  • failover complexity


2. Leaderless Replication

Clients write to ANY node.

Examples:

  • DynamoDB

  • Cassandra

  • Riak

Uses quorum techniques:

WRITE QUORUM (W)

READ QUORUM (R)

N = total replicas

Reads are guaranteed to overlap the latest acknowledged write if:

W + R > N

This is how Amazon’s systems achieve massive durability and availability.
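
The overlap rule can be demonstrated with a tiny in-memory model (illustrative only; real quorum systems also handle node failures and read repair). Because W + R > N, every read quorum shares at least one replica with the write quorum, so the highest-versioned value a read sees is always the latest write.

```python
# Sketch: why W + R > N guarantees that reads see the latest write.
import random

N, W, R = 3, 2, 2           # Dynamo-style defaults; W + R = 4 > N = 3
replicas = [(0, None)] * N  # each replica holds a (version, value) pair

def quorum_write(version, value):
    # A write succeeds once W replicas acknowledge it.
    for i in random.sample(range(N), W):
        replicas[i] = (version, value)

def quorum_read():
    # A read contacts R replicas and returns the highest-versioned value.
    sampled = random.sample(replicas, R)
    return max(sampled, key=lambda rep: rep[0])[1]

quorum_write(1, "v1")
# Every read quorum overlaps the write quorum, so reads never miss "v1":
assert all(quorum_read() == "v1" for _ in range(100))
```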


SECTION 5 — CONSENSUS PROTOCOLS

Consensus = ensuring nodes agree on the same state.

Two major families:


1. Paxos

Complex, academic, hard to implement.

Used by:

  • Google Chubby Lock Service

  • Spanner (Paxos plus TrueTime, which combines GPS and atomic clocks)


2. Raft (Modern Standard)

Much easier to understand and implement.

Used by:

  • etcd (Kubernetes)

  • Consul

  • Nomad

  • CockroachDB

  • TiDB

Raft provides:

  • leader election

  • log replication

  • safety guarantees

  • cluster membership

This is the foundation of modern distributed system control planes.


SECTION 6 — LEADER ELECTION

Used when you need one authoritative node.

Systems use:

  • Raft

  • Zookeeper

  • etcd

Leader responsibilities:

  • assign work

  • manage locks

  • maintain metadata

  • coordinate tasks

If a leader dies → new election.
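
The core idea behind etcd- and Zookeeper-style elections is a lease: one node holds leadership for a bounded time and must keep renewing it. A minimal sketch, with the coordination store reduced to an in-memory compare-and-swap (all names illustrative):

```python
# Sketch: lease-based leader election. The lease is free if unclaimed or
# expired; the first claimant wins until the lease runs out.
import time

class LeaseStore:
    def __init__(self):
        self.leader = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, ttl, now=None):
        now = time.monotonic() if now is None else now
        if self.leader is None or now >= self.expires_at:
            # Lease is free: claim it. (A real system would also let the
            # current holder renew before expiry.)
            self.leader, self.expires_at = node_id, now + ttl
        return self.leader == node_id

store = LeaseStore()
assert store.try_acquire("node-a", ttl=5, now=0) is True   # a is leader
assert store.try_acquire("node-b", ttl=5, now=1) is False  # lease held
assert store.try_acquire("node-b", ttl=5, now=6) is True   # a's lease expired
```

If node-a dies, it simply stops renewing; node-b wins the next acquire attempt after expiry, which is the "new election" step above.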


SECTION 7 — DISTRIBUTED LOCKS

Used when multiple nodes must not mutate shared state simultaneously.


Redis-based Locking (Redlock)

The Redlock algorithm, proposed by Redis's creator:

  • acquire locks across multiple Redis nodes

  • ensure majority success

  • expire locks automatically

Reportedly used by:

  • Stripe

  • Shopify

  • Airbnb
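
The heart of Redlock is the majority rule: the lock is only held if more than half the independent Redis nodes granted it. A minimal in-memory sketch (real Redlock also subtracts elapsed acquisition time from the TTL and releases partial acquisitions on failure):

```python
# Sketch: Redlock's majority-acquire step, with dicts standing in for
# independent Redis nodes. Each node maps resource -> (token, expiry).
def try_lock(nodes, resource, token, ttl, now):
    acquired = 0
    for node in nodes:
        held = node.get(resource)
        if held is None or held[1] <= now:     # free, or previous lock expired
            node[resource] = (token, now + ttl)
            acquired += 1
    # The lock counts as held only with a majority of grants.
    return acquired > len(nodes) // 2

nodes = [{}, {}, {}]
assert try_lock(nodes, "order-42", "client-a", ttl=10, now=0) is True
assert try_lock(nodes, "order-42", "client-b", ttl=10, now=1) is False
assert try_lock(nodes, "order-42", "client-b", ttl=10, now=11) is True
```

The automatic expiry is what prevents a crashed client from holding the lock forever; the token is what a real implementation checks before releasing.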


Database Advisory Locks

Example:

SELECT pg_advisory_lock(12345);

-- ... critical section ...

SELECT pg_advisory_unlock(12345);

Used when you want:

  • cross-application mutual exclusion

  • safe critical sections


Zookeeper / etcd Locks

Stronger guarantees than Redis-based locks (Redlock's safety under clock skew and long process pauses is debated).

Used in:

  • Kafka partition leadership

  • Kubernetes scheduling


SECTION 8 — FAULT TOLERANCE PATTERNS

Staff-level engineers design systems to fail gracefully, not perfectly.


Pattern 1 — Retries

Every distributed call WILL fail transiently, so retries must be designed in, and only for idempotent operations.

Rules:

  • retry with exponential backoff

  • jitter to avoid thundering herd

  • cap retry attempts

  • make retried operations idempotent (e.g. via request IDs)
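
The first two rules together give "capped exponential backoff with full jitter": the retry window doubles each attempt up to a ceiling, and the actual delay is drawn randomly from that window so clients don't retry in lockstep. A sketch (delays are computed, not slept, so it runs instantly; parameter names are illustrative):

```python
# Sketch: capped exponential backoff with full jitter.
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5):
    """Yield one randomized delay per retry attempt."""
    for attempt in range(attempts):
        exp = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield random.uniform(0, exp)           # full jitter spreads clients out

delays = list(backoff_delays())
assert len(delays) == 5                  # retry attempts are capped
assert all(0 <= d <= 10.0 for d in delays)
# Windows grow: 0.1, 0.2, 0.4, 0.8, 1.6 (until the 10s ceiling).
```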


Pattern 2 — Circuit Breakers

If a dependency is failing → STOP sending requests.

Circuit States:

Closed → Open → Half-Open → Closed

This prevents:

  • cascading failures

  • retry storms
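
The state cycle above can be sketched as a small class (thresholds and timings are illustrative; production breakers like resilience4j also track rolling error rates and limit half-open probes):

```python
# Sketch: a minimal circuit breaker with Closed -> Open -> Half-Open states.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown
        self.opened_at = None
        self.state = "closed"

    def allow_request(self, now):
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half-open"  # let one probe request through
                return True
            return False                  # fail fast, protect the dependency
        return True

    def record(self, success, now):
        if success:
            self.failures, self.state = 0, "closed"
        else:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state, self.opened_at = "open", now

cb = CircuitBreaker()
for t in range(3):
    cb.record(success=False, now=t)       # three failures trip the breaker
assert cb.state == "open" and cb.allow_request(now=5) is False
assert cb.allow_request(now=40) is True   # half-open probe after cooldown
cb.record(success=True, now=41)
assert cb.state == "closed"               # probe succeeded, traffic resumes
```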


Pattern 3 — Bulkheads

Partition system resources so that:

  • failure in one service

  • does NOT affect others

Inspired by ship design.


Pattern 4 — Timeouts Everywhere

Never block indefinitely.

Every request MUST have:

  • timeout

  • cancellation logic

  • retry policy
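
One stdlib-only way to bound a blocking call (a sketch: a real service would also propagate cancellation to the downstream dependency rather than just abandoning the call):

```python
# Sketch: enforcing a timeout on a blocking call with concurrent.futures.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def call_with_timeout(fn, timeout):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            future.cancel()   # best-effort; a running call keeps running
            return None       # caller falls back or retries per policy

assert call_with_timeout(lambda: "ok", timeout=1.0) == "ok"
assert call_with_timeout(lambda: time.sleep(0.5), timeout=0.05) is None
```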


Pattern 5 — Load Shedding

Drop low-priority traffic during overload.

Better to fail some requests than ALL requests.


Pattern 6 — Dead Letter Queues (DLQs)

Any message that fails repeatedly → goes to DLQ.

Used for:

  • payment failures

  • webhook retries

  • email bounce workflows
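
The retry-then-dead-letter flow can be sketched with in-memory queues (names like MAX_ATTEMPTS are illustrative; managed queues like SQS implement the same policy via a redrive configuration):

```python
# Sketch: a consumer that requeues failures, then parks them in a DLQ.
from collections import deque

MAX_ATTEMPTS = 3
main_queue, dead_letter_queue = deque(), deque()

def consume(message, handler):
    message["attempts"] = message.get("attempts", 0) + 1
    try:
        handler(message)
    except Exception:
        if message["attempts"] >= MAX_ATTEMPTS:
            dead_letter_queue.append(message)  # park for human inspection
        else:
            main_queue.append(message)         # requeue for another try

def always_fails(msg):
    raise RuntimeError("downstream unavailable")

main_queue.append({"id": "evt-1"})
while main_queue:
    consume(main_queue.popleft(), always_fails)

assert len(dead_letter_queue) == 1
assert dead_letter_queue[0]["attempts"] == 3   # gave up after the cap
```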


SECTION 9 — QUEUES VS STREAMS VS PUB/SUB

This is a common interview question — and foundational knowledge.


Queues

  • point-to-point

  • message consumed once

  • ordering guarantees optional

  • retries + DLQ common

Examples:

  • SQS

  • RabbitMQ

Use when:

  • workers process jobs

  • order doesn’t matter much

  • throughput is moderate/high


Streams

  • append-only logs

  • multiple consumers reading independently

  • events are replayable

  • partitioned for scaling

Examples:

  • Kafka

  • Kinesis

Use for:

  • analytics

  • event sourcing

  • real-time dashboards

  • audit logs


Pub/Sub

  • fan-out

  • broadcast model

Examples:

  • Redis pub/sub

  • Google Pub/Sub

Use for:

  • notifications

  • real-time updates

  • UI subscriptions


SECTION 10 — CRDTs (Conflict-Free Replicated Data Types)

This is a rarely understood but extremely powerful concept.

CRDTs allow:

  • multi-region writes

  • offline-first apps

  • conflict-free merges

  • distributed collaboration

Used in:

  • Figma

  • Apple Notes

  • Notion

  • Multiplayer apps

(Google Docs is often cited here, but it uses operational transformation, a server-coordinated alternative to CRDTs.)

Types:

  • G-Counter

  • PN-Counter

  • OR-Set

  • LWW-Register
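
The G-Counter is the simplest of these and shows the core trick: each node increments only its own slot, and merge takes the element-wise max, so merges are idempotent, commutative, and associative, and replicas converge no matter the delivery order. A minimal sketch:

```python
# Sketch: a G-Counter (grow-only counter) CRDT.
class GCounter:
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.counts = [0] * n_nodes   # one slot per node

    def increment(self):
        self.counts[self.node_id] += 1   # only touch our own slot

    def merge(self, other):
        # Element-wise max: idempotent, commutative, associative.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    @property
    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two writes on node A
b.increment()                  # one concurrent write on node B
a.merge(b); b.merge(a)         # replicas exchange state in any order
assert a.value == b.value == 3
```

A PN-Counter is just two G-Counters (increments and decrements); the other types layer similar merge rules over sets and registers.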

CRDTs guarantee:

Distributed updates can merge without coordination.

This eliminates the need for:

  • per-write locks

  • per-write consensus

  • manual conflict resolution (merging is built into the data type)


SECTION 11 — REAL-TIME SYSTEMS (WEBRTC, SOCKETS, PUB/SUB)

Real-time systems need:

  • low latency

  • quick state synchronization

  • message ordering guarantees

  • presence tracking

Technologies:

  • WebSockets

  • WebRTC

  • QUIC

  • MQTT

Patterns:

  • fan-out messaging

  • presence heartbeat

  • status synchronization

  • TURN servers for NAT traversal

This directly applies to your Veterinary Call Flow work:

  • presence system

  • reconnect behavior

  • heartbeat signals

  • call lifecycle state machine


SECTION 12 — EVENTUAL CONSISTENCY & COMPENSATING TRANSACTIONS

In distributed systems, you CANNOT cheaply get a global ACID transaction: two-phase commit exists, but it blocks on coordinator failure and hurts throughput.

Instead, you use:

  • sagas

  • compensating transactions

  • orchestration vs choreography

Example:

Booking workflow:

  1. Reserve seat

  2. Charge payment

  3. Confirm reservation

If payment fails → cancel seat reservation.

This is a compensating action.
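
The orchestrated version of this workflow can be sketched as a list of (action, compensation) pairs, where a failure undoes completed steps in reverse order. All function names are illustrative:

```python
# Sketch: an orchestrated saga with compensating transactions.
def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)   # remember how to undo this step
        except Exception:
            for undo in reversed(done):
                undo()                # compensate in reverse order
            return False
    return True

log = []

def reserve_seat():   log.append("seat reserved")
def release_seat():   log.append("seat released")
def charge_payment(): raise RuntimeError("card declined")  # simulated failure
def refund_payment(): log.append("payment refunded")

assert run_saga([(reserve_seat, release_seat),
                 (charge_payment, refund_payment)]) is False
assert log == ["seat reserved", "seat released"]
```

Note that only completed steps are compensated: the failed payment never succeeded, so there is nothing to refund, but the seat reservation must be rolled back.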


SECTION 13 — HOW DISTRIBUTED SYSTEMS FAIL (THE REALITY)

Everything fails eventually.

Failures include:

  • partial outages

  • network partitions

  • slow queries

  • stale replicas

  • leader failure

  • GC pauses

  • out-of-order messages

  • duplicate messages

Your job is NOT to prevent failure.

Your job is to make failure harmless.


SECTION 14 — HOW TOP ENGINEERS DEBUG DISTRIBUTED SYSTEMS

They don’t guess — they follow signals.

They look at:

  • traces

  • logs across services

  • message IDs across queues

  • correlation IDs

  • metrics spikes

  • replica lag

  • retry storms

  • DLQ volume

They ask:

  • Where is state inconsistent?

  • Which service dropped messages?

  • Which dependency is slowing everything down?

  • Is a retry storm happening?

  • Is a single shard overloaded?

This is expert-level debugging.