
LEARNING FROM THE BATTLEFIELD

Theory teaches you principles.

Experience teaches you reality.

But you don’t have time to make every mistake yourself.

This section contains real-world case studies from production systems at scale—the successes, failures, and hard-won lessons that separate textbook knowledge from battle-tested wisdom.

These are the stories that made engineers better.


SECTION 1 — THE GREAT OUTAGES

CASE STUDY 1: AWS S3 Outage (2017)

The Incident

Date: February 28, 2017
Duration: 4 hours
Impact: Major internet disruption

What Happened

An engineer was debugging the S3 billing system and ran a command to remove a small number of servers. A typo caused the command to remove a much larger set of servers, including two critical subsystems.

# Intended command (illustrative): remove a few servers
$ remove-servers --count=5 --subsystem=billing

# Actual command (illustrative): the typo removed a much larger set
$ remove-servers --count=50 --subsystem=s3-critical

The Cascading Failure

S3 servers removed

Index subsystem goes down

S3 can't locate objects

All S3 GET requests fail

Thousands of websites down

AWS status page also uses S3 → status page down

Unable to communicate status to customers

The Impact

  • Internet slowed globally

  • Slack, Trello, Medium, Quora all down

  • IoT devices stopped working

  • Smart home lights went offline

  • Estimated cost: $150 million

What Went Wrong

  1. Single command had too much power

  2. No confirmation for dangerous operations

  3. No gradual rollout (should remove 1, test, then continue)

  4. Insufficient blast radius limitation

  5. Status page had same dependency

What AWS Fixed

  1. Added confirmation prompts for dangerous operations

  2. Implemented rate limiting on server removal

  3. Created independent status page infrastructure

  4. Added “dry-run” mode for all operational commands

  5. Improved monitoring for subsystem health

  6. Implemented gradual recovery process

The Lesson for You

Design operations so that mistakes are reversible, gradual, and limited in scope.

Apply this:
- Add confirmation for destructive actions
- Implement dry-run modes
- Use gradual rollouts (canary deployments)
- Separate critical dependencies
- Build redundancy into communication systems
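Two of those safeguards, confirmation prompts and a dry-run default, fit in a few lines. A minimal sketch (the `remove-servers` style tool, its flags, and the helper names are hypothetical, echoing the illustrative commands above):

```python
import argparse

def remove_servers(names, dry_run=True):
    """Remove servers, defaulting to a dry run. (Hypothetical helper.)"""
    if dry_run:
        return [f"DRY RUN: would remove {n}" for n in names]
    return [f"removed {n}" for n in names]

def main(argv, confirm=input):
    parser = argparse.ArgumentParser()
    parser.add_argument("names", nargs="+")
    parser.add_argument("--execute", action="store_true",
                        help="actually remove; default is a dry run")
    args = parser.parse_args(argv)

    if args.execute:
        # Destructive path: force an explicit, typed confirmation
        expected = f"remove {len(args.names)} servers"
        if confirm(f"Type '{expected}' to proceed: ") != expected:
            return ["aborted"]
    return remove_servers(args.names, dry_run=not args.execute)
```

The safe behavior is the default: you must opt in to destruction with `--execute`, and even then a typed confirmation stands between you and the blast radius.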


CASE STUDY 2: GitHub’s 24-Hour Degradation (2018)

The Incident

Date: October 21, 2018
Duration: 24 hours (degraded service)
Impact: GitHub degraded or unavailable worldwide

What Happened

A network partition between the US East and US West data centers lasted just 43 seconds, yet each side concluded the other was down and elected itself primary.

The Split Brain Problem

Normal State:
US-East (Primary) ←→ US-West (Replica)

Data replicates East → West

Network Partition (43 seconds):
US-East (Primary)   ✗   US-West (now also Primary)
        ↓                        ↓
   Writes to DB             Writes to DB
        ↓                        ↓
      Databases diverge (split-brain)

Network Restored:
US-East ←→ US-West
         ↓
Which is the source of truth?
Data conflict!

The Recovery Challenge

GitHub couldn’t simply pick one database—both had valid writes from different users during the partition.

Recovery Process:
1. Put site in maintenance mode (10 minutes after incident)
2. Stop all writes to prevent more data divergence
3. Restore from backup before partition
4. Replay write logs from BOTH data centers
5. Resolve conflicts manually for overlapping writes
6. Verify data integrity
7. Bring systems back online

Time to recovery: 24 hours

What Went Wrong

  1. Inadequate split-brain detection

  2. Both data centers assumed they were correct

  3. No automatic conflict resolution

  4. Recovery was manual and slow

  5. Monitoring didn’t catch partition quickly enough

What GitHub Fixed

  1. Improved network partition detection

  2. Implemented Raft consensus for leader election

  3. Added automatic conflict resolution

  4. Better monitoring for cross-datacenter health

  5. Created runbooks for split-brain scenarios

  6. Regular disaster recovery drills

The Lesson for You

In distributed systems, network partitions WILL happen. Design for it.

Apply this:
- Implement proper consensus algorithms (Raft, Paxos)
- Never assume network is reliable
- Design for split-brain scenarios
- Have automated recovery procedures
- Regular disaster recovery testing
- Read “Designing Data-Intensive Applications”
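One common defense against exactly this failure is a fencing token: each newly elected primary receives a strictly increasing epoch number, and the storage layer rejects writes stamped with a stale one. A toy sketch of the idea (not GitHub's actual mechanism):

```python
class FencedStore:
    """Rejects writes from a leader whose epoch has been superseded."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def grant_leadership(self):
        # Called by the election mechanism; hands out a fresh token
        self.highest_epoch += 1
        return self.highest_epoch

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            raise PermissionError(f"stale epoch {epoch}; a newer leader exists")
        self.data[key] = value

store = FencedStore()
east = store.grant_leadership()   # epoch 1 before the partition
west = store.grant_leadership()   # partition heals, new election: epoch 2

store.write(west, "repo", "v2")   # accepted
# store.write(east, "repo", "v1") would raise PermissionError: the old
# primary physically cannot diverge the data after losing leadership
```

With the token enforced at the storage layer, a deposed primary's writes bounce instead of silently diverging the two databases.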


CASE STUDY 3: Knight Capital’s $440 Million Loss (2012)

The Incident

Date: August 1, 2012
Duration: 45 minutes
Impact: $440 million loss, company bankruptcy

What Happened

Knight Capital deployed new trading software. Due to a deployment error, old code was accidentally reactivated on one server.

The Catastrophic Sequence

07:00 - Deploy new trading software

09:30 - Markets open

One server has OLD code activated

Old code has a "Power Peg" test feature

Feature sends 97 orders per second

Orders execute at market price

Knight buys high, sells low, repeatedly

Engineers scramble to find the cause

10:15 - Rogue server finally stopped

$440 million lost in 45 minutes

The Technical Failure

The old code:

import time

# Old test feature accidentally reactivated (simplified illustration)
def power_peg_test():
    while True:
        place_market_order(random_stock())
        time.sleep(0.01)  # plus overhead, ~97 orders/second

Why it happened:
1. Deployment script skipped one server
2. Old code had a test flag still active
3. No verification that all servers deployed correctly
4. No kill switch for runaway trading
5. No position limits enforced in code

What Went Wrong

  1. Incomplete deployment (1 of 8 servers missed)

  2. No deployment verification

  3. Dead code not removed (power peg feature unused but still in codebase)

  4. No circuit breaker for abnormal trading

  5. No real-time monitoring of order volumes

  6. No automated position limits

What the Industry Learned

  1. Deployment verification mandatory

    deploy.sh --verify-all-servers
  2. Remove dead code aggressively

    • If it’s not used, delete it

    • Don’t leave “disabled” features

  3. Implement circuit breakers

    if orders_per_second > THRESHOLD:
        emergency_stop()
        alert_engineers()
  4. Position limits in code

    if total_position > MAX_POSITION:
        reject_order()
  5. Real-time monitoring dashboards

    • Order volume per second

    • Position sizes

    • P&L tracking

The Lesson for You

Deploy safely. Monitor actively. Fail fast.

Apply this:
- Verify deployments completed on ALL servers
- Remove dead code (it WILL hurt you)
- Implement circuit breakers for dangerous operations
- Add limits in code, not just config
- Monitor critical metrics in real-time
- Have kill switches for automated systems
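The missing safeguard, verifying that every server actually runs the intended build, can be a one-line fleet check. A sketch under the assumption that each host can report its deployed version (the host names and the `get_version` hook are illustrative):

```python
def verify_deployment(hosts, expected_version, get_version):
    """Return hosts still running the wrong build; empty list means success."""
    return [h for h in hosts if get_version(h) != expected_version]

# Simulated fleet: one of eight servers missed, as in the incident
versions = {f"srv{i}": "v2.0" for i in range(1, 8)}
versions["srv8"] = "v1.9"  # the server the deploy script skipped

stale = verify_deployment(versions, "v2.0", versions.get)
assert stale == ["srv8"]  # do not go live until this list is empty
```

Had a check like this gated the go-live, the one stale server would have been caught before markets opened.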


CASE STUDY 4: Cloudflare’s 27-Minute Global Outage (2019)

The Incident

Date: July 2, 2019
Duration: 27 minutes
Impact: ~50% of HTTP requests through Cloudflare failed

What Happened

A pathological regular expression in a new WAF rule drove CPU to 100% on Cloudflare’s servers worldwide, failing roughly half of the HTTP requests that pass through Cloudflare.

The Deadly Regular Expression

# The problematic regex (simplified)
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Why it was deadly:
- Catastrophic backtracking on certain inputs
- Input: x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x!
- Regex engine tries billions of combinations
- Single request takes seconds of CPU time
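You can reproduce catastrophic backtracking with a far smaller pattern. Against Python's backtracking `re` engine, `(a|a)+$` tries exponentially many branch combinations before a failing match gives up, so adding a handful of characters multiplies the runtime enormously:

```python
import re
import time

def match_time(n):
    """Time a failing match of (a|a)+$ against n a's plus a bad tail."""
    start = time.perf_counter()
    result = re.match(r"(a|a)+$", "a" * n + "b")
    return result, time.perf_counter() - start

# Each extra 'a' roughly doubles the work the backtracking engine does
result_short, t_short = match_time(15)
result_long, t_long = match_time(21)

assert result_short is None and result_long is None  # the match always fails
assert t_long > t_short  # ~64x more backtracking for just 6 more characters
```

This is exactly why a linear-time engine like RE2 matters: it refuses the exponential search entirely, so input length can never blow up CPU time.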

The Cascading Failure

New WAF rule deployed globally

User sends request matching bad regex

Server CPU spikes to 100%

All requests on that server slow down

Same pattern hits other servers

Global CPU at 100%

Requests timeout

50% of global traffic fails

What Went Wrong

  1. No regex performance testing

  2. Global deployment (no canary)

  3. Insufficient CPU monitoring during deployment

  4. No automatic rollback on performance degradation

  5. Regex complexity not reviewed

What Cloudflare Fixed

  1. Regex performance testing in CI/CD

    test('regex performance', () => {
      const start = Date.now();
      const result = regex.test(evilInput);
      const duration = Date.now() - start;
      expect(duration).toBeLessThan(100); // 100ms max
    });
  2. Canary deployments for rule changes

    • Deploy to 1% of servers

    • Monitor CPU, latency, error rates

    • Gradual rollout if healthy

  3. Automatic rollback triggers

    if (cpu_usage > 90% for 60 seconds) {
      rollback()
    }
  4. CPU budgets for regex

    • Kill regex execution after 50ms

    • Return error rather than hang

  5. RE2 library (guaranteed linear time)

    • No catastrophic backtracking

    • Predictable performance

The Lesson for You

Test performance, not just functionality. Deploy gradually.

Apply this:
- Performance test regex patterns
- Use safe regex libraries (re2)
- Implement timeouts for expensive operations
- Deploy gradually (canary deployments)
- Monitor resource usage during deployments
- Automatic rollback on anomalies
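The rollback trigger sketched above can be written as a small monitor loop: sample CPU at a fixed interval, track how long it has stayed above the threshold, and fire once a full window is hot. A sketch with injected sample data (real deployments would read live metrics):

```python
def watch_canary(cpu_samples, threshold=90, window=60, interval=10):
    """Return the sample index at which rollback fires, or None.

    cpu_samples: CPU readings taken every `interval` seconds.
    Rollback fires once CPU stays above `threshold` for `window` seconds.
    """
    seconds_hot = 0
    for i, cpu in enumerate(cpu_samples):
        # A single cool sample resets the counter: only SUSTAINED load fires
        seconds_hot = seconds_hot + interval if cpu > threshold else 0
        if seconds_hot >= window:
            return i  # trigger rollback here
    return None

# Healthy canary: brief spikes, never sustained -> no rollback
assert watch_canary([50, 95, 60, 95, 55, 70]) is None
# Bad deploy: CPU pinned -> rollback after six hot samples (60 seconds)
assert watch_canary([97, 98, 99, 99, 98, 97, 99]) == 5
```

The reset-on-cool-sample detail is what separates "automatic rollback on anomalies" from rolling back on every transient spike.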


CASE STUDY 5: Slack’s Database Migration (2023)

The Challenge

Migrate from sharded MySQL to Vitess without downtime for millions of users.

The Stakes

  • 100+ million users

  • Zero downtime tolerance

  • Petabytes of data

  • Tens of thousands of queries per second

The Strategy: Dual Writes

Phase 1: Preparation (3 months)
├── Build Vitess cluster
├── Set up replication
└── Test tooling

Phase 2: Shadow Traffic (2 months)
├── Mirror reads to Vitess
├── Compare results
└── Fix discrepancies

Phase 3: Dual Writes (4 months)
├── Write to BOTH MySQL and Vitess
├── Read from MySQL (source of truth)
├── Compare data continuously
└── Fix inconsistencies

Phase 4: Read Cutover (1 month)
├── Shift reads to Vitess gradually
├── 1% → 10% → 50% → 100%
├── Monitor latency and errors
└── Rollback capability maintained

Phase 5: Write Cutover (2 weeks)
├── Make Vitess source of truth
├── Stop dual writes
└── Decommission old MySQL

Total time: 12 months

The Technical Implementation

// Dual write with verification
func WriteUser(user User) error {
    // Write to new system (Vitess)
    errNew := vitess.Write(user)

    // Write to old system (MySQL)
    errOld := mysql.Write(user)

    // Both must succeed
    if errNew != nil || errOld != nil {
        return fmt.Errorf("dual write failed")
    }

    // Async verification
    go verifyConsistency(user.ID)

    return nil
}

// Gradual read migration
func ReadUser(userID string) (User, error) {
    // Traffic splitting
    if shouldReadFromVitess(userID) {
        user, err := vitess.Read(userID)
        if err != nil {
            // Fall back to MySQL
            return mysql.Read(userID)
        }
        return user, nil
    }

    return mysql.Read(userID)
}

// Percentage-based rollout
func shouldReadFromVitess(userID string) bool {
    h := hash(userID)
    return (h % 100) < currentRolloutPercentage
}

What Went Right

  1. Gradual migration (not big bang)

  2. Continuous verification of data consistency

  3. Multiple rollback points

  4. Extensive monitoring at each phase

  5. Dark launch (shadow traffic before real traffic)

  6. Clear success criteria for each phase

The Lesson for You

Large migrations require patience, verification, and rollback plans.

Apply this:
- Never do big-bang migrations
- Implement dual writes for safety
- Verify data consistency continuously
- Use percentage-based rollouts
- Maintain rollback capability throughout
- Have clear success metrics for each phase
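The consistency check that the Go code hands off to `verifyConsistency` can be sketched as a record-by-record comparison between the two stores, with plain dicts standing in for MySQL and Vitess:

```python
def verify_consistency(old_store, new_store, keys):
    """Compare records across two stores; return keys whose copies diverge."""
    mismatches = []
    for key in keys:
        # .get() so a row missing from either store also counts as divergence
        if old_store.get(key) != new_store.get(key):
            mismatches.append(key)
    return mismatches

mysql = {"u1": {"name": "Ada"}, "u2": {"name": "Lin"}}
vitess = {"u1": {"name": "Ada"}, "u2": {"name": "Linus"}}  # drifted copy

assert verify_consistency(mysql, vitess, ["u1", "u2"]) == ["u2"]
```

In production this runs continuously over sampled keys, and a non-empty mismatch list blocks the cutover rather than merely logging.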


SECTION 2 — SCALING VICTORIES

CASE STUDY 6: Discord’s Elixir Migration

The Problem

Discord hit 5 million concurrent users and its Rails backend couldn’t keep up.

Symptoms:
- Message latency: 200-500ms
- Server CPU: 80-90%
- Growing infrastructure costs
- Degraded user experience

The Decision

Migrate from Ruby on Rails to Elixir/Phoenix.

Why Elixir?
- Built on Erlang VM (designed for concurrency)
- Actor model (isolated processes)
- Fault tolerance (let it crash philosophy)
- Soft real-time guarantees

The Results

Before (Rails):
- 5 million concurrent users
- 40+ servers
- 200-500ms latency
- Frequent outages

After (Elixir):
- 50+ million concurrent users (10x)
- 12 servers (70% reduction)
- 10-50ms latency (90% improvement)
- Near-zero outages

Cost savings: ~$1 million/year

The Implementation Strategy

Phase 1: Proof of Concept
├── Build message delivery in Elixir
├── Run parallel with Rails
└── Compare performance

Phase 2: Critical Path Migration
├── Migrate message sending
├── Migrate presence system
└── Keep everything else in Rails

Phase 3: Gradual Feature Migration
├── Voice/video calling
├── User profiles
├── Guild management
└── Analytics

Phase 4: Complete Migration
└── Decommission Rails entirely

Key Technical Insights

The Elixir Advantage:

# Each user connection is an isolated process
defmodule UserConnection do
  use GenServer

  def handle_info({:message, msg}, state) do
    # Process runs independently
    # Crash doesn't affect others
    # Automatic restart by supervisor
    send_to_client(msg)
    {:noreply, state}
  end
end

# Supervisor ensures fault tolerance
children = [
  {DynamicSupervisor, strategy: :one_for_one, name: UserSupervisor}
]

# Millions of processes, each isolated
# If one crashes, others unaffected

The Lesson for You

Choose technology that matches your scale requirements, not what’s popular.

Apply this:
- Understand your bottlenecks before choosing solutions
- Consider concurrency model for real-time systems
- Run proof of concept before full migration
- Migrate critical path first, iterate on the rest


CASE STUDY 7: Instagram’s Python at Scale

The Challenge

“Python doesn’t scale” — so how does Instagram serve 2+ billion users with Python?

The Truth

Instagram runs one of the largest Python deployments in the world.

Scale:
- 2+ billion users
- 100+ million photos per day
- Thousands of servers
- Python everywhere (Django)

How They Made Python Scale

1. Master the Bottlenecks (It’s Not Python)

# Bottleneck is rarely Python itself
# Usually:
# - Database queries (fixed with caching)
# - Network I/O (fixed with async)
# - Algorithmic complexity (fixed with better algorithms)

# Example: Feed generation
# Bad: O(n²) algorithm, every post checked against every user
for post in posts:
    for user in all_users:
        if should_show(user, post):
            feeds[user].append(post)

# Good: O(n) with indexing, only the author's followers are touched
for post in posts:
    followers = follower_index[post.author]
    for follower in followers:
        feeds[follower].append(post)

2. Aggressive Caching

# Cache layers
┌─────────────────────────┐
│ Client Cache (60s)      │
├─────────────────────────┤
│ CDN (300s)              │
├─────────────────────────┤
│ Memcached (3600s)       │
├─────────────────────────┤
│ Database                │
└─────────────────────────┘

# 99% of requests never hit database

3. Asynchronous Processing

# Don't do expensive work in request
# Queue it

@app.route('/post/photo')
def post_photo(request):
    photo = request.files['photo']

    # Quick: Save to S3, return immediately
    photo_url = upload_to_s3(photo)
    photo_id = db.create_photo(photo_url)

    # Slow: Queue for processing
    queue.publish('photo.process', {
        'photo_id': photo_id,
        'tasks': ['resize', 'filter', 'face_detection']
    })

    return {'photo_id': photo_id}  # Fast response

# Worker handles slow tasks
@worker.subscribe('photo.process')
def process_photo(message):
    resize_photo(message['photo_id'])
    apply_filters(message['photo_id'])
    detect_faces(message['photo_id'])

4. Django Optimization

# Use select_related to avoid N+1 queries
# Bad: 1 + N queries
posts = Post.objects.all()
for post in posts:
    print(post.author.username)  # New query each time!

# Good: 1 query with JOIN
posts = Post.objects.select_related('author').all()
for post in posts:
    print(post.author.username)  # No extra query

# Use prefetch_related for many-to-many
posts = Post.objects.prefetch_related('likes').all()

5. Profiling Everything

import cProfile

# Profile slow endpoints
@app.route('/feed')
@profile_this
def get_feed(request):
    # Instagram profiles every endpoint
    # Identifies bottlenecks
    # Optimizes aggressively
    pass

The Results

  • Serve 2+ billion users with Python

  • Latency: p99 < 500ms

  • Cost-effective operation

  • Developer velocity remains high

The Lesson for You

Language choice matters less than architecture, caching, and algorithm quality.

Apply this:
- Profile before optimizing
- Cache aggressively
- Use async for I/O
- Optimize database queries
- Don’t blame the language, fix the architecture


CASE STUDY 8: Shopify’s Black Friday/Cyber Monday

The Challenge

Handle 10,000+ requests per second during peak shopping days.

Peak Load Stats (2023)

  • 76 million shoppers

  • $9.3 billion in sales

  • Peak: 11,000 requests/second

  • Zero downtime

How They Prepared

1. Load Testing (3 months before)

# Simulate peak load
# 3x expected traffic
# Identify breaking points

LoadTest.configure do |config|
  config.target_rps = 30_000  # 3x expected peak
  config.duration = '1 hour'
  config.ramp_up = '5 minutes'
end

# Run weekly, fix bottlenecks

2. Database Optimization

# Sharding strategy
# Each merchant on specific shard
# Prevents single DB from overloading

class Merchant < ApplicationRecord
  shard_key :id

  def shard_id
    # 1000 shards
    id % 1000
  end
end

# Read replicas
# Scale reads horizontally
primary = Database.primary
replicas = Database.replicas  # 10+ per shard

# Route reads to replicas
def find_product(id)
  replicas.sample.query("SELECT * FROM products WHERE id = ?", id)
end

3. Caching Everything

# Cache hit rate: 99%+
# Layers:
# 1. Browser cache (static assets)
# 2. CDN (Fastly)
# 3. Application cache (Redis)
# 4. Database query cache

# Example: Product page
def product_page(product_id)
  cache_key = "product:#{product_id}:v2"

  Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
    Product.find(product_id).to_json
  end
end

4. Queue Everything Non-Critical

# Request time budget: 200ms
# Anything slower → queue

def create_order(params)
  # Fast: Create order record
  order = Order.create!(params) # 50ms

  # Slow: Queue everything else
  OrderProcessingJob.perform_later(order.id)
  EmailJob.perform_later(order.id)
  AnalyticsJob.perform_later(order.id)

  # Return quickly
  { order_id: order.id } # Total: 50ms
end

5. Graceful Degradation

# If system overloaded, disable non-critical features

def feature_enabled?(feature)
  # Normal load: everything stays on
  return true unless system_load > 80

  # Overloaded: shed non-critical features
  case feature
  when :recommendations
    false # Turn off recommendations under load
  when :reviews
    false # Turn off reviews under load
  when :checkout
    true # Always keep checkout working
  end
end

6. Real-Time Monitoring

  • Engineers watching dashboards live

  • Automated scaling (if load > 70%, add servers)

  • Kill switches for non-essential features

  • On-call team ready to respond

The Results

Zero downtime during peak shopping period.

The Lesson for You

Peak load requires months of preparation, aggressive caching, and graceful degradation.

Apply this:
- Load test at 3x expected traffic
- Cache everything possible
- Queue non-critical operations
- Plan for graceful degradation
- Monitor in real-time during peak
- Have kill switches ready


SECTION 3 — ARCHITECTURAL TRANSFORMATIONS

CASE STUDY 9: Netflix’s Microservices Journey

The Evolution

2008: Monolith on Physical Servers
- Single large application
- Takes 16 minutes to start
- Deploys once every 2 weeks
- Outages take entire site down

2012: Migration to AWS + Microservices
- 800+ microservices
- Deploy 1000s of times per day
- Independent scaling
- Fault isolation

2024: 1000+ Microservices
- Extreme resilience
- Global scale
- Chaos engineering
- 200+ million subscribers

Why They Did It

Monolith Problems:

┌─────────────────────────────────────┐
│           SINGLE MONOLITH           │
│  ┌──────────────────────────────┐   │
│  │ Recommendations              │   │
│  │ Streaming                    │   │
│  │ User Management              │   │
│  │ Billing                      │   │
│  │ Content Management           │   │
│  └──────────────────────────────┘   │
└─────────────────────────────────────┘

Problems:
- Deploy ALL or nothing
- Bug in one area breaks everything
- Can't scale parts independently
- Slow development (team conflicts)

Microservices Benefits:

┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│Recommendations│   │   Streaming   │   │    Billing    │
└───────────────┘   └───────────────┘   └───────────────┘
        ↓                   ↓                   ↓
      Deploy              Deploy              Deploy
  independently       independently       independently
        ↓                   ↓                   ↓
      Scale               Scale               Scale
  independently       independently       independently

The Migration Strategy

Phase 1: API Gateway

Extract API layer from monolith
Route traffic through gateway
Gradually move endpoints to services

Phase 2: Extract Services One-by-One

Priority order:
1. User authentication (most traffic)
2. Recommendations (frequently changing)
3. Billing (critical, rarely changes)
4. Content management

Phase 3: Data Migration

Each service gets its own database
No shared database between services
Use events for data sync

Key Patterns They Developed

1. Circuit Breaker Pattern

@HystrixCommand(fallbackMethod = "defaultRecommendations")
public List<Movie> getRecommendations(User user) {
    // If the recommendation service is down, Hystrix calls
    // defaultRecommendations instead. The failure doesn't cascade.
    return recommendationClient.getFor(user); // illustrative call
}

2. Chaos Engineering

Randomly kill services in production
Test: Does system survive?
Result: Build resilient systems
Tool: Chaos Monkey

3. Resilience Library (Hystrix)

Every service call goes through:
- Circuit breaker
- Retry logic
- Timeout enforcement
- Metrics collection
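Those per-call protections can be approximated in a few lines. A toy circuit breaker with retries in the Hystrix style (simplified, and not tied to any real library):

```python
class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; then calls fail fast."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, retries=2, fallback=None):
        if self.failures >= self.max_failures:
            return fallback  # circuit open: fail fast, don't hammer the service
        for attempt in range(retries + 1):
            try:
                result = fn()
                self.failures = 0  # success closes the circuit again
                return result
            except Exception:
                self.failures += 1
        return fallback  # all retries exhausted: degrade, don't cascade

breaker = CircuitBreaker(max_failures=3)

def flaky():
    raise TimeoutError("recommendation service down")

assert breaker.call(flaky, fallback="default-recs") == "default-recs"
# Circuit is now open: the downed service isn't even called again
assert breaker.call(flaky, fallback="default-recs") == "default-recs"
```

A production breaker would add a half-open state that probes the service after a cooldown; this sketch shows only the fail-fast core.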

The Challenges

  1. Distributed Tracing

    • Hard to debug across 1000+ services

    • Solution: invested in end-to-end distributed tracing (the approach popularized by Twitter’s open-source Zipkin)

  2. Data Consistency

    • No ACID transactions across services

    • Solution: Event-driven architecture, eventual consistency

  3. Operational Complexity

    • 1000+ services to monitor

    • Solution: Extensive automation, self-healing systems

The Results

  • Uptime: 99.99%

  • Deploy 4000+ times per day

  • Independent team velocity

  • Survived AWS outages (multi-region)

The Lesson for You

Microservices solve organizational problems, not just technical ones.

When to use microservices:
- Large team (50+ engineers)
- Need independent scaling
- Different parts change at different rates
- Need fault isolation

When NOT to use microservices:
- Small team (< 10 engineers)
- Simple application
- Limited operational expertise
- Tight coupling between domains


CASE STUDY 10: Stripe’s API Versioning

The Challenge

How do you evolve an API used by millions of businesses without breaking them?

The Problem

Stripe processes $1 trillion annually. Any breaking change could:
- Break customer integrations
- Cause financial loss
- Damage reputation
- Trigger legal issues

The Solution: Frozen API Versions

Key Principle: Once an API version is released, it never changes.

API Version 2020-08-27

Frozen forever

No breaking changes allowed

New features → new version

How It Works

1. Customer Pins to Version

curl https://api.stripe.com/v1/charges \
  -H "Stripe-Version: 2020-08-27" \
  -d amount=2000

2. Stripe Maintains All Versions

// Stripe's backend
function createCharge(data, version) {
  switch (version) {
    case '2020-08-27':
      return createCharge_2020_08_27(data);
    case '2022-11-15':
      return createCharge_2022_11_15(data);
    case '2024-01-01':
      return createCharge_2024_01_01(data);
  }
}

3. Translation Layer

// Convert between versions
function translate(data, from, to) {
  // 2020 format → 2024 format
  if (from === '2020-08-27' && to === '2024-01-01') {
    return {
      amount: data.amount,
      currency: data.currency,
      // New field in 2024
      metadata: data.metadata || {},
      // Renamed field
      payment_method: data.source
    };
  }
}

Migration Strategy

Stripe doesn’t force upgrades. They incentivize:

  1. Deprecation Warnings

    {
      "warning": "This API version will be sunset on 2025-01-01",
      "upgrade_guide": "https://stripe.com/docs/upgrades/2024-01-01"
    }
  2. Feature Gating

    • New features only in new versions

    • Customers upgrade to get features

  3. Dashboard Notifications

    • “You’re using an old API version”

    • “Upgrade to get [new feature]”

  4. Gradual Sunset (3+ year notice)

    Year 1: Announce deprecation
    Year 2: Remind customers
    Year 3: Final warning
    Year 4: Sunset (with migration support)

The Technical Cost

  • Maintain 10+ API versions simultaneously

  • Complex translation layer

  • Extensive testing (all versions)

  • Large engineering investment

The Business Value

  • Zero customer breakage

  • Trust and reliability

  • Enterprises choose Stripe because of API stability

  • Competitive advantage

The Lesson for You

API versioning is expensive but necessary for business-critical APIs.

Apply this:
- Version your API from day one
- Never break existing versions
- Provide upgrade paths
- Give long deprecation notices (12+ months)
- Make migrations easy (auto-migration tools)


SECTION 4 — DEBUGGING WAR STORIES

CASE STUDY 11: The Leap Second Bug (Cloudflare, 2017)

The Incident

Date: December 31, 2016, 23:59:60
Duration: Intermittent issues for hours

The Bug

A leap second caused time to go backwards, breaking systems worldwide.

Normal:
23:59:58
23:59:59
00:00:00 ← New year

With Leap Second:
23:59:58
23:59:59
23:59:60 ← Extra second!
00:00:00

Some systems:
23:59:59
00:00:00
23:59:59 ← Time went backwards!

The Impact

// This code broke
if time.Now().After(lastProcessedTime) {
    processEvent()
    lastProcessedTime = time.Now()
}

// When time went backwards:
// Events processed twice
// Race conditions appeared
// Locks deadlocked

The Fix

// Use monotonic time for comparisons
start := time.Now()
// ... do work ...
duration := time.Since(start) // uses the monotonic clock: never negative

// Never use wall clock time for ordering
// Use sequence numbers instead
sequenceNumber++
eventID := sequenceNumber // Always increasing

The Lesson

Never use wall clock time for ordering events. Use monotonic time or sequence numbers.
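Python draws the same distinction: `time.time()` reads the wall clock, which NTP or a leap second can step backwards, while `time.monotonic()` never decreases, making it the right choice for measuring durations:

```python
import time

# Wall clock: fine for timestamps, unsafe for ordering or durations
timestamp = time.time()

# Monotonic clock: guaranteed never to go backwards within a process
start = time.monotonic()
total = sum(range(100_000))  # stand-in for real work
elapsed = time.monotonic() - start

assert elapsed >= 0       # can never be negative, even if NTP steps the clock
assert timestamp > 0      # wall time is still useful as a human-readable stamp
assert total == 4999950000
```

The Go snippet above gets this for free because `time.Since` reads the monotonic component that `time.Now` carries; in Python you must reach for `time.monotonic()` explicitly.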


CASE STUDY 12: The DNS Mystery (Etsy)

The Incident

Random requests failing. Pattern made no sense.

Symptoms:
- 0.1% of requests fail
- No pattern by user, endpoint, or time
- No correlation with load
- Error: “Cannot resolve hostname”

The Investigation

# Check DNS
dig api.etsy.com ← Works fine

# Check application logs
[ERROR] getaddrinfo: Name or service not known

# Check system limits
ulimit -n
1024                        # max file descriptors

# Count open files
ls /proc/[pid]/fd | wc -l
1023                        # files open!

The Root Cause

On Linux, a DNS lookup opens a socket, and every open socket consumes a file descriptor.

Application has 1024 FD limit

1023 files/sockets already open

DNS query needs 1 FD

No FDs available

DNS query fails

Random request fails

The Fix

# Increase file descriptor limit
ulimit -n 65535

# Make permanent
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf

The Lesson

System limits (file descriptors, ports, memory) cause mysterious failures.

Check these:

ulimit -a            # All limits
lsof | wc -l         # Open files
netstat -an | wc -l  # Network connections
free -h              # Memory
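The same limits can be checked from inside a process at startup, which turns a mysterious runtime failure into an explicit warning. A sketch using Python's standard `resource` module (the 4096 threshold is an arbitrary example, not a universal rule):

```python
import resource

# Current file-descriptor limits for this process: (soft cap, hard cap)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Warn at startup instead of failing mysteriously under load
if soft < 4096:
    print(f"WARNING: only {soft} file descriptors available (hard cap: {hard})")

assert soft != 0  # some limit is always in effect
```

Etsy's 0.1% failure rate would have been a loud startup warning with a check like this in place.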

SECTION 5 — WHAT YOU SHOULD REMEMBER

Universal Lessons from These Stories

1. Design for Failure

  • Network WILL partition

  • Services WILL go down

  • Commands WILL be mistyped

  • Regex WILL catastrophically backtrack

2. Deploy Gradually

  • Start with 1%

  • Monitor closely

  • Have rollback ready

  • Automate rollback triggers

3. Verify Everything

  • Deployments completed

  • Data consistency

  • Performance metrics

  • Resource limits

4. Limit Blast Radius

  • Confirmation for dangerous operations

  • Rate limits on actions

  • Circuit breakers for failures

  • Independent subsystems

5. Monitor Actively

  • Real-time dashboards

  • Automated alerts

  • Anomaly detection

  • Meaningful metrics

6. Have Rollback Plans

  • Every deployment

  • Every migration

  • Every configuration change

  • Every feature flag

7. Remove Dead Code

  • If unused, delete it

  • Don’t disable, remove

  • Technical debt kills

8. Test at Scale

  • Load test at 3x expected

  • Test failure scenarios

  • Test gradual rollouts

  • Test rollbacks


Final Exercise

Study Your Own War Stories

Document your production incidents:

# Incident: [Name]
Date:
Duration:
Impact:

## What Happened
[Narrative timeline]

## Root Cause
[The fundamental issue]

## What We Did Wrong
[Honest assessment]

## What We Fixed
[Permanent fixes]

## What We Learned
[Lessons for future]

Then share with your team. Blameless postmortems build better engineers.


This completes PART XII — Real-World Case Studies & War Stories.

The best engineers learn from others’ mistakes. You now have a decade of hard-won lessons in one document.