LEARNING FROM THE BATTLEFIELD
Theory teaches you principles.
Experience teaches you reality.
But you don’t have time to make every mistake yourself.
This section contains real-world case studies from production systems at scale—the successes, failures, and hard-won lessons that separate textbook knowledge from battle-tested wisdom.
These are the stories that made engineers better.
SECTION 1 — THE GREAT OUTAGES
CASE STUDY 1: AWS S3 Outage (2017)
The Incident
Date: February 28, 2017
Duration: 4 hours
Impact: Major internet disruption
What Happened
An engineer was debugging the S3 billing system and ran a command to remove a small number of servers. A typo caused the command to remove a much larger set of servers, including two critical subsystems.
# Intended command (remove a few servers; commands shown are illustrative)
$ remove-servers --count=5 --subsystem=billing
# Actual command (typo removed a much larger set, including critical subsystems)
$ remove-servers --count=50 --subsystem=s3-critical
The Cascading Failure
S3 servers removed
↓
Index subsystem goes down
↓
S3 can't locate objects
↓
All S3 GET requests fail
↓
Thousands of websites down
↓
AWS status page also uses S3 → status page down
↓
Unable to communicate status to customers
The Impact
- Internet slowed globally
- Slack, Trello, Medium, Quora all down
- IoT devices stopped working
- Smart home lights went offline
- Estimated cost: $150 million
What Went Wrong
- Single command had too much power
- No confirmation for dangerous operations
- No gradual rollout (should remove 1, test, then continue)
- Insufficient blast radius limitation
- Status page had the same dependency
What AWS Fixed
- Added confirmation prompts for dangerous operations
- Implemented rate limiting on server removal
- Created independent status page infrastructure
- Added “dry-run” mode for all operational commands
- Improved monitoring for subsystem health
- Implemented gradual recovery process
The Lesson for You
Design operations so that mistakes are reversible, gradual, and limited in scope.
Apply this:
- Add confirmation for destructive actions
- Implement dry-run modes
- Use gradual rollouts (canary deployments)
- Separate critical dependencies
- Build redundancy into communication systems
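The first two bullets can be sketched as a guard around any destructive operation. This is a hypothetical helper (none of these names are AWS tooling) that defaults to dry-run, caps the batch size, and requires an explicit confirmation phrase:

```python
def remove_servers(targets, dry_run=True, confirm=None, max_batch=1):
    """Hypothetical destructive operation with safety rails built in."""
    if len(targets) > max_batch:
        # Limit blast radius: force small batches
        raise ValueError(
            f"refusing to remove {len(targets)} servers at once (limit {max_batch})"
        )
    if dry_run:
        # Default mode: report what WOULD happen, touch nothing
        return [f"DRY-RUN: would remove {t}" for t in targets]
    if confirm != f"remove {len(targets)} servers":
        # The destructive path requires typing an explicit phrase
        raise PermissionError("confirmation phrase required")
    return [f"removed {t}" for t in targets]
```

The point is the shape, not the names: the safe path is the default, and the dangerous path demands friction.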
CASE STUDY 2: GitHub’s 24-Hour Degradation (2018)
The Incident
Date: October 21, 2018
Duration: 24 hours, 11 minutes (degraded service)
Impact: Site-wide degraded service and stale data
What Happened
A 43-second network partition between the US East and US West data centers caused each side to conclude the other was down and promote itself to primary.
The Split Brain Problem
Normal State:
US-East (Primary) ←→ US-West (Replica)
↓
Data replicates East → West
Network Partition (43 seconds):
US-East (Primary) ✗ US-West (now also Primary)
↓ ↓
Writes to DB Writes to DB
↓ ↓
Databases diverge (split-brain)
Network Restored:
US-East ←→ US-West
↓ ↓
Which is the source of truth?
Data conflict!
The Recovery Challenge
GitHub couldn’t simply pick one database—both had valid writes from different users during the partition.
Recovery Process:
1. Put site in maintenance mode (10 minutes after incident)
2. Stop all writes to prevent more data divergence
3. Restore from backup before partition
4. Replay write logs from BOTH data centers
5. Resolve conflicts manually for overlapping writes
6. Verify data integrity
7. Bring systems back online
Time to recovery: 24 hours
What Went Wrong
- Inadequate split-brain detection
- Both data centers assumed they were correct
- No automatic conflict resolution
- Recovery was manual and slow
- Monitoring didn’t catch the partition quickly enough
What GitHub Fixed
- Improved network partition detection
- Implemented Raft consensus for leader election
- Added automatic conflict resolution
- Better monitoring for cross-datacenter health
- Created runbooks for split-brain scenarios
- Regular disaster recovery drills
The Lesson for You
In distributed systems, network partitions WILL happen. Design for it.
Apply this:
- Implement proper consensus algorithms (Raft, Paxos)
- Never assume network is reliable
- Design for split-brain scenarios
- Have automated recovery procedures
- Regular disaster recovery testing
- Read “Designing Data-Intensive Applications”
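One ingredient of the fix (epoch-based fencing, which leader-election systems such as Raft provide) fits in a toy sketch. `FencedStore` is illustrative, not GitHub's code: a write from a demoted primary carries a stale epoch and is rejected, so a split brain cannot silently diverge:

```python
class FencedStore:
    """Toy store that fences writes by leadership epoch."""

    def __init__(self):
        self.epoch = 0   # highest leadership epoch granted so far
        self.data = {}

    def grant_leadership(self):
        # In reality a consensus layer hands out strictly increasing epochs
        self.epoch += 1
        return self.epoch

    def write(self, leader_epoch, key, value):
        if leader_epoch < self.epoch:
            # A newer primary exists; this writer was demoted
            raise RuntimeError(f"stale leader: epoch {leader_epoch} < {self.epoch}")
        self.data[key] = value
```

Real systems pair this with quorum writes; the sketch only shows why a stale primary's writes must be refused rather than merged later.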
CASE STUDY 3: Knight Capital’s $440 Million Loss (2012)
The Incident
Date: August 1, 2012
Duration: 45 minutes
Impact: $440 million loss, company bankruptcy
What Happened
Knight Capital deployed new trading software. Due to a deployment error, old code was accidentally reactivated on one server.
The Catastrophic Sequence
Days before - New trading software deployed (one of eight servers missed)
↓
09:30 - Markets open
↓
The missed server runs OLD code
↓
Old code contains the dormant "Power Peg" test feature
↓
Feature sends 97 orders per second
↓
Orders execute at market price
↓
Knight buys high, sells low, repeatedly
↓
Engineers scramble to find the cause
↓
10:15 - Trading finally halted
↓
$440 million lost in 45 minutes
The Technical Failure
The old code:
# Old test feature accidentally reactivated
# (illustrative pseudocode; the real system was not Python)
import time

def power_peg_test():
    while True:
        place_market_order(random_stock())  # stand-ins for the real order flow
        time.sleep(0.01)                    # ~100 orders per second, nonstop
Why it happened:
1. Deployment script skipped one server
2. Old code had a test flag still active
3. No verification that all servers deployed correctly
4. No kill switch for runaway trading
5. No position limits enforced in code
What Went Wrong
- Incomplete deployment (1 of 8 servers missed)
- No deployment verification
- Dead code not removed (Power Peg feature unused but still in codebase)
- No circuit breaker for abnormal trading
- No real-time monitoring of order volumes
- No automated position limits
What the Industry Learned
- Deployment verification mandatory
  deploy.sh --verify-all-servers
- Remove dead code aggressively
  - If it’s not used, delete it
  - Don’t leave “disabled” features
- Implement circuit breakers
  if orders_per_second > THRESHOLD:
      emergency_stop()
      alert_engineers()
- Position limits in code
  if total_position > MAX_POSITION:
      reject_order()
- Real-time monitoring dashboards
  - Order volume per second
  - Position sizes
  - P&L tracking
The Lesson for You
Deploy safely. Monitor actively. Fail fast.
Apply this:
- Verify deployments completed on ALL servers
- Remove dead code (it WILL hurt you)
- Implement circuit breakers for dangerous operations
- Add limits in code, not just config
- Monitor critical metrics in real-time
- Have kill switches for automated systems
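The circuit-breaker and kill-switch bullets can be sketched as a small rate guard. `OrderRateBreaker` is hypothetical (Knight famously had no such guard); once tripped, it stays tripped until a human resets it:

```python
import time

class OrderRateBreaker:
    """Trip permanently once orders-per-second exceeds a threshold."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock            # injectable for testing
        self.window_start = clock()
        self.count = 0
        self.tripped = False

    def allow_order(self):
        if self.tripped:
            return False              # kill switch: reject everything
        now = self.clock()
        if now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0   # new one-second window
        self.count += 1
        if self.count > self.max_per_second:
            self.tripped = True       # emergency stop; page humans here
            return False
        return True
```

The deliberate design choice: the breaker fails closed. A runaway system should stop trading entirely, not throttle and continue.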
CASE STUDY 4: Cloudflare’s 27-Minute Global Outage (2019)
The Incident
Date: July 2, 2019
Duration: 27 minutes
Impact: 50% of global HTTP requests failed
What Happened
A regex pattern in a WAF rule caused CPU to spike to 100% across all servers, taking down half the internet.
The Deadly Regular Expression
# The problematic regex (simplified)
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
Why it was deadly:
- Catastrophic backtracking on certain inputs
- Input: x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x!
- Regex engine tries billions of combinations
- Single request takes seconds of CPU time
The Cascading Failure
New WAF rule deployed globally
↓
User sends request matching bad regex
↓
Server CPU spikes to 100%
↓
All requests on that server slow down
↓
Same pattern hits other servers
↓
Global CPU at 100%
↓
Requests timeout
↓
50% of global traffic fails
What Went Wrong
- No regex performance testing
- Global deployment (no canary)
- Insufficient CPU monitoring during deployment
- No automatic rollback on performance degradation
- Regex complexity not reviewed
What Cloudflare Fixed
- Regex performance testing in CI/CD
  test('regex performance', () => {
    const start = Date.now();
    const result = regex.test(evilInput);
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(100); // 100ms max
  });
- Canary deployments for rule changes
  - Deploy to 1% of servers
  - Monitor CPU, latency, error rates
  - Gradual rollout if healthy
- Automatic rollback triggers
  if (cpu_usage > 90% for 60 seconds) {
    rollback()
  }
- CPU budgets for regex
  - Kill regex execution after 50ms
  - Return error rather than hang
- RE2 library (guaranteed linear time)
  - No catastrophic backtracking
  - Predictable performance
The Lesson for You
Test performance, not just functionality. Deploy gradually.
Apply this:
- Performance test regex patterns
- Use safe regex libraries (re2)
- Implement timeouts for expensive operations
- Deploy gradually (canary deployments)
- Monitor resource usage during deployments
- Automatic rollback on anomalies
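The canary-plus-rollback loop is small enough to sketch directly. `deploy`, `rollback`, and `get_cpu` are hypothetical hooks into your own infrastructure:

```python
def canary_rollout(stages, get_cpu, deploy, rollback, cpu_limit=90):
    """Deploy to increasing percentages; roll back on CPU regression."""
    for pct in stages:               # e.g. (1, 10, 50, 100)
        deploy(pct)                  # push the change to pct% of servers
        if get_cpu() > cpu_limit:    # health check after each stage
            rollback()
            return f"rolled back at {pct}%"
    return "fully deployed"
```

A real trigger would watch latency and error rates over a time window, not a single CPU sample, but the control flow is the same.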
CASE STUDY 5: Slack’s Database Migration (2023)
The Challenge
Migrate from sharded MySQL to Vitess without downtime for millions of users.
The Stakes
- 100+ million users
- Zero downtime tolerance
- Petabytes of data
- Tens of thousands of queries per second
The Strategy: Dual Writes
Phase 1: Preparation (3 months)
├── Build Vitess cluster
├── Set up replication
└── Test tooling
Phase 2: Shadow Traffic (2 months)
├── Mirror reads to Vitess
├── Compare results
└── Fix discrepancies
Phase 3: Dual Writes (4 months)
├── Write to BOTH MySQL and Vitess
├── Read from MySQL (source of truth)
├── Compare data continuously
└── Fix inconsistencies
Phase 4: Read Cutover (1 month)
├── Shift reads to Vitess gradually
├── 1% → 10% → 50% → 100%
├── Monitor latency and errors
└── Rollback capability maintained
Phase 5: Write Cutover (2 weeks)
├── Make Vitess source of truth
├── Stop dual writes
└── Decommission old MySQL
Total time: 12 months
The Technical Implementation
// Dual write with verification
func WriteUser(user User) error {
    // Write to new system (Vitess)
    errNew := vitess.Write(user)
    // Write to old system (MySQL)
    errOld := mysql.Write(user)
    // Both must succeed
    if errNew != nil || errOld != nil {
        return fmt.Errorf("dual write failed")
    }
    // Async verification
    go verifyConsistency(user.ID)
    return nil
}

// Gradual read migration
func ReadUser(userID string) (User, error) {
    // Traffic splitting
    if shouldReadFromVitess(userID) {
        user, err := vitess.Read(userID)
        if err != nil {
            // Fallback to MySQL
            return mysql.Read(userID)
        }
        return user, nil
    }
    return mysql.Read(userID)
}

// Percentage-based rollout
func shouldReadFromVitess(userID string) bool {
    hash := hash(userID)
    return (hash % 100) < currentRolloutPercentage
}
What Went Right
- Gradual migration (not big bang)
- Continuous verification of data consistency
- Multiple rollback points
- Extensive monitoring at each phase
- Dark launch (shadow traffic before real traffic)
- Clear success criteria for each phase
The Lesson for You
Large migrations require patience, verification, and rollback plans.
Apply this:
- Never do big-bang migrations
- Implement dual writes for safety
- Verify data consistency continuously
- Use percentage-based rollouts
- Maintain rollback capability throughout
- Have clear success metrics for each phase
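A Python mirror of the Go rollout snippet above (the choice of hash is an assumption; any stable hash works). The key property: buckets are sticky, so raising the percentage only ever adds users to the new path, never flips one back:

```python
import hashlib

def in_rollout(user_id: str, percentage: int) -> bool:
    """Deterministically place user_id into one of 100 buckets."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage   # percentage=10 admits buckets 0 through 9
```

Because the bucket depends only on the user ID, each user sees a consistent system throughout the cutover, which keeps the dual-write comparison meaningful.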
SECTION 2 — SCALING VICTORIES
CASE STUDY 6: Discord’s Elixir Migration
The Problem
Discord hit 5 million concurrent users and its Rails backend couldn’t keep up.
Symptoms:
- Message latency: 200-500ms
- Server CPU: 80-90%
- Growing infrastructure costs
- Degraded user experience
The Decision
Migrate from Ruby on Rails to Elixir/Phoenix.
Why Elixir?
- Built on Erlang VM (designed for concurrency)
- Actor model (isolated processes)
- Fault tolerance (let it crash philosophy)
- Soft real-time guarantees
The Results
Before (Rails):
- 5 million concurrent users
- 40+ servers
- 200-500ms latency
- Frequent outages
After (Elixir):
- 50+ million concurrent users (10x)
- 12 servers (70% reduction)
- 10-50ms latency (90% improvement)
- Near-zero outages
Cost savings: ~$1 million/year
The Implementation Strategy
Phase 1: Proof of Concept
├── Build message delivery in Elixir
├── Run parallel with Rails
└── Compare performance
Phase 2: Critical Path Migration
├── Migrate message sending
├── Migrate presence system
└── Keep everything else in Rails
Phase 3: Gradual Feature Migration
├── Voice/video calling
├── User profiles
├── Guild management
└── Analytics
Phase 4: Complete Migration
└── Decommission Rails entirely
Key Technical Insights
The Elixir Advantage:
# Each user connection is an isolated process
defmodule UserConnection do
  use GenServer

  def handle_info({:message, msg}, state) do
    # Process runs independently
    # Crash doesn't affect others
    # Automatic restart by supervisor
    send_to_client(msg)
    {:noreply, state}
  end
end

# Supervisor ensures fault tolerance
children = [
  {DynamicSupervisor, strategy: :one_for_one, name: UserSupervisor}
]

# Millions of processes, each isolated
# If one crashes, others unaffected
The Lesson for You
Choose technology that matches your scale requirements, not what’s popular.
Apply this:
- Understand your bottlenecks before choosing solutions
- Consider concurrency model for real-time systems
- Run proof of concept before full migration
- Migrate critical path first, iterate on the rest
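The "isolated processes" idea has a rough analogue in most runtimes. A minimal Python/asyncio sketch (`handle` and `supervisor` are illustrative names, nothing Discord-specific) showing that one crashed connection handler does not take down its siblings:

```python
import asyncio

# Rough analogue of Elixir's isolated processes: each connection gets its
# own task, and a crash in one is captured as a value instead of
# propagating to the others.
async def handle(user, fail=False):
    if fail:
        raise RuntimeError(f"{user} crashed")
    return f"{user} ok"

async def supervisor(users):
    # return_exceptions=True isolates failures; a real supervisor
    # would also restart the failed handler
    return await asyncio.gather(
        *(handle(u, fail=(u == "bad")) for u in users),
        return_exceptions=True,
    )

results = asyncio.run(supervisor(["a", "bad", "c"]))
```

The BEAM goes much further (preemptive scheduling, per-process heaps, supervision trees), but the fault-isolation contract is the part worth copying anywhere.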
CASE STUDY 7: Instagram’s Python at Scale
The Challenge
“Python doesn’t scale” — so how does Instagram serve 2+ billion users with Python?
The Truth
Instagram runs one of the largest Python deployments in the world.
Scale:
- 2+ billion users
- 100+ million photos per day
- Thousands of servers
- Python everywhere (Django)
How They Made Python Scale
1. Master the Bottlenecks (It’s Not Python)
# The bottleneck is rarely Python itself. Usually it's:
# - Database queries (fixed with caching)
# - Network I/O (fixed with async)
# - Algorithmic complexity (fixed with better algorithms)

# Example: Feed generation
# Bad: O(posts x all followers) -- checks every follower for every post
for post in posts:
    for user in followers:
        if should_show(user, post):
            feeds[user].append(post)

# Good: an index means each post only touches its author's followers
for post in posts:
    followers = follower_index[post.author]
    for follower in followers:
        feeds[follower].append(post)
2. Aggressive Caching
# Cache layers
┌─────────────────────────┐
│ Client Cache (60s) │
├─────────────────────────┤
│ CDN (300s) │
├─────────────────────────┤
│ Memcached (3600s) │
├─────────────────────────┤
│ Database │
└─────────────────────────┘
# 99% of requests never hit database
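The pattern behind those layers is cache-aside: check the cache, fall through to the backend on a miss, and store the result with a TTL. A generic in-process sketch (a dict standing in for Memcached; nothing Instagram-specific):

```python
import time

def make_cached(loader, ttl_seconds, clock=time.monotonic):
    """Wrap a loader (the 'database') in a TTL cache."""
    cache = {}

    def get(key):
        hit = cache.get(key)
        if hit and clock() - hit[1] < ttl_seconds:
            return hit[0]              # fresh: never touches the backend
        value = loader(key)            # miss or expired: hit the backend
        cache[key] = (value, clock())
        return value

    return get
```

Each layer in the diagram is this same idea with a different store and TTL; the 99% figure falls out of stacking them.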
3. Asynchronous Processing
# Don't do expensive work in the request -- queue it

@app.route('/post/photo')
def post_photo(request):
    photo = request.files['photo']
    # Quick: Save to S3, return immediately
    photo_url = upload_to_s3(photo)
    photo_id = db.create_photo(photo_url)
    # Slow: Queue for processing
    queue.publish('photo.process', {
        'photo_id': photo_id,
        'tasks': ['resize', 'filter', 'face_detection']
    })
    return {'photo_id': photo_id}  # Fast response

# Worker handles the slow tasks
@worker.subscribe('photo.process')
def process_photo(message):
    resize_photo(message['photo_id'])
    apply_filters(message['photo_id'])
    detect_faces(message['photo_id'])
4. Django Optimization
# Use select_related to avoid N+1 queries

# Bad: 1 + N queries
posts = Post.objects.all()
for post in posts:
    print(post.author.username)  # New query each time!

# Good: 1 query with a JOIN
posts = Post.objects.select_related('author').all()
for post in posts:
    print(post.author.username)  # No extra query

# Use prefetch_related for many-to-many
posts = Post.objects.prefetch_related('likes').all()
5. Profiling Everything
import cProfile

# Profile slow endpoints
@app.route('/feed')
@profile_this
def get_feed(request):
    # Instagram profiles every endpoint,
    # identifies bottlenecks,
    # and optimizes aggressively
    pass
The Results
- Serves 2+ billion users with Python
- Latency: p99 < 500ms
- Cost-effective operation
- Developer velocity remains high
The Lesson for You
Language choice matters less than architecture, caching, and algorithm quality.
Apply this:
- Profile before optimizing
- Cache aggressively
- Use async for I/O
- Optimize database queries
- Don’t blame the language, fix the architecture
CASE STUDY 8: Shopify’s Black Friday/Cyber Monday
The Challenge
Handle 10,000+ requests per second during peak shopping days.
Peak Load Stats (2023)
- 76 million shoppers
- $9.3 billion in sales
- Peak: 11,000 requests/second
- Zero downtime
How They Prepared
1. Load Testing (3 months before)
# Simulate peak load at 3x expected traffic
# to identify breaking points
LoadTest.configure do |config|
  config.target_rps = 30_000   # 3x expected peak
  config.duration = '1 hour'
  config.ramp_up = '5 minutes'
end
# Run weekly, fix bottlenecks
2. Database Optimization
# Sharding strategy: each merchant lives on a specific shard,
# so no single database can be overloaded
class Merchant < ApplicationRecord
  shard_key :id

  def shard_id
    id % 1000   # 1000 shards
  end
end

# Read replicas scale reads horizontally (10+ per shard)
primary = Database.primary
replicas = Database.replicas

# Route reads to replicas
def find_product(id)
  replicas.sample.query("SELECT * FROM products WHERE id = ?", id)
end
3. Caching Everything
# Cache hit rate: 99%+
# Layers:
#   1. Browser cache (static assets)
#   2. CDN (Fastly)
#   3. Application cache (Redis)
#   4. Database query cache

# Example: Product page
def product_page(product_id)
  cache_key = "product:#{product_id}:v2"
  Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
    Product.find(product_id).to_json
  end
end
4. Queue Everything Non-Critical
# Request time budget: 200ms; anything slower gets queued
def create_order(params)
  # Fast: Create order record
  order = Order.create!(params)          # ~50ms
  # Slow: Queue everything else
  OrderProcessingJob.perform_later(order.id)
  EmailJob.perform_later(order.id)
  AnalyticsJob.perform_later(order.id)
  # Return quickly
  { order_id: order.id }                 # Total: ~50ms
end
5. Graceful Degradation
# If the system is overloaded, disable non-critical features
def feature_enabled?(feature)
  return true unless system_load > 80   # everything on under normal load

  case feature
  when :recommendations
    false   # shed recommendations under load
  when :reviews
    false   # shed reviews under load
  when :checkout
    true    # ALWAYS keep checkout working
  else
    true
  end
end
6. Real-Time Monitoring
- Engineers watching dashboards live
- Automated scaling (if load > 70%, add servers)
- Kill switches for non-essential features
- On-call team ready to respond
The Results
Zero downtime during peak shopping period.
The Lesson for You
Peak load requires months of preparation, aggressive caching, and graceful degradation.
Apply this:
- Load test at 3x expected traffic
- Cache everything possible
- Queue non-critical operations
- Plan for graceful degradation
- Monitor in real-time during peak
- Have kill switches ready
SECTION 3 — ARCHITECTURAL TRANSFORMATIONS
CASE STUDY 9: Netflix’s Microservices Journey
The Evolution
2008: Monolith on Physical Servers
- Single large application
- Takes 16 minutes to start
- Deploys once every 2 weeks
- Outages take entire site down
2012: Migration to AWS + Microservices
- 800+ microservices
- Deploy 1000s of times per day
- Independent scaling
- Fault isolation
2024: 1000+ Microservices
- Extreme resilience
- Global scale
- Chaos engineering
- 200+ million subscribers
Why They Did It
Monolith Problems:
┌─────────────────────────────────────┐
│ SINGLE MONOLITH │
│ ┌──────────────────────────────┐ │
│ │ Recommendations │ │
│ │ Streaming │ │
│ │ User Management │ │
│ │ Billing │ │
│ │ Content Management │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Problems:
- Deploy ALL or nothing
- Bug in one area breaks everything
- Can't scale parts independently
- Slow development (team conflicts)
Microservices Benefits:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Recommendations│ │ Streaming │ │ Billing │
└──────────────┘ └──────────────┘ └──────────────┘
↓ ↓ ↓
Deploy Deploy Deploy
independently independently independently
↓ ↓ ↓
Scale Scale Scale
independently independently independently
The Migration Strategy
Phase 1: API Gateway
Extract API layer from monolith
Route traffic through gateway
Gradually move endpoints to services
Phase 2: Extract Services One-by-One
Priority order:
1. User authentication (most traffic)
2. Recommendations (frequently changing)
3. Billing (critical, rarely changes)
4. Content management
Phase 3: Data Migration
Each service gets its own database
No shared database between services
Use events for data sync
Key Patterns They Developed
1. Circuit Breaker Pattern
// Sketch: if the recommendation service is down, return default
// recommendations instead of letting the failure cascade
@HystrixCommand(fallbackMethod = "defaultRecommendations")
public List<Movie> getRecommendations(User user) {
    return recommendationClient.fetch(user); // illustrative call; may fail
}
2. Chaos Engineering
Randomly kill services in production
Test: Does system survive?
Result: Build resilient systems
Tool: Chaos Monkey
3. Resilience Library (Hystrix)
Every service call goes through:
- Circuit breaker
- Retry logic
- Timeout enforcement
- Metrics collection
The Challenges
- Distributed Tracing
  - Hard to debug across 1000+ services
  - Solution: invested heavily in distributed-tracing tooling
- Data Consistency
  - No ACID transactions across services
  - Solution: Event-driven architecture, eventual consistency
- Operational Complexity
  - 1000+ services to monitor
  - Solution: Extensive automation, self-healing systems
The Results
- Uptime: 99.99%
- Deploy 4000+ times per day
- Independent team velocity
- Survived AWS outages (multi-region)
The Lesson for You
Microservices solve organizational problems, not just technical ones.
When to use microservices:
- Large team (50+ engineers)
- Need independent scaling
- Different parts change at different rates
- Need fault isolation
When NOT to use microservices:
- Small team (< 10 engineers)
- Simple application
- Limited operational expertise
- Tight coupling between domains
CASE STUDY 10: Stripe’s API Versioning
The Challenge
How do you evolve an API used by millions of businesses without breaking them?
The Problem
Stripe processes $1 trillion annually. Any breaking change could:
- Break customer integrations
- Cause financial loss
- Damage reputation
- Trigger legal issues
The Solution: Frozen API Versions
Key Principle: Once an API version is released, it never changes.
API Version 2020-08-27
↓
Frozen forever
↓
No breaking changes allowed
↓
New features → new version
How It Works
1. Customer Pins to Version
curl https://api.stripe.com/v1/charges \
-H "Stripe-Version: 2020-08-27" \
-d amount=2000
2. Stripe Maintains All Versions
// Stripe's backend (conceptually)
function createCharge(data, version) {
  switch (version) {
    case '2020-08-27':
      return createCharge_2020_08_27(data);
    case '2022-11-15':
      return createCharge_2022_11_15(data);
    case '2024-01-01':
      return createCharge_2024_01_01(data);
  }
}
3. Translation Layer
// Convert between versions
function translate(data, from, to) {
  // 2020 format
  if (from === '2020-08-27' && to === '2024-01-01') {
    return {
      amount: data.amount,
      currency: data.currency,
      // New field in 2024
      metadata: data.metadata || {},
      // Renamed field
      payment_method: data.source
    };
  }
}
Migration Strategy
Stripe doesn’t force upgrades. They incentivize:
- Deprecation Warnings
  {
    "warning": "This API version will be sunset on 2025-01-01",
    "upgrade_guide": "https://stripe.com/docs/upgrades/2024-01-01"
  }
- Feature Gating
  - New features only in new versions
  - Customers upgrade to get features
- Dashboard Notifications
  - “You’re using an old API version”
  - “Upgrade to get [new feature]”
- Gradual Sunset (3+ year notice)
  Year 1: Announce deprecation
  Year 2: Remind customers
  Year 3: Final warning
  Year 4: Sunset (with migration support)
The Technical Cost
- Maintain 10+ API versions simultaneously
- Complex translation layer
- Extensive testing (all versions)
- Large engineering investment
The Business Value
- Zero customer breakage
- Trust and reliability
- Enterprises choose Stripe because of API stability
- Competitive advantage
The Lesson for You
API versioning is expensive but necessary for business-critical APIs.
Apply this:
- Version your API from day one
- Never break existing versions
- Provide upgrade paths
- Give long deprecation notices (12+ months)
- Make migrations easy (auto-migration tools)
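The translation layer above, condensed into a runnable sketch. The field rename (`source` to `payment_method`) follows the JavaScript snippet; the version strings and everything else are illustrative:

```python
def translate_charge(params, version):
    """Normalize a request pinned to an old version into the current shape."""
    if version == "2020-08-27":
        return {
            "amount": params["amount"],
            "currency": params["currency"],
            "payment_method": params["source"],      # renamed field
            "metadata": params.get("metadata", {}),  # newer field, defaulted
        }
    if version == "2024-01-01":
        return dict(params)   # already the current shape
    raise ValueError(f"unknown API version: {version}")
```

Translating at the edge means core logic only ever sees one format, which is what makes supporting 10+ frozen versions tractable.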
SECTION 4 — DEBUGGING WAR STORIES
CASE STUDY 11: The Leap Second Bug (Cloudflare, 2017)
The Incident
Date: December 31, 2016, 23:59:60
Duration: Intermittent issues for hours
The Bug
A leap second caused time to go backwards, breaking systems worldwide.
Normal:
23:59:58
23:59:59
00:00:00 ← New year
With Leap Second:
23:59:58
23:59:59
23:59:60 ← Extra second!
00:00:00
Some systems:
23:59:59
00:00:00
23:59:59 ← Time went backwards!
The Impact
// This code broke (original compared times with >, which Go doesn't allow;
// the intent was wall-clock comparison)
if time.Now().After(lastProcessedTime) {
    processEvent()
    lastProcessedTime = time.Now()
}

// When time went backwards:
// - Events were processed twice
// - Race conditions appeared
// - Locks deadlocked
The Fix
// Use monotonic time for comparisons
start := time.Now()
// ... do work ...
duration := time.Since(start) // uses the monotonic clock; never negative

// Never use wall-clock time for ordering.
// Use sequence numbers instead:
eventID := nextSequenceNumber() // always increasing
The Lesson
Never use wall clock time for ordering events. Use monotonic time or sequence numbers.
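In Python the same two fixes look like this: `time.monotonic()` for durations (immune to leap seconds and NTP jumps) and a sequence counter for ordering:

```python
import itertools
import time

# Durations: the monotonic clock can never go backwards
start = time.monotonic()
time.sleep(0.01)                         # ... do work ...
elapsed = time.monotonic() - start       # always >= 0

# Ordering: sequence numbers, independent of any clock
seq = itertools.count()
event_a, event_b = next(seq), next(seq)  # strictly increasing
```

`time.time()` offers neither guarantee; reach for it only when you actually need the wall-clock date.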
CASE STUDY 12: The DNS Mystery (Etsy)
The Incident
Random requests failing. Pattern made no sense.
Symptoms:
- 0.1% of requests fail
- No pattern by user, endpoint, or time
- No correlation with load
- Error: “Cannot resolve hostname”
The Investigation
# Check DNS
$ dig api.etsy.com            # works fine
# Check application logs
[ERROR] getaddrinfo: Name or service not known
# Check system limits
$ ulimit -n                   # 1024 (max file descriptors)
# Count open file descriptors
$ ls /proc/<pid>/fd | wc -l   # 1023 files open!
The Root Cause
Every DNS lookup opens a socket, and every socket consumes a file descriptor.
Application has 1024 FD limit
↓
1023 files/sockets already open
↓
DNS query needs 1 FD
↓
No FDs available
↓
DNS query fails
↓
Random request fails
The Fix
# Increase file descriptor limit
ulimit -n 65535
# Make permanent
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
The Lesson
System limits (file descriptors, ports, memory) cause mysterious failures.
Check these:
ulimit -a # All limits
lsof | wc -l # Open files
netstat -an | wc -l # Network connections
free -h # Memory
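From inside a process, the Unix-only `resource` module reads the same limit that `ulimit -n` shows, which makes a useful startup sanity check:

```python
import resource

# RLIMIT_NOFILE caps open files AND sockets; DNS lookups need a socket,
# so exhausting it makes name resolution fail mysteriously.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft} hard={hard}")
# An unprivileged process may raise its own soft limit up to the hard cap:
#   resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Logging this at boot (and alerting when open-descriptor count approaches the soft limit) turns the Etsy mystery into a routine dashboard alarm.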
SECTION 5 — WHAT YOU SHOULD REMEMBER
Universal Lessons from These Stories
1. Design for Failure
- Network WILL partition
- Services WILL go down
- Commands WILL be mistyped
- Regex WILL catastrophically backtrack
2. Deploy Gradually
- Start with 1%
- Monitor closely
- Have rollback ready
- Automate rollback triggers
3. Verify Everything
- Deployments completed
- Data consistency
- Performance metrics
- Resource limits
4. Limit Blast Radius
- Confirmation for dangerous operations
- Rate limits on actions
- Circuit breakers for failures
- Independent subsystems
5. Monitor Actively
- Real-time dashboards
- Automated alerts
- Anomaly detection
- Meaningful metrics
6. Have Rollback Plans
- Every deployment
- Every migration
- Every configuration change
- Every feature flag
7. Remove Dead Code
- If unused, delete it
- Don’t disable, remove
- Technical debt kills
8. Test at Scale
- Load test at 3x expected
- Test failure scenarios
- Test gradual rollouts
- Test rollbacks
Final Exercise
Study Your Own War Stories
Document your production incidents:
# Incident: [Name]
Date:
Duration:
Impact:
## What Happened
[Narrative timeline]
## Root Cause
[The fundamental issue]
## What We Did Wrong
[Honest assessment]
## What We Fixed
[Permanent fixes]
## What We Learned
[Lessons for future]
Then share with your team. Blameless postmortems build better engineers.
This completes PART XII — Real-World Case Studies & War Stories.
The best engineers learn from others’ mistakes. You now have a decade of hard-won lessons in one document.