LEARNING FROM THE BATTLEFIELD
Theory teaches you principles.
Experience teaches you reality.
But you don’t have time to make every mistake yourself.
This section contains real-world case studies from production systems at scale—the successes, failures, and hard-won lessons that separate textbook knowledge from battle-tested wisdom.
These are the stories that made engineers better.
SECTION 1 — THE GREAT OUTAGES
CASE STUDY 1: AWS S3 Outage (2017)
The Incident
Date: February 28, 2017
Duration: 4 hours
Impact: Major internet disruption
What Happened
An engineer was debugging the S3 billing system and ran a command to remove a small number of servers. A typo caused the command to remove a much larger set of servers, including two critical subsystems.
# Intended command (remove a few servers; commands shown are illustrative)
$ remove-servers --count=5 --subsystem=billing
# Actual command (typo removed a much larger set, including critical subsystems)
$ remove-servers --count=50 --subsystem=s3-critical
The Cascading Failure
S3 servers removed
↓
Index subsystem goes down
↓
S3 can't locate objects
↓
All S3 GET requests fail
↓
Thousands of websites down
↓
AWS status page also uses S3 → status page down
↓
Unable to communicate status to customers
The Impact
- Internet slowed globally
- Slack, Trello, Medium, Quora all down
- IoT devices stopped working
- Smart home lights went offline
- Estimated cost: $150 million
What Went Wrong
- Single command had too much power
- No confirmation for dangerous operations
- No gradual rollout (should remove 1, test, then continue)
- Insufficient blast radius limitation
- Status page had the same dependency
What AWS Fixed
- Added confirmation prompts for dangerous operations
- Implemented rate limiting on server removal
- Created independent status page infrastructure
- Added “dry-run” mode for all operational commands
- Improved monitoring for subsystem health
- Implemented gradual recovery process
The Lesson for You
Design operations so that mistakes are reversible, gradual, and limited in scope.
Apply this:
- Add confirmation for destructive actions
- Implement dry-run modes
- Use gradual rollouts (canary deployments)
- Separate critical dependencies
- Build redundancy into communication systems
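The first two bullets can be sketched as a guard around any destructive operation. This is a hypothetical helper (none of these names are AWS tooling) that defaults to dry-run, caps the batch size, and requires an explicit confirmation phrase:

```python
def remove_servers(targets, dry_run=True, confirm=None, max_batch=1):
    """Hypothetical destructive operation with safety rails built in."""
    if len(targets) > max_batch:
        # Limit blast radius: force small batches
        raise ValueError(
            f"refusing to remove {len(targets)} servers at once (limit {max_batch})"
        )
    if dry_run:
        # Default mode: report what WOULD happen, touch nothing
        return [f"DRY-RUN: would remove {t}" for t in targets]
    if confirm != f"remove {len(targets)} servers":
        # The destructive path requires typing an explicit phrase
        raise PermissionError("confirmation phrase required")
    return [f"removed {t}" for t in targets]
```

The point is the shape, not the names: the safe path is the default, and the dangerous path demands friction.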
CASE STUDY 2: GitHub’s 24-Hour Degradation (2018)
The Incident
Date: October 21, 2018
Duration: 24 hours, 11 minutes (degraded service)
Impact: Site-wide degraded service and stale data
What Happened
A 43-second network partition between the US East and US West data centers caused each side to conclude the other was down and promote itself to primary.
The Split Brain Problem
Normal State:
US-East (Primary) ←→ US-West (Replica)
↓
Data replicates East → West
Network Partition (43 seconds):
US-East (Primary) ✗ US-West (now also Primary)
↓ ↓
Writes to DB Writes to DB
↓ ↓
Databases diverge (split-brain)
Network Restored:
US-East ←→ US-West
↓ ↓
Which is the source of truth?
Data conflict!
The Recovery Challenge
GitHub couldn’t simply pick one database—both had valid writes from different users during the partition.
Recovery Process:
1. Put site in maintenance mode (10 minutes after incident)
2. Stop all writes to prevent more data divergence
3. Restore from backup before partition
4. Replay write logs from BOTH data centers
5. Resolve conflicts manually for overlapping writes
6. Verify data integrity
7. Bring systems back online
Time to recovery: 24 hours
What Went Wrong
- Inadequate split-brain detection
- Both data centers assumed they were correct
- No automatic conflict resolution
- Recovery was manual and slow
- Monitoring didn’t catch the partition quickly enough
What GitHub Fixed
- Improved network partition detection
- Implemented Raft consensus for leader election
- Added automatic conflict resolution
- Better monitoring for cross-datacenter health
- Created runbooks for split-brain scenarios
- Regular disaster recovery drills
The Lesson for You
In distributed systems, network partitions WILL happen. Design for it.
Apply this:
- Implement proper consensus algorithms (Raft, Paxos)
- Never assume network is reliable
- Design for split-brain scenarios
- Have automated recovery procedures
- Regular disaster recovery testing
- Read “Designing Data-Intensive Applications”
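One ingredient of the fix (epoch-based fencing, which leader-election systems such as Raft provide) fits in a toy sketch. `FencedStore` is illustrative, not GitHub's code: a write from a demoted primary carries a stale epoch and is rejected, so a split brain cannot silently diverge:

```python
class FencedStore:
    """Toy store that fences writes by leadership epoch."""

    def __init__(self):
        self.epoch = 0   # highest leadership epoch granted so far
        self.data = {}

    def grant_leadership(self):
        # In reality a consensus layer hands out strictly increasing epochs
        self.epoch += 1
        return self.epoch

    def write(self, leader_epoch, key, value):
        if leader_epoch < self.epoch:
            # A newer primary exists; this writer was demoted
            raise RuntimeError(f"stale leader: epoch {leader_epoch} < {self.epoch}")
        self.data[key] = value
```

Real systems pair this with quorum writes; the sketch only shows why a stale primary's writes must be refused rather than merged later.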
CASE STUDY 3: Knight Capital’s $440 Million Loss (2012)
The Incident
Date: August 1, 2012
Duration: 45 minutes
Impact: $440 million loss, company bankruptcy
What Happened
Knight Capital deployed new trading software. Due to a deployment error, old code was accidentally reactivated on one server.
The Catastrophic Sequence
Days before - New trading software deployed (one of eight servers missed)
↓
09:30 - Markets open
↓
The missed server runs OLD code
↓
Old code contains the dormant "Power Peg" test feature
↓
Feature sends 97 orders per second
↓
Orders execute at market price
↓
Knight buys high, sells low, repeatedly
↓
Engineers scramble to find the cause
↓
10:15 - Trading finally halted
↓
$440 million lost in 45 minutes
The Technical Failure
The old code:
# Old test feature accidentally reactivated
# (illustrative pseudocode; the real system was not Python)
import time

def power_peg_test():
    while True:
        place_market_order(random_stock())  # stand-ins for the real order flow
        time.sleep(0.01)                    # ~100 orders per second, nonstop
Why it happened:
1. Deployment script skipped one server
2. Old code had a test flag still active
3. No verification that all servers deployed correctly
4. No kill switch for runaway trading
5. No position limits enforced in code
What Went Wrong
- Incomplete deployment (1 of 8 servers missed)
- No deployment verification
- Dead code not removed (Power Peg feature unused but still in codebase)
- No circuit breaker for abnormal trading
- No real-time monitoring of order volumes
- No automated position limits
What the Industry Learned
- Deployment verification mandatory
  deploy.sh --verify-all-servers
- Remove dead code aggressively
  - If it’s not used, delete it
  - Don’t leave “disabled” features
- Implement circuit breakers
  if orders_per_second > THRESHOLD:
      emergency_stop()
      alert_engineers()
- Position limits in code
  if total_position > MAX_POSITION:
      reject_order()
- Real-time monitoring dashboards
  - Order volume per second
  - Position sizes
  - P&L tracking
The Lesson for You
Deploy safely. Monitor actively. Fail fast.
Apply this:
- Verify deployments completed on ALL servers
- Remove dead code (it WILL hurt you)
- Implement circuit breakers for dangerous operations
- Add limits in code, not just config
- Monitor critical metrics in real-time
- Have kill switches for automated systems
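The circuit-breaker and kill-switch bullets can be sketched as a small rate guard. `OrderRateBreaker` is hypothetical (Knight famously had no such guard); once tripped, it stays tripped until a human resets it:

```python
import time

class OrderRateBreaker:
    """Trip permanently once orders-per-second exceeds a threshold."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock            # injectable for testing
        self.window_start = clock()
        self.count = 0
        self.tripped = False

    def allow_order(self):
        if self.tripped:
            return False              # kill switch: reject everything
        now = self.clock()
        if now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0   # new one-second window
        self.count += 1
        if self.count > self.max_per_second:
            self.tripped = True       # emergency stop; page humans here
            return False
        return True
```

The deliberate design choice: the breaker fails closed. A runaway system should stop trading entirely, not throttle and continue.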
CASE STUDY 4: Cloudflare’s 27-Minute Global Outage (2019)
The Incident
Date: July 2, 2019
Duration: 27 minutes
Impact: 50% of global HTTP requests failed
What Happened
A regex pattern in a WAF rule caused CPU to spike to 100% across all servers, taking down half the internet.
The Deadly Regular Expression
# The problematic regex (simplified)
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
Why it was deadly:
- Catastrophic backtracking on certain inputs
- Input: x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x=x!
- Regex engine tries billions of combinations
- Single request takes seconds of CPU time
The Cascading Failure
New WAF rule deployed globally
↓
User sends request matching bad regex
↓
Server CPU spikes to 100%
↓
All requests on that server slow down
↓
Same pattern hits other servers
↓
Global CPU at 100%
↓
Requests timeout
↓
50% of global traffic fails
What Went Wrong
- No regex performance testing
- Global deployment (no canary)
- Insufficient CPU monitoring during deployment
- No automatic rollback on performance degradation
- Regex complexity not reviewed
What Cloudflare Fixed
- Regex performance testing in CI/CD
  test('regex performance', () => {
    const start = Date.now();
    const result = regex.test(evilInput);
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(100); // 100ms max
  });
- Canary deployments for rule changes
  - Deploy to 1% of servers
  - Monitor CPU, latency, error rates
  - Gradual rollout if healthy
- Automatic rollback triggers
  if (cpu_usage > 90% for 60 seconds) {
    rollback()
  }
- CPU budgets for regex
  - Kill regex execution after 50ms
  - Return error rather than hang
- RE2 library (guaranteed linear time)
  - No catastrophic backtracking
  - Predictable performance
The Lesson for You
Test performance, not just functionality. Deploy gradually.
Apply this:
- Performance test regex patterns
- Use safe regex libraries (re2)
- Implement timeouts for expensive operations
- Deploy gradually (canary deployments)
- Monitor resource usage during deployments
- Automatic rollback on anomalies
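The canary-plus-rollback loop is small enough to sketch directly. `deploy`, `rollback`, and `get_cpu` are hypothetical hooks into your own infrastructure:

```python
def canary_rollout(stages, get_cpu, deploy, rollback, cpu_limit=90):
    """Deploy to increasing percentages; roll back on CPU regression."""
    for pct in stages:               # e.g. (1, 10, 50, 100)
        deploy(pct)                  # push the change to pct% of servers
        if get_cpu() > cpu_limit:    # health check after each stage
            rollback()
            return f"rolled back at {pct}%"
    return "fully deployed"
```

A real trigger would watch latency and error rates over a time window, not a single CPU sample, but the control flow is the same.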
CASE STUDY 5: Slack’s Database Migration (2023)
The Challenge
Migrate from sharded MySQL to Vitess without downtime for millions of users.
The Stakes
- 100+ million users
- Zero downtime tolerance
- Petabytes of data
- Tens of thousands of queries per second
The Strategy: Dual Writes
Phase 1: Preparation (3 months)
├── Build Vitess cluster
├── Set up replication
└── Test tooling
Phase 2: Shadow Traffic (2 months)
├── Mirror reads to Vitess
├── Compare results
└── Fix discrepancies
Phase 3: Dual Writes (4 months)
├── Write to BOTH MySQL and Vitess
├── Read from MySQL (source of truth)
├── Compare data continuously
└── Fix inconsistencies
Phase 4: Read Cutover (1 month)
├── Shift reads to Vitess gradually
├── 1% → 10% → 50% → 100%
├── Monitor latency and errors
└── Rollback capability maintained
Phase 5: Write Cutover (2 weeks)
├── Make Vitess source of truth
├── Stop dual writes
└── Decommission old MySQL
Total time: 12 months
The Technical Implementation
// Dual write with verification
func WriteUser(user User) error {
    // Write to new system (Vitess)
    errNew := vitess.Write(user)
    // Write to old system (MySQL)
    errOld := mysql.Write(user)
    // Both must succeed
    if errNew != nil || errOld != nil {
        return fmt.Errorf("dual write failed")
    }
    // Async verification
    go verifyConsistency(user.ID)
    return nil
}

// Gradual read migration
func ReadUser(userID string) (User, error) {
    // Traffic splitting
    if shouldReadFromVitess(userID) {
        user, err := vitess.Read(userID)
        if err != nil {
            // Fallback to MySQL
            return mysql.Read(userID)
        }
        return user, nil
    }
    return mysql.Read(userID)
}

// Percentage-based rollout
func shouldReadFromVitess(userID string) bool {
    hash := hash(userID)
    return (hash % 100) < currentRolloutPercentage
}
What Went Right
- Gradual migration (not big bang)
- Continuous verification of data consistency
- Multiple rollback points
- Extensive monitoring at each phase
- Dark launch (shadow traffic before real traffic)
- Clear success criteria for each phase
The Lesson for You
Large migrations require patience, verification, and rollback plans.
Apply this:
- Never do big-bang migrations
- Implement dual writes for safety
- Verify data consistency continuously
- Use percentage-based rollouts
- Maintain rollback capability throughout
- Have clear success metrics for each phase
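A Python mirror of the Go rollout snippet above (the choice of hash is an assumption; any stable hash works). The key property: buckets are sticky, so raising the percentage only ever adds users to the new path, never flips one back:

```python
import hashlib

def in_rollout(user_id: str, percentage: int) -> bool:
    """Deterministically place user_id into one of 100 buckets."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage   # percentage=10 admits buckets 0 through 9
```

Because the bucket depends only on the user ID, each user sees a consistent system throughout the cutover, which keeps the dual-write comparison meaningful.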
SECTION 2 — SCALING VICTORIES
CASE STUDY 6: Discord’s Elixir Migration
The Problem
Discord hit 5 million concurrent users and its Rails backend couldn’t keep up.
Symptoms:
- Message latency: 200-500ms
- Server CPU: 80-90%
- Growing infrastructure costs
- Degraded user experience
The Decision
Migrate from Ruby on Rails to Elixir/Phoenix.
Why Elixir?
- Built on Erlang VM (designed for concurrency)
- Actor model (isolated processes)
- Fault tolerance (let it crash philosophy)
- Soft real-time guarantees
The Results
Before (Rails):
- 5 million concurrent users
- 40+ servers
- 200-500ms latency
- Frequent outages
After (Elixir):
- 50+ million concurrent users (10x)
- 12 servers (70% reduction)
- 10-50ms latency (90% improvement)
- Near-zero outages
Cost savings: ~$1 million/year
The Implementation Strategy
Phase 1: Proof of Concept
├── Build message delivery in Elixir
├── Run parallel with Rails
└── Compare performance
Phase 2: Critical Path Migration
├── Migrate message sending
├── Migrate presence system
└── Keep everything else in Rails
Phase 3: Gradual Feature Migration
├── Voice/video calling
├── User profiles
├── Guild management
└── Analytics
Phase 4: Complete Migration
└── Decommission Rails entirely
Key Technical Insights
The Elixir Advantage:
# Each user connection is an isolated process
defmodule UserConnection do
  use GenServer

  def handle_info({:message, msg}, state) do
    # Process runs independently
    # Crash doesn't affect others
    # Automatic restart by supervisor
    send_to_client(msg)
    {:noreply, state}
  end
end

# Supervisor ensures fault tolerance
children = [
  {DynamicSupervisor, strategy: :one_for_one, name: UserSupervisor}
]

# Millions of processes, each isolated
# If one crashes, others unaffected
The Lesson for You
Choose technology that matches your scale requirements, not what’s popular.
Apply this:
- Understand your bottlenecks before choosing solutions
- Consider concurrency model for real-time systems
- Run proof of concept before full migration
- Migrate critical path first, iterate on the rest
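The "isolated processes" idea has a rough analogue in most runtimes. A minimal Python/asyncio sketch (`handle` and `supervisor` are illustrative names, nothing Discord-specific) showing that one crashed connection handler does not take down its siblings:

```python
import asyncio

# Rough analogue of Elixir's isolated processes: each connection gets its
# own task, and a crash in one is captured as a value instead of
# propagating to the others.
async def handle(user, fail=False):
    if fail:
        raise RuntimeError(f"{user} crashed")
    return f"{user} ok"

async def supervisor(users):
    # return_exceptions=True isolates failures; a real supervisor
    # would also restart the failed handler
    return await asyncio.gather(
        *(handle(u, fail=(u == "bad")) for u in users),
        return_exceptions=True,
    )

results = asyncio.run(supervisor(["a", "bad", "c"]))
```

The BEAM goes much further (preemptive scheduling, per-process heaps, supervision trees), but the fault-isolation contract is the part worth copying anywhere.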
CASE STUDY 7: Instagram’s Python at Scale
The Challenge
“Python doesn’t scale” — so how does Instagram serve 2+ billion users with Python?
The Truth
Instagram runs one of the largest Python deployments in the world.
Scale:
- 2+ billion users
- 100+ million photos per day
- Thousands of servers
- Python everywhere (Django)
How They Made Python Scale
1. Master the Bottlenecks (It’s Not Python)
# The bottleneck is rarely Python itself. Usually it's:
# - Database queries (fixed with caching)
# - Network I/O (fixed with async)
# - Algorithmic complexity (fixed with better algorithms)

# Example: Feed generation
# Bad: O(posts x all followers) -- checks every follower for every post
for post in posts:
    for user in followers:
        if should_show(user, post):
            feeds[user].append(post)

# Good: an index means each post only touches its author's followers
for post in posts:
    followers = follower_index[post.author]
    for follower in followers:
        feeds[follower].append(post)
2. Aggressive Caching
# Cache layers
┌─────────────────────────┐
│ Client Cache (60s) │
├─────────────────────────┤
│ CDN (300s) │
├─────────────────────────┤
│ Memcached (3600s) │
├─────────────────────────┤
│ Database │
└─────────────────────────┘
# 99% of requests never hit database
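The pattern behind those layers is cache-aside: check the cache, fall through to the backend on a miss, and store the result with a TTL. A generic in-process sketch (a dict standing in for Memcached; nothing Instagram-specific):

```python
import time

def make_cached(loader, ttl_seconds, clock=time.monotonic):
    """Wrap a loader (the 'database') in a TTL cache."""
    cache = {}

    def get(key):
        hit = cache.get(key)
        if hit and clock() - hit[1] < ttl_seconds:
            return hit[0]              # fresh: never touches the backend
        value = loader(key)            # miss or expired: hit the backend
        cache[key] = (value, clock())
        return value

    return get
```

Each layer in the diagram is this same idea with a different store and TTL; the 99% figure falls out of stacking them.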
3. Asynchronous Processing
# Don't do expensive work in the request -- queue it

@app.route('/post/photo')
def post_photo(request):
    photo = request.files['photo']
    # Quick: Save to S3, return immediately
    photo_url = upload_to_s3(photo)
    photo_id = db.create_photo(photo_url)
    # Slow: Queue for processing
    queue.publish('photo.process', {
        'photo_id': photo_id,
        'tasks': ['resize', 'filter', 'face_detection']
    })
    return {'photo_id': photo_id}  # Fast response

# Worker handles the slow tasks
@worker.subscribe('photo.process')
def process_photo(message):
    resize_photo(message['photo_id'])
    apply_filters(message['photo_id'])
    detect_faces(message['photo_id'])
4. Django Optimization
# Use select_related to avoid N+1 queries

# Bad: 1 + N queries
posts = Post.objects.all()
for post in posts:
    print(post.author.username)  # New query each time!

# Good: 1 query with a JOIN
posts = Post.objects.select_related('author').all()
for post in posts:
    print(post.author.username)  # No extra query

# Use prefetch_related for many-to-many
posts = Post.objects.prefetch_related('likes').all()
5. Profiling Everything
import cProfile

# Profile slow endpoints
@app.route('/feed')
@profile_this
def get_feed(request):
    # Instagram profiles every endpoint,
    # identifies bottlenecks,
    # and optimizes aggressively
    pass
The Results
- Serves 2+ billion users with Python
- Latency: p99 < 500ms
- Cost-effective operation
- Developer velocity remains high
The Lesson for You
Language choice matters less than architecture, caching, and algorithm quality.
Apply this:
- Profile before optimizing
- Cache aggressively
- Use async for I/O
- Optimize database queries
- Don’t blame the language, fix the architecture
CASE STUDY 8: Shopify’s Black Friday/Cyber Monday
The Challenge
Handle 10,000+ requests per second during peak shopping days.
Peak Load Stats (2023)
- 76 million shoppers
- $9.3 billion in sales
- Peak: 11,000 requests/second
- Zero downtime
How They Prepared
1. Load Testing (3 months before)
# Simulate peak load at 3x expected traffic
# to identify breaking points
LoadTest.configure do |config|
  config.target_rps = 30_000   # 3x expected peak
  config.duration = '1 hour'
  config.ramp_up = '5 minutes'
end
# Run weekly, fix bottlenecks
2. Database Optimization
# Sharding strategy: each merchant lives on a specific shard,
# so no single database can be overloaded
class Merchant < ApplicationRecord
  shard_key :id

  def shard_id
    id % 1000   # 1000 shards
  end
end

# Read replicas scale reads horizontally (10+ per shard)
primary = Database.primary
replicas = Database.replicas

# Route reads to replicas
def find_product(id)
  replicas.sample.query("SELECT * FROM products WHERE id = ?", id)
end
3. Caching Everything
# Cache hit rate: 99%+
# Layers:
#   1. Browser cache (static assets)
#   2. CDN (Fastly)
#   3. Application cache (Redis)
#   4. Database query cache

# Example: Product page
def product_page(product_id)
  cache_key = "product:#{product_id}:v2"
  Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
    Product.find(product_id).to_json
  end
end
4. Queue Everything Non-Critical
# Request time budget: 200ms; anything slower gets queued
def create_order(params)
  # Fast: Create order record
  order = Order.create!(params)          # ~50ms
  # Slow: Queue everything else
  OrderProcessingJob.perform_later(order.id)
  EmailJob.perform_later(order.id)
  AnalyticsJob.perform_later(order.id)
  # Return quickly
  { order_id: order.id }                 # Total: ~50ms
end
5. Graceful Degradation
# If the system is overloaded, disable non-critical features
def feature_enabled?(feature)
  return true unless system_load > 80   # everything on under normal load

  case feature
  when :recommendations
    false   # shed recommendations under load
  when :reviews
    false   # shed reviews under load
  when :checkout
    true    # ALWAYS keep checkout working
  else
    true
  end
end
6. Real-Time Monitoring
- Engineers watching dashboards live
- Automated scaling (if load > 70%, add servers)
- Kill switches for non-essential features
- On-call team ready to respond
The Results
Zero downtime during peak shopping period.
The Lesson for You
Peak load requires months of preparation, aggressive caching, and graceful degradation.
Apply this:
- Load test at 3x expected traffic
- Cache everything possible
- Queue non-critical operations
- Plan for graceful degradation
- Monitor in real-time during peak
- Have kill switches ready
SECTION 3 — ARCHITECTURAL TRANSFORMATIONS
CASE STUDY 9: Netflix’s Microservices Journey
The Evolution
2008: Monolith on Physical Servers
- Single large application
- Takes 16 minutes to start
- Deploys once every 2 weeks
- Outages take entire site down
2012: Migration to AWS + Microservices
- 800+ microservices
- Deploy 1000s of times per day
- Independent scaling
- Fault isolation
2024: 1000+ Microservices
- Extreme resilience
- Global scale
- Chaos engineering
- 200+ million subscribers
Why They Did It
Monolith Problems:
┌─────────────────────────────────────┐
│ SINGLE MONOLITH │
│ ┌──────────────────────────────┐ │
│ │ Recommendations │ │
│ │ Streaming │ │
│ │ User Management │ │
│ │ Billing │ │
│ │ Content Management │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
Problems:
- Deploy ALL or nothing
- Bug in one area breaks everything
- Can't scale parts independently
- Slow development (team conflicts)
Microservices Benefits:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Recommendations│ │ Streaming │ │ Billing │
└──────────────┘ └──────────────┘ └──────────────┘
↓ ↓ ↓
Deploy Deploy Deploy
independently independently independently
↓ ↓ ↓
Scale Scale Scale
independently independently independently
The Migration Strategy
Phase 1: API Gateway
Extract API layer from monolith
Route traffic through gateway
Gradually move endpoints to services
Phase 2: Extract Services One-by-One
Priority order:
1. User authentication (most traffic)
2. Recommendations (frequently changing)
3. Billing (critical, rarely changes)
4. Content management
Phase 3: Data Migration
Each service gets its own database
No shared database between services
Use events for data sync
Key Patterns They Developed
1. Circuit Breaker Pattern
// Sketch: if the recommendation service is down, return default
// recommendations instead of letting the failure cascade
@HystrixCommand(fallbackMethod = "defaultRecommendations")
public List<Movie> getRecommendations(User user) {
    return recommendationClient.fetch(user); // illustrative call; may fail
}
2. Chaos Engineering
Randomly kill services in production
Test: Does system survive?
Result: Build resilient systems
Tool: Chaos Monkey
3. Resilience Library (Hystrix)
Every service call goes through:
- Circuit breaker
- Retry logic
- Timeout enforcement
- Metrics collection
The Challenges
- Distributed Tracing
  - Hard to debug across 1000+ services
  - Solution: invested heavily in distributed-tracing tooling
- Data Consistency
  - No ACID transactions across services
  - Solution: Event-driven architecture, eventual consistency
- Operational Complexity
  - 1000+ services to monitor
  - Solution: Extensive automation, self-healing systems
The Results
- Uptime: 99.99%
- Deploy 4000+ times per day
- Independent team velocity
- Survived AWS outages (multi-region)
The Lesson for You
Microservices solve organizational problems, not just technical ones.
When to use microservices:
- Large team (50+ engineers)
- Need independent scaling
- Different parts change at different rates
- Need fault isolation
When NOT to use microservices:
- Small team (< 10 engineers)
- Simple application
- Limited operational expertise
- Tight coupling between domains
CASE STUDY 10: Stripe’s API Versioning
The Challenge
How do you evolve an API used by millions of businesses without breaking them?
The Problem
Stripe processes $1 trillion annually. Any breaking change could:
- Break customer integrations
- Cause financial loss
- Damage reputation
- Trigger legal issues
The Solution: Frozen API Versions
Key Principle: Once an API version is released, it never changes.
API Version 2020-08-27
↓
Frozen forever
↓
No breaking changes allowed
↓
New features → new version
How It Works
1. Customer Pins to Version
curl https://api.stripe.com/v1/charges \
-H "Stripe-Version: 2020-08-27" \
-d amount=2000
2. Stripe Maintains All Versions
// Stripe's backend (conceptually)
function createCharge(data, version) {
  switch (version) {
    case '2020-08-27':
      return createCharge_2020_08_27(data);
    case '2022-11-15':
      return createCharge_2022_11_15(data);
    case '2024-01-01':
      return createCharge_2024_01_01(data);
  }
}
3. Translation Layer
// Convert between versions
function translate(data, from, to) {
  // 2020 format
  if (from === '2020-08-27' && to === '2024-01-01') {
    return {
      amount: data.amount,
      currency: data.currency,
      // New field in 2024
      metadata: data.metadata || {},
      // Renamed field
      payment_method: data.source
    };
  }
}
Migration Strategy
Stripe doesn’t force upgrades. They incentivize:
- Deprecation Warnings
  {
    "warning": "This API version will be sunset on 2025-01-01",
    "upgrade_guide": "https://stripe.com/docs/upgrades/2024-01-01"
  }
- Feature Gating
  - New features only in new versions
  - Customers upgrade to get features
- Dashboard Notifications
  - “You’re using an old API version”
  - “Upgrade to get [new feature]”
- Gradual Sunset (3+ year notice)
  Year 1: Announce deprecation
  Year 2: Remind customers
  Year 3: Final warning
  Year 4: Sunset (with migration support)
The Technical Cost
- Maintain 10+ API versions simultaneously
- Complex translation layer
- Extensive testing (all versions)
- Large engineering investment
The Business Value
- Zero customer breakage
- Trust and reliability
- Enterprises choose Stripe because of API stability
- Competitive advantage
The Lesson for You
API versioning is expensive but necessary for business-critical APIs.
Apply this:
- Version your API from day one
- Never break existing versions
- Provide upgrade paths
- Give long deprecation notices (12+ months)
- Make migrations easy (auto-migration tools)
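The translation layer above, condensed into a runnable sketch. The field rename (`source` to `payment_method`) follows the JavaScript snippet; the version strings and everything else are illustrative:

```python
def translate_charge(params, version):
    """Normalize a request pinned to an old version into the current shape."""
    if version == "2020-08-27":
        return {
            "amount": params["amount"],
            "currency": params["currency"],
            "payment_method": params["source"],      # renamed field
            "metadata": params.get("metadata", {}),  # newer field, defaulted
        }
    if version == "2024-01-01":
        return dict(params)   # already the current shape
    raise ValueError(f"unknown API version: {version}")
```

Translating at the edge means core logic only ever sees one format, which is what makes supporting 10+ frozen versions tractable.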
SECTION 4 — DEBUGGING WAR STORIES
CASE STUDY 11: The Leap Second Bug (Cloudflare, 2017)
The Incident
Date: December 31, 2016, 23:59:60
Duration: Intermittent issues for hours
The Bug
A leap second caused time to go backwards, breaking systems worldwide.
Normal:
23:59:58
23:59:59
00:00:00 ← New year
With Leap Second:
23:59:58
23:59:59
23:59:60 ← Extra second!
00:00:00
Some systems:
23:59:59
00:00:00
23:59:59 ← Time went backwards!
The Impact
// This code broke (original compared times with >, which Go doesn't allow;
// the intent was wall-clock comparison)
if time.Now().After(lastProcessedTime) {
    processEvent()
    lastProcessedTime = time.Now()
}

// When time went backwards:
// - Events were processed twice
// - Race conditions appeared
// - Locks deadlocked
The Fix
// Use monotonic time for comparisons
start := time.Now()
// ... do work ...
duration := time.Since(start) // uses the monotonic clock; never negative

// Never use wall-clock time for ordering.
// Use sequence numbers instead:
eventID := nextSequenceNumber() // always increasing
The Lesson
Never use wall clock time for ordering events. Use monotonic time or sequence numbers.
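In Python the same two fixes look like this: `time.monotonic()` for durations (immune to leap seconds and NTP jumps) and a sequence counter for ordering:

```python
import itertools
import time

# Durations: the monotonic clock can never go backwards
start = time.monotonic()
time.sleep(0.01)                         # ... do work ...
elapsed = time.monotonic() - start       # always >= 0

# Ordering: sequence numbers, independent of any clock
seq = itertools.count()
event_a, event_b = next(seq), next(seq)  # strictly increasing
```

`time.time()` offers neither guarantee; reach for it only when you actually need the wall-clock date.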
CASE STUDY 12: The DNS Mystery (Etsy)
The Incident
Random requests failing. Pattern made no sense.
Symptoms:
- 0.1% of requests fail
- No pattern by user, endpoint, or time
- No correlation with load
- Error: “Cannot resolve hostname”
The Investigation
# Check DNS
$ dig api.etsy.com            # works fine
# Check application logs
[ERROR] getaddrinfo: Name or service not known
# Check system limits
$ ulimit -n                   # 1024 (max file descriptors)
# Count open file descriptors
$ ls /proc/<pid>/fd | wc -l   # 1023 files open!
The Root Cause
Every DNS lookup opens a socket, and every socket consumes a file descriptor.
Application has 1024 FD limit
↓
1023 files/sockets already open
↓
DNS query needs 1 FD
↓
No FDs available
↓
DNS query fails
↓
Random request fails
The Fix
# Increase file descriptor limit
ulimit -n 65535
# Make permanent
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
The Lesson
System limits (file descriptors, ports, memory) cause mysterious failures.
Check these:
ulimit -a # All limits
lsof | wc -l # Open files
netstat -an | wc -l # Network connections
free -h # Memory
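From inside a process, the Unix-only `resource` module reads the same limit that `ulimit -n` shows, which makes a useful startup sanity check:

```python
import resource

# RLIMIT_NOFILE caps open files AND sockets; DNS lookups need a socket,
# so exhausting it makes name resolution fail mysteriously.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft} hard={hard}")
# An unprivileged process may raise its own soft limit up to the hard cap:
#   resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Logging this at boot (and alerting when open-descriptor count approaches the soft limit) turns the Etsy mystery into a routine dashboard alarm.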
SECTION 5 — WHAT YOU SHOULD REMEMBER
Universal Lessons from These Stories
1. Design for Failure
- Network WILL partition
- Services WILL go down
- Commands WILL be mistyped
- Regex WILL catastrophically backtrack
2. Deploy Gradually
- Start with 1%
- Monitor closely
- Have rollback ready
- Automate rollback triggers
3. Verify Everything
- Deployments completed
- Data consistency
- Performance metrics
- Resource limits
4. Limit Blast Radius
- Confirmation for dangerous operations
- Rate limits on actions
- Circuit breakers for failures
- Independent subsystems
5. Monitor Actively
- Real-time dashboards
- Automated alerts
- Anomaly detection
- Meaningful metrics
6. Have Rollback Plans
- Every deployment
- Every migration
- Every configuration change
- Every feature flag
7. Remove Dead Code
- If unused, delete it
- Don’t disable, remove
- Technical debt kills
8. Test at Scale
- Load test at 3x expected
- Test failure scenarios
- Test gradual rollouts
- Test rollbacks
Final Exercise
Study Your Own War Stories
Document your production incidents:
# Incident: [Name]
Date:
Duration:
Impact:
## What Happened
[Narrative timeline]
## Root Cause
[The fundamental issue]
## What We Did Wrong
[Honest assessment]
## What We Fixed
[Permanent fixes]
## What We Learned
[Lessons for future]
Then share with your team. Blameless postmortems build better engineers.
This completes PART XII — Real-World Case Studies & War Stories.
The best engineers learn from others’ mistakes. You now have a decade of hard-won lessons in one document.