Part III (g) - Tradeoff Calculus for System Design
HARD TRUTH: ARCHITECTURE IS TRADEOFFS, NOT DIAGRAMS
War-story pattern: teams with beautiful diagrams still melt in production because no one wrote down the tradeoffs.
Top engineers make tradeoffs explicit before code starts. They decide what to optimize now, what to defer, and what to protect at all costs.
Your job is not to maximize every dimension. Your job is to pick the right losses deliberately.
THE FIVE-AXIS TRADEOFF MODEL
For each major architecture decision, score each option on:
- Latency
- Consistency
- Cost
- Complexity
- Team velocity
Use -2 to +2:
+2: strong advantage
+1: moderate advantage
0: neutral
-1: moderate cost
-2: heavy cost
Example decision: synchronous write path vs async write with queue.
- Sync path: better consistency, worse latency and resilience under spikes.
- Async path: better spike handling, more complexity and eventual consistency risk.
The score does not make the decision for you. It forces you to see the real shape of the decision.
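To make the scoring concrete, here is a minimal Python sketch of the five-axis model applied to the sync vs async example. The `Option` class, the `render` helper, and every numeric score are illustrative assumptions, not measurements; resilience under spikes is not one of the five axes, so it does not appear in the scores. The sketch deliberately avoids summing scores into a single number and lists gains and losses instead.

```python
from dataclasses import dataclass

AXES = ("latency", "consistency", "cost", "complexity", "team_velocity")


@dataclass(frozen=True)
class Option:
    """One candidate design, scored -2..+2 on each of the five axes."""
    name: str
    scores: dict  # axis name -> integer score

    def __post_init__(self):
        if set(self.scores) != set(AXES):
            raise ValueError(f"{self.name}: score every axis exactly once")
        for axis, value in self.scores.items():
            if value not in (-2, -1, 0, 1, 2):
                raise ValueError(f"{self.name}: {axis} must be in -2..+2")

    def gains(self):
        """Axes where this option is an advantage."""
        return [a for a in AXES if self.scores[a] > 0]

    def losses(self):
        """Axes where this option is a cost you are choosing to accept."""
        return [a for a in AXES if self.scores[a] < 0]


def render(options):
    """Side-by-side text table so the shape of the decision is visible."""
    rows = [f"{'axis':<14}" + "".join(f"{o.name:>8}" for o in options)]
    for axis in AXES:
        rows.append(f"{axis:<14}" + "".join(f"{o.scores[axis]:>+8d}" for o in options))
    return "\n".join(rows)


# Hypothetical scores for illustration only.
sync_path = Option("sync", {
    "latency": -1, "consistency": 2, "cost": 0,
    "complexity": 1, "team_velocity": 0,
})
async_queue = Option("async", {
    "latency": 1, "consistency": -1, "cost": 0,
    "complexity": -2, "team_velocity": -1,
})

if __name__ == "__main__":
    print(render([sync_path, async_queue]))
    print("sync losses: ", sync_path.losses())
    print("async losses:", async_queue.losses())
```

Listing losses per option keeps the focus on which costs the team is accepting, rather than on which option "wins" a total.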
DECISION MATRIX TEMPLATE
Use one matrix per major decision:
- Decision name
- Decision owner
- Date and review date
- Constraints and assumptions
- Options considered
- Five-axis scores
- Risks and mitigations
- Reversal cost
- Final decision and rationale
This converts design discussion from opinion to traceable reasoning.
If six months later the decision looks wrong, the matrix tells you whether assumptions changed or execution failed.
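One way to keep the template honest is to treat it as a record with required fields. The sketch below is a hypothetical Python representation: the `DecisionMatrix` class, its field names, and the `missing_fields` and `review_due` helpers are assumptions chosen to mirror the list above, not a prescribed schema.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class DecisionMatrix:
    """One record per major decision; field names mirror the template."""
    name: str = ""
    owner: str = ""
    decided_on: Optional[date] = None
    review_on: Optional[date] = None
    constraints_and_assumptions: list = field(default_factory=list)
    options_considered: list = field(default_factory=list)
    five_axis_scores: dict = field(default_factory=dict)   # option -> {axis: score}
    risks_and_mitigations: dict = field(default_factory=dict)  # risk -> mitigation
    reversal_cost: str = ""
    final_decision: str = ""
    rationale: str = ""

    def missing_fields(self):
        """Names of template fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def review_due(self, today):
        """True once the agreed review date has arrived."""
        return self.review_on is not None and today >= self.review_on


if __name__ == "__main__":
    draft = DecisionMatrix(
        name="Order write path",
        owner="payments-team",          # hypothetical owner
        decided_on=date(2024, 1, 15),   # hypothetical dates
        review_on=date(2024, 7, 15),
        options_considered=["sync", "async"],
    )
    print("still to fill in:", draft.missing_fields())
    print("review due:", draft.review_due(date(2024, 8, 1)))
```

Recording constraints and assumptions as explicit fields is what lets a later review separate "assumptions changed" from "execution failed".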
DEFAULT PATTERNS FOR GOOD JUDGMENT
Use these defaults unless data forces a different move:
- Prefer reversible decisions early.
- Keep complexity localized, not distributed across all services.
- Buy strong consistency only where correctness demands it.
- Optimize reliability before micro-optimizing peak performance.
- Keep ownership boundaries clear so teams can evolve independently.
Field rule: defaults are not dogma, but they save teams from expensive early mistakes when uncertainty is high.
FAILURE MODE CHECK
Before finalizing a design, run a quick failure pre-mortem:
- Top 3 failure modes
- Earliest detection signal for each
- Blast radius estimate
- Containment and rollback strategy
- Owner during incident
If this section is weak, your design is not production-ready yet.
A design without failure thinking is an optimistic diagram, not a system you can operate.
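The pre-mortem can be checked mechanically for completeness, though not for quality. The following Python sketch assumes a hypothetical `FailureMode` record and a `premortem_gaps` helper, and it reads "Top 3 failure modes" as "at least three entries"; adjust both to your own review process.

```python
from dataclasses import dataclass, fields


@dataclass
class FailureMode:
    """One row of the failure pre-mortem; all fields are free text."""
    description: str
    detection_signal: str = ""
    blast_radius: str = ""
    containment_and_rollback: str = ""
    incident_owner: str = ""


def premortem_gaps(modes, required_modes=3):
    """Return human-readable gaps; an empty list means the check is complete."""
    gaps = []
    if len(modes) < required_modes:
        gaps.append(f"only {len(modes)} of {required_modes} failure modes listed")
    for mode in modes:
        for f in fields(mode):
            if not getattr(mode, f.name).strip():
                gaps.append(f"{mode.description or '<unnamed>'}: missing {f.name}")
    return gaps


if __name__ == "__main__":
    # Hypothetical entries for illustration.
    modes = [
        FailureMode(
            description="queue backlog grows unbounded",
            detection_signal="consumer lag alert",
            blast_radius="delayed order confirmations",
            containment_and_rollback="pause producers, drain, revert deploy",
            incident_owner="on-call for order service",
        ),
        FailureMode(description="stale cache served at checkout"),
    ]
    for gap in premortem_gaps(modes):
        print("GAP:", gap)
```

A non-empty result is a signal that the design is not ready for review sign-off; an empty result only shows the form was filled in, so the entries still need human scrutiny.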
CHANGE TRIGGERS
Every decision must include trigger conditions for re-evaluation:
- Traffic or data growth threshold crossed
- SLO misses for two consecutive releases
- Cost exceeds budget by agreed percentage
- Incident frequency increases beyond baseline
- Team cognitive load blocks delivery
This prevents two classic failure modes:
- Keeping a design too long after constraints changed
- Re-architecting too early without evidence
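Triggers work best when they are written as checks against numbers agreed in advance. The sketch below is one possible Python encoding of the five conditions above. The `Snapshot` and `Thresholds` types, their field names, and all example values are assumptions; cognitive load is represented as a plain boolean because the text gives no way to measure it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Limits agreed when the decision was made (hypothetical fields)."""
    max_peak_rps: float
    budget: float
    budget_overrun_pct: float   # agreed tolerance, e.g. 20 means 20%
    incident_baseline: int


@dataclass(frozen=True)
class Snapshot:
    """Current observations for one system (hypothetical fields)."""
    peak_rps: float
    slo_met_by_release: tuple   # oldest first, True if the release met its SLO
    monthly_cost: float
    incidents_this_period: int
    delivery_blocked_by_cognitive_load: bool


def fired_triggers(s, t):
    """Return the names of every re-evaluation trigger that has fired."""
    fired = []
    if s.peak_rps > t.max_peak_rps:
        fired.append("growth threshold crossed")
    last_two = s.slo_met_by_release[-2:]
    if len(last_two) == 2 and not any(last_two):
        fired.append("SLO missed for two consecutive releases")
    if s.monthly_cost > t.budget * (1 + t.budget_overrun_pct / 100):
        fired.append("cost over budget beyond agreed percentage")
    if s.incidents_this_period > t.incident_baseline:
        fired.append("incident frequency above baseline")
    if s.delivery_blocked_by_cognitive_load:
        fired.append("cognitive load blocks delivery")
    return fired


if __name__ == "__main__":
    limits = Thresholds(max_peak_rps=5000, budget=10000,
                        budget_overrun_pct=20, incident_baseline=3)
    now = Snapshot(peak_rps=4200, slo_met_by_release=(True, False, False),
                   monthly_cost=13000, incidents_this_period=2,
                   delivery_blocked_by_cognitive_load=False)
    print(fired_triggers(now, limits))
```

An empty result is evidence against re-architecting now; a non-empty result is evidence that the original constraints no longer hold.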
War-Story Mini-Case: Redis Everywhere Backfired
Timeline:
- Week 0: Team adds Redis caching to almost every read endpoint; median latency drops from 280ms to 90ms.
- Week 2: Stale-order incidents appear in checkout and order-history flows.
- Week 3: Incident review shows invalidation logic duplicated across four services.
- Week 4: Cache scope reduced to true hot paths; invalidation ownership moved to one boundary service.
- Week 6: Trigger policy added: design review required if p95 > 220ms before broad caching changes.
Key decisions:
- Rejected broad cache expansion in favor of bounded cache ownership.
- Prioritized correctness over benchmark-driven latency wins.
- Added explicit re-evaluation trigger to prevent reactive architecture changes.
Outcome:
- p95 settled at 130ms.
- Correctness incidents dropped sharply, with predictable cache behavior.
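The Week 6 trigger policy from the timeline can be expressed as a small gate. This Python sketch assumes latency samples are available in milliseconds and uses the nearest-rank method for p95; the function names and the choice of percentile method are assumptions, since the case does not say how the team measured it.

```python
def p95(samples_ms):
    """95th percentile by the nearest-rank method (no interpolation)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def requires_design_review(samples_ms, threshold_ms=220):
    """Broad caching changes need a design review only if p95 exceeds the threshold."""
    return p95(samples_ms) > threshold_ms


if __name__ == "__main__":
    # Hypothetical samples: 90 fast requests and 10 slow ones.
    samples = [100] * 90 + [300] * 10
    print("p95:", p95(samples), "ms")
    print("design review required:", requires_design_review(samples))
```

Gating on a measured p95 rather than on median latency keeps the policy tied to tail behavior, which is where the earlier benchmark-driven win hid its cost.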
OUTPUT ARTIFACT
For every major architecture decision, publish:
- One-page Architecture Decision Matrix
- ADR with rationale and alternatives
- Failure mode check summary
- Review date and change triggers
If you consistently produce these artifacts, architecture quality compounds release after release.