Part VIII (d) — Execution Artifacts (Design Docs, ADRs, Postmortems, Launch)
OUTPUT AT SENIOR LEVEL IS DOCUMENTED DECISION-MAKING
Seniors don't just ship code.
They ship:
- clarity
- alignment
- reversible decisions
- safe rollouts
Artifacts are how you create leverage.
When To Reach For Which Artifact
Use the smallest artifact that still creates sufficient clarity:
| Situation | Best Artifact |
|---|---|
| small local change | ticket or PR description |
| multi-service or risky change | design doc |
| durable architecture choice | ADR |
| incident learning | postmortem |
| risky release | launch checklist |
| repeated operational task | runbook |
Teams become document-heavy when they use the same artifact for every situation. They become chaotic when they use none.
GOLD-STANDARD DESIGN DOC (TEMPLATE)
Include:
- Problem + user impact
- Goals / non-goals
- Constraints
- Proposed solution (with diagrams)
- Alternatives + tradeoffs
- Failure modes
- Observability plan
- Rollout + rollback
- Open questions
- Owner + approver
When to write a design doc: a new feature spanning multiple services, an architecture change, a migration, or anything that affects more than one team or carries significant risk. For small, localized changes, a ticket description may suffice.
Full template with example filled-in sections:
# Design Doc: Real-time Collaboration for Document Editor
## 1. Problem & User Impact
Users editing the same document see stale content. Edits overwrite each other.
Impact: 40% of shared-doc sessions report "lost my changes" in support tickets.
## 2. Goals
- Multiple users see each other's edits within 2 seconds
- No lost updates (merge, don't overwrite)
- Works for 50 concurrent editors per doc
## 3. Non-Goals
- Offline editing (not in scope)
- Video/audio (separate initiative)
## 4. Constraints
- Must work with existing auth system
- No new infra for MVP (use existing WebSocket infra)
- Budget: 2 engineers, 6 weeks
## 5. Proposed Solution
[Diagram: Client → WebSocket → Broadcast Service → Clients]
- Use Operational Transform (OT) for conflict resolution
- WebSocket room per document; broadcast deltas
- Server is source of truth; clients apply OT locally
## 6. Alternatives Considered
| Option | Pros | Cons |
|--------|------|------|
| CRDTs | No server | Complex; larger payload |
| Last-write-wins | Simple | Lost updates |
| OT (chosen) | Proven, server control | Server complexity |
## 7. Failure Modes
- WebSocket disconnect → queue deltas; reconnect and sync
- Server crash → clients reconnect; full doc sync on join
## 8. Observability
- Metric: latency p99 for delta delivery
- Alert: rate of disconnects lasting >5s
- Dashboard: active editors per doc
## 9. Owner + Approval
- Owner: Collaboration team
- Tech lead approver: [Name]
- PM approver: [Name]
## 10. Rollout
- Feature flag: realtime_collab
- Phase 1: 5% internal, 2 weeks
- Phase 2: 50% beta, 2 weeks
- Phase 3: 100%
## 11. Rollback
- Disable flag → fall back to manual refresh
- No data migration to revert
- Trigger: rollback if disconnect rate > 3x baseline or document corruption is detected
## 12. Open Questions
- OT library choice: ShareDB vs custom?
- How to handle very large docs (100k+ chars)?
Tips on diagrams:
- Use ASCII or Mermaid for version control.
- Show data flow: where does data come from, where does it go?
- Call out failure points (e.g., "if WebSocket drops...").
- For complex flows: sequence diagrams (who does what when); for architecture: component diagrams (boxes and arrows).
Diagram types: Component diagram (services, DBs), sequence diagram (request flow), state diagram (status transitions), data flow (where data lives). Pick the right one for the question.
What makes a design doc get approved:
- Clear problem statement with metrics.
- Explicit tradeoffs; no "we'll figure it out later."
- Explicit owner and approver.
- Rollback plan. No rollback = no approval.
- Bounded scope. "Phase 1" is better than "everything."
Senior rule:
A design doc that doesn't answer "what if it fails?" is incomplete.
ADR (ARCHITECTURE DECISION RECORD)
One page:
- Context
- Decision
- Consequences
- Status (proposed/accepted/deprecated)
Senior rule:
ADRs prevent teams from re-litigating old decisions endlessly.
Complete ADR example:
# ADR-002: Use PostgreSQL for Event Sourcing
## Status
Accepted (2024-01-15)
## Context
We need to store audit events for compliance. Volume: ~500k events/day.
Options: PostgreSQL append-only table, Kafka, DynamoDB.
## Decision
Use PostgreSQL with an append-only `events` table. Partition by month.
No schema changes to existing rows.
## Consequences
- **Positive:** Single DB; familiar ops; ACID; simple queries.
- **Negative:** Scale limit ~10M events/month per partition; may need migration later.
- **Neutral:** No real-time stream; batch consumers only.
## Alternatives Rejected
- Kafka: Overkill for our volume; new infra.
- DynamoDB: Cost at scale; query patterns less natural.
When to write one:
- Choosing a database, framework, or architecture pattern.
- Deprecating a system or approach.
- Any decision that will be questioned in 6 months.
ADR lifecycle:
- Proposed → draft; discuss
- Accepted → decision made; implement
- Deprecated → superseded by new ADR; link to replacement
Tooling:
- adr-tools: CLI for creating/numbering ADRs.
- Markdown in repo: `docs/adr/` folder; link from README.
- No fancy tooling needed: Plain Markdown is enough.
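If you go the plain-Markdown route, numbering can be handled by a few lines of script. The following is a minimal sketch, not a replacement for adr-tools: the helper names (`next_adr_number`, `slugify`, `new_adr`), the three-digit filename prefix, and the template text are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical template; section names follow the ADR outline above.
ADR_TEMPLATE = """# ADR-{number:03d}: {title}
## Status
Proposed
## Context
## Decision
## Consequences
"""


def next_adr_number(existing_names):
    """Return one more than the highest numeric filename prefix (1 if none)."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := re.match(r"(\d+)-", name))
    ]
    return max(numbers, default=0) + 1


def slugify(title):
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def new_adr(adr_dir, title):
    """Create the next numbered ADR file in adr_dir and return its path."""
    adr_dir = Path(adr_dir)
    adr_dir.mkdir(parents=True, exist_ok=True)
    number = next_adr_number(p.name for p in adr_dir.glob("*.md"))
    path = adr_dir / f"{number:03d}-{slugify(title)}.md"
    path.write_text(
        ADR_TEMPLATE.format(number=number, title=title), encoding="utf-8"
    )
    return path
```

Calling `new_adr("docs/adr", "Use PostgreSQL for Event Sourcing")` in an empty folder would create `001-use-postgresql-for-event-sourcing.md` with status "Proposed".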
INCIDENT POSTMORTEM (TEMPLATE)
- Summary
- Impact
- Timeline
- Root cause
- Contributing factors
- What went well
- What didn't
- Action items (owners + due dates)
Avoid blame. Optimize learning.
Detailed timeline example:
2024-02-03 14:32 User reports "payments not loading"
2024-02-03 14:35 On-call paged; checks dashboard
2024-02-03 14:38 Payment API p99 latency spike detected
2024-02-03 14:40 DB connection pool exhausted (alerts)
2024-02-03 14:42 Rollback of deployment from 14:00 (new query)
2024-02-03 14:50 Connection pool recovers; latency normal
2024-02-03 14:55 All-clear; incident resolved
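A timeline in this format is easy to turn into the durations a postmortem usually reports. The sketch below assumes the `YYYY-MM-DD HH:MM description` layout used above; the function names and the choice of which rows count as "paged", "mitigated", and "resolved" are illustrative assumptions.

```python
from datetime import datetime

TIMELINE = """\
2024-02-03 14:32 User reports "payments not loading"
2024-02-03 14:35 On-call paged; checks dashboard
2024-02-03 14:38 Payment API p99 latency spike detected
2024-02-03 14:40 DB connection pool exhausted (alerts)
2024-02-03 14:42 Rollback of deployment from 14:00 (new query)
2024-02-03 14:50 Connection pool recovers; latency normal
2024-02-03 14:55 All-clear; incident resolved
"""


def parse_timeline(text):
    """Split each line into (timestamp, description)."""
    events = []
    for line in text.strip().splitlines():
        stamp, description = line[:16], line[17:]
        events.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M"), description))
    return events


def minutes_between(events, start_index, end_index):
    """Whole minutes elapsed between two timeline rows."""
    delta = events[end_index][0] - events[start_index][0]
    return int(delta.total_seconds() // 60)


events = parse_timeline(TIMELINE)
time_to_page = minutes_between(events, 0, 1)      # report -> on-call paged: 3
time_to_mitigate = minutes_between(events, 0, 4)  # report -> rollback: 10
time_to_resolve = minutes_between(events, 0, 6)   # report -> all-clear: 23
```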
5-whys analysis:
- Why did payments fail? → API timed out.
- Why did the API time out? → DB queries slow.
- Why slow? → New N+1 query in deployment.
- Why did it ship? → N+1 not caught in code review.
- Why not caught? → No query performance test in CI.
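The last "why" points at a missing control: a CI test that fails when a code path issues too many queries. One possible shape for such a test is sketched below using Python's built-in `sqlite3` module and its `set_trace_callback` hook. The schema, the loader functions, and the `QUERY_BUDGET` value are hypothetical; a real service would hook into its own database driver or ORM.

```python
import sqlite3

QUERY_BUDGET = 2  # hypothetical per-code-path limit; tune per team


def make_db():
    """Build a tiny in-memory database for the test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
    """)
    return conn


def load_orders_n_plus_one(conn):
    """One query for the orders, then one more per order (the N+1 pattern)."""
    orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
    return [
        (order_id,
         conn.execute("SELECT name FROM customers WHERE id = ?", (cid,)).fetchone()[0])
        for order_id, cid in orders
    ]


def load_orders_joined(conn):
    """Same result in a single query via a join."""
    return conn.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()


def count_selects(conn, fn):
    """Run fn(conn) and count the SELECT statements it executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        fn(conn)
    finally:
        conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))
```

A CI test would then assert `count_selects(conn, handler) <= QUERY_BUDGET` for each hot code path, so the N+1 version fails the build while the joined version passes.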
Blameless culture guidance:
- Use "we" not "they." "We shipped a slow query" not "Dev X broke it."
- Focus on process: what allowed the bug to reach production?
- Action items should improve systems, not punish people.
Action item tracking:
- Owner + due date for each item.
- Link to postmortem in ticket.
- Track in incident follow-up; close when done.
Postmortem distribution: Share with eng org; optionally with customers (sanitized). Store in incident database or wiki. Reference in future incidents: "Similar to INC-2024-02."
Senior rule:
The goal of a postmortem is to make the next incident less likely, not to assign blame.
LAUNCH CHECKLIST (SAFE SHIP)
- feature flag ready
- dashboards and alerts exist
- rollback plan rehearsed
- load test / capacity check
- security review complete
- on-call aware
- release owner named
- go/no-go decision maker named
- rollback triggers written in advance
Pre-launch phase (1–2 weeks before):
- Feature flag created; off by default.
- Dashboards: latency, error rate, business metrics.
- Alerts: threshold-based; escalation path defined.
- Load test: run at 2x expected traffic; fix bottlenecks.
- Security review: OWASP checklist; no critical vulns.
- Rollback plan: documented; team knows the steps.
- Runbook: what to do if X goes wrong.
- Release owner: one person accountable for the launch.
- Go/no-go decider: explicit name, not "the room decides."
- Rollback trigger: written threshold, e.g. "rollback if 5xx > 2% for 10 minutes."
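A written rollback trigger is most useful when it is precise enough to be evaluated mechanically. As a sketch, the example trigger "rollback if 5xx > 2% for 10 minutes" could be expressed as follows. The `MinuteSample` type, the per-minute sampling, and the reading of "for 10 minutes" as "every one of the last 10 samples" are assumptions; your monitoring system may define sustained breaches differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteSample:
    """Request and 5xx counts for one minute (hypothetical shape)."""
    requests: int
    errors_5xx: int

    @property
    def error_rate(self):
        return self.errors_5xx / self.requests if self.requests else 0.0


def should_roll_back(samples, threshold=0.02, sustained_minutes=10):
    """True only if each of the last `sustained_minutes` samples exceeds the threshold."""
    if len(samples) < sustained_minutes:
        return False
    recent = samples[-sustained_minutes:]
    return all(sample.error_rate > threshold for sample in recent)
```

Requiring every sample in the window to breach the threshold avoids rolling back on a single noisy minute; a stricter team might instead use the average over the window.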
Day-of phase:
- On-call aware: who's watching; how to contact.
- Release owner starts the rollout and calls pauses.
- Staged rollout: 1% → 5% → 25% → 50% → 100%.
- Monitor each step; pause if error rate spikes.
- Comms: status page, stakeholders.
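One common way to implement a staged rollout is to hash each user into a stable bucket from 0 to 99 and enable the flag for buckets below the current percentage. The sketch below illustrates that idea; the function names are hypothetical and a real feature-flag service would add targeting rules, overrides, and auditing.

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 50, 100]  # percentages from the staged rollout above


def bucket_for(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for the given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id, flag_name, rollout_percent):
    """A user is in the rollout if their bucket falls below the percentage."""
    return bucket_for(user_id, flag_name) < rollout_percent
```

Because the bucket depends only on the flag name and user id, a user who sees the feature at 5% keeps it at 25% and beyond, and setting the percentage to 0 acts as the kill switch.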
Post-launch phase:
- 24–48h: watch for delayed issues.
- 1 week: remove old code path if stable.
- Retro: what went well; what to improve.
Checklist items in detail:
- Feature flag: Must act as a kill switch; no code path may bypass it.
- Dashboards: Include baseline (pre-launch) for comparison.
- Rollback rehearsed: Team has done it in staging; knows the steps.
- Rollback trigger: Predefined threshold; do not invent it during the incident.
- Load test: At least 2x expected peak; identify bottlenecks.
- Security review: No critical/high vulns; authZ verified.
- On-call: Pager rotation updated; runbook exists.
- Release owner: One accountable person; no ambiguity at launch time.
- Go/no-go: One decision maker with clear pass/fail criteria.
Launch decision format:
## Launch Decision
- **Release owner**: [Name]
- **Go/No-Go decider**: [Name]
- **Primary success metric**: [Metric + target]
- **Rollback trigger**: [Threshold]
- **Escalation path**: [Who joins if the rollout stalls]
Senior rule:
Never launch without a rollback plan. If you can't roll back, you're not ready.
HOW TO RUN A DESIGN REVIEW
- send doc 24h early
- start with constraints + invariants
- explicitly list tradeoffs
- end with decision + follow-ups
Facilitation tips:
- Time-box: 30–45 min. Respect attendees' time.
- Assign a moderator: keep discussion on track.
- Capture decisions in the doc or a follow-up comment.
Common anti-patterns in reviews:
- Scope creep: "While we're at it, let's also..." → Park for later.
- Solution-first: Jumping to implementation before problem is clear → Redirect to problem.
- Bike-shedding: Arguing over naming for 20 min → Call it; move on.
How to handle disagreements:
- Acknowledge: "Good point. Let's capture that as a risk."
- Defer: "Out of scope for this doc; separate RFC?"
- Decide: "We're going with X. We can revisit in 90 days."
- Escalate: If no consensus, involve tech lead or product.
Pre-meeting prep:
- Author: ensure doc is complete; list open questions.
- Attendees: read doc before meeting; come with questions.
- Moderator: prepare agenda; allocate time per section.
Post-meeting:
- Send summary: decision, action items, owners.
- Update doc with any changes.
- Create follow-up tickets.
Senior rule:
A design review without a clear decision at the end is a wasted meeting.
HOW TO WRITE AN RFC THAT GETS ADOPTED
- anchor to a real pain
- keep scope tight
- include rollout path
- make adoption incremental
RFC structure:
- Title: Clear, descriptive.
- Summary: One paragraph. What and why.
- Motivation: What pain does this solve? Metrics if possible.
- Proposal: Detailed design. Diagrams.
- Alternatives: What else did you consider?
- Rollout: How to adopt incrementally?
- Open questions: What's unresolved?
- Owner + approval: Who is accountable? Who signs off?
RFC lifecycle:
- Draft → Author is iterating; feedback welcome.
- Review → Ready for review; comment period (e.g., 1 week).
- Accepted → Approved; ready to implement.
- Implemented → Done; link to PRs.
Tips for getting buy-in:
- Lead with pain: "We lose 2 hours/week debugging X."
- Show, don't tell: Prototype or spike before full RFC.
- Incremental adoption: "Phase 1: one team. Phase 2: expand."
- Address objections: Preempt "what about Y?" in the doc.
Senior rule:
An RFC that's too big will stall. Ship a small RFC first; extend later.
RUNBOOKS & OPERATIONAL DOCS
Runbook structure:
- Title: What incident/operation this covers.
- When to use: Trigger (alert, page, manual).
- Prerequisites: Access, permissions, tools.
- Steps: Numbered; copy-pasteable commands where possible.
- Decision trees: If X, do Y. If Z, do W.
- Escalation: When to call someone else.
Decision tree example:
Alert: Payment API latency > 500ms
├─ Check DB connections
│ ├─ Pool exhausted? → Scale up or kill long-running queries
│ └─ Pool OK? → Check downstream services
│
├─ Check error rate
│ ├─ 5xx spike? → Check recent deploy; consider rollback
│ └─ 5xx OK? → Check external dependencies (Stripe, etc.)
│
└─ Still unclear? → Escalate to on-call lead
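The same tree can be written as a small function, which is handy if you want chat-ops tooling to suggest next steps. This is only an illustrative encoding: the function name and the two boolean inputs are assumptions, and the action strings are copied from the tree above.

```python
def triage_payment_latency(pool_exhausted, error_5xx_spike):
    """Return ordered next actions for the 'Payment API latency > 500ms' alert."""
    actions = []
    # Branch 1: DB connections
    if pool_exhausted:
        actions.append("Scale up or kill long-running queries")
    else:
        actions.append("Check downstream services")
    # Branch 2: error rate
    if error_5xx_spike:
        actions.append("Check recent deploy; consider rollback")
    else:
        actions.append("Check external dependencies (Stripe, etc.)")
    # Fallback when the branches above do not explain the alert
    actions.append("If still unclear: escalate to on-call lead")
    return actions
```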
Alert-to-action mapping:
| Alert | Action | Runbook |
|---|---|---|
| High error rate | Check deploy; rollback if recent | runbook-rollback.md |
| DB latency | Check slow queries; kill if needed | runbook-db.md |
| Disk full | Clean logs; expand volume | runbook-disk.md |
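A mapping like this can live in the repo as data, which makes it possible to check in CI that no alert ships without a runbook. A minimal sketch follows; the dictionary mirrors the table above, while the function name and the "Queue backlog" alert in the usage note are hypothetical.

```python
# Alert name -> runbook file, mirroring the table above.
ALERT_RUNBOOKS = {
    "High error rate": "runbook-rollback.md",
    "DB latency": "runbook-db.md",
    "Disk full": "runbook-disk.md",
}


def alerts_without_runbook(alert_names, runbooks=ALERT_RUNBOOKS):
    """Return the sorted alert names that have no runbook entry."""
    return sorted(name for name in alert_names if not runbooks.get(name))
```

For example, `alerts_without_runbook(["High error rate", "Queue backlog"])` returns `["Queue backlog"]`, which a CI job could treat as a failure.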
Runbook maintenance: Runbooks rot. Review quarterly. After each incident: "Did the runbook help? What's missing?" Add "last verified" date. Test runbooks in staging.
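The "last verified" date becomes enforceable once a script can read it. The sketch below assumes each runbook contains a line of the form `Last verified: YYYY-MM-DD` and treats 90 days as a rough stand-in for "quarterly"; both the line format and the function name are assumptions.

```python
import re
from datetime import date, timedelta

# Assumed convention: a line such as "Last verified: 2024-01-10" in each runbook.
LAST_VERIFIED = re.compile(
    r"^Last verified:\s*(\d{4}-\d{2}-\d{2})\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def is_stale(runbook_text, today, max_age=timedelta(days=90)):
    """A runbook is stale if it has no verification date or the date is too old."""
    match = LAST_VERIFIED.search(runbook_text)
    if match is None:
        return True
    verified = date.fromisoformat(match.group(1))
    return today - verified > max_age
```

Treating a missing date as stale nudges authors to add one when they first write the runbook.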
Runbook format: Use numbered steps. Include exact commands. Add "Expected output" so on-call knows if step succeeded. Add "If this fails" for common failure modes.
Example runbook step:
Step 3: Check DB connection count
$ psql -c "SELECT count(*) FROM pg_stat_activity"
Expected: < 100. If > 150, pool may be exhausted.
If this fails: Ensure psql is installed; check DB credentials in env.
Runbook templates: Create templates for common patterns (rollback, scale up, restart service). Customize per service. Reduces time to write new runbooks.
Senior rule:
Every alert should have a runbook. If you don't know what to do when it fires, the alert is noise.
DOCUMENTATION AS CODE
Docs-as-code practices:
- Store docs in the repo (Markdown, not Confluence).
- Version with code; PR review for doc changes.
- Build from source (e.g., Docusaurus, MkDocs).
Keeping docs current:
- Link docs to code: "See `api/orders/` for implementation."
- Add "last updated" or auto-generate from git.
- Deprecate old docs: "Superseded by X" with link.
Review processes:
- Doc changes in same PR as code changes when possible.
- Dedicated doc review for large docs.
- Lint: broken links, spelling, structure.
Tooling: Markdown lint (markdownlint), link checker (lychee, linkchecker), spell check (cspell). Run in CI. Docusaurus, MkDocs, VitePress for static site generation from Markdown.
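To show what a link check does under the hood, here is a deliberately small sketch that finds relative Markdown links pointing at files that do not exist. It is not a substitute for lychee or linkchecker: the regular expression covers only inline `[text](target)` links, external URLs are skipped rather than fetched, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Inline Markdown links only: [text](target)
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def relative_link_targets(markdown_text):
    """Return file targets of relative links, without any #fragment."""
    targets = []
    for target in LINK.findall(markdown_text):
        if SCHEME.match(target) or target.startswith("#"):
            continue  # skip URLs, mailto links, and in-page anchors
        targets.append(target.split("#", 1)[0])
    return targets


def broken_links(doc_path):
    """Return relative link targets in doc_path that do not exist on disk."""
    doc_path = Path(doc_path)
    text = doc_path.read_text(encoding="utf-8")
    return [
        target
        for target in relative_link_targets(text)
        if not (doc_path.parent / target).exists()
    ]
```

A CI step could run `broken_links` over every `.md` file in the repo and fail if any list is non-empty.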
Ownership: Assign doc owners for critical docs. Review during onboarding: "Read these first." Deprecation notices: when removing a feature, update docs and add "Deprecated" banner.
Senior rule:
Docs that live outside the repo rot. Docs that live in the repo get updated when the code changes.
EXERCISES
- Write a design doc for your next feature using the template.
- Create an ADR for a recent decision your team made.
- Draft a postmortem for a past incident (even if informal).
- Build a launch checklist for your next release.
- Write a runbook for one of your alerts.
APPENDIX — ARTIFACT QUICK REFERENCE
| Artifact | When | Key sections |
|---|---|---|
| Design doc | New feature, architecture change | Problem, goals/non-goals, constraints, solution, tradeoffs, failure modes, observability, owner/approval, rollout, rollback, open questions |
| ADR | Significant decision | Context, decision, consequences |
| Postmortem | After incident | Timeline, root cause, action items |
| Launch checklist | Before ship | Flag, dashboards, release owner, go/no-go, rollback trigger, on-call |
| Runbook | Per alert | Steps, decision tree, escalation |
| RFC | Cross-team change | Motivation, proposal, rollout, owner/approval, open questions |