Part VIII (d) — Execution Artifacts (Design Docs, ADRs, Postmortems, Launch)

OUTPUT AT SENIOR LEVEL IS DOCUMENTED DECISION-MAKING

Seniors don't just ship code.

They ship:

clarity
alignment
reversible decisions
safe rollouts

Artifacts are how you create leverage.

When To Reach For Which Artifact

Use the smallest artifact that still creates sufficient clarity:

Situation	Best Artifact
small local change	ticket or PR description
multi-service or risky change	design doc
durable architecture choice	ADR
incident learning	postmortem
risky release	launch checklist
repeated operational task	runbook

Teams become document-heavy when they use the same artifact for every situation. They become chaotic when they use none.

GOLD-STANDARD DESIGN DOC (TEMPLATE)

Include:

Problem + user impact
Goals / non-goals
Constraints
Proposed solution (with diagrams)
Alternatives + tradeoffs
Failure modes
Observability plan
Rollout + rollback
Open questions
Owner + approver

When to write a design doc: New feature spanning multiple services. Architecture change. Migration. Anything that affects more than one team or has significant risk. For small, localized changes, a ticket description may suffice.

Full template with example filled-in sections:

# Design Doc: Real-time Collaboration for Document Editor

## 1. Problem & User Impact
Users editing the same document see stale content. Edits overwrite each other.
Impact: 40% of shared-doc sessions report "lost my changes" in support tickets.

## 2. Goals
- Multiple users see each other's edits within 2 seconds
- No lost updates (merge, don't overwrite)
- Works for 50 concurrent editors per doc

## 3. Non-Goals
- Offline editing (not in scope)
- Video/audio (separate initiative)

## 4. Constraints
- Must work with existing auth system
- No new infra for MVP (use existing WebSocket infra)
- Budget: 2 engineers, 6 weeks

## 5. Proposed Solution
[Diagram: Client → WebSocket → Broadcast Service → Clients]
- Use Operational Transform (OT) for conflict resolution
- WebSocket room per document; broadcast deltas
- Server is source of truth; clients apply OT locally

## 6. Alternatives Considered
| Option | Pros | Cons |
|--------|------|------|
| CRDTs | No server | Complex; larger payload |
| Last-write-wins | Simple | Lost updates |
| OT (chosen) | Proven, server control | Server complexity |

## 7. Failure Modes
- WebSocket disconnect → queue deltas; reconnect and sync
- Server crash → clients reconnect; full doc sync on join

## 8. Observability
- Metric: latency p99 for delta delivery
- Alert: >5s disconnect rate
- Dashboard: active editors per doc

## 9. Owner + Approval
- Owner: Collaboration team
- Tech lead approver: [Name]
- PM approver: [Name]

## 10. Rollout
- Feature flag: realtime_collab
- Phase 1: 5% internal, 2 weeks
- Phase 2: 50% beta, 2 weeks
- Phase 3: 100%

## 11. Rollback
- Disable flag → fall back to manual refresh
- No data migration to revert
- Trigger: rollback if disconnect rate > 3x baseline or document corruption is detected

## 12. Open Questions
- OT library choice: ShareDB vs custom?
- How to handle very large docs (100k+ chars)?

Tips on diagrams:

Use ASCII or Mermaid for version control.
Show data flow: where does data come from, where does it go?
Call out failure points (e.g., "if WebSocket drops...").
For complex flows: sequence diagrams (who does what when); for architecture: component diagrams (boxes and arrows).

Diagram types: Component diagram (services, DBs), sequence diagram (request flow), state diagram (status transitions), data flow (where data lives). Pick the right one for the question.

What makes a design doc get approved:

Clear problem statement with metrics.
Explicit tradeoffs; no "we'll figure it out later."
Explicit owner and approver.
Rollback plan. No rollback = no approval.
Bounded scope. "Phase 1" is better than "everything."

Senior rule:

A design doc that doesn't answer "what if it fails?" is incomplete.

ADR (ARCHITECTURE DECISION RECORD)

One page:

Context
Decision
Consequences
Status (proposed/accepted/deprecated)

Senior rule:

ADRs prevent teams from re-litigating old decisions endlessly.

Complete ADR example:

# ADR-002: Use PostgreSQL for Event Sourcing

## Status
Accepted (2024-01-15)

## Context
We need to store audit events for compliance. Volume: ~500k events/day.
Options: PostgreSQL append-only table, Kafka, DynamoDB.

## Decision
Use PostgreSQL with an append-only `events` table. Partition by month.
No schema changes to existing rows.

## Consequences
- **Positive:** Single DB; familiar ops; ACID; simple queries.
- **Negative:** Scale limit ~10M events/month per partition; may need migration later.
- **Neutral:** No real-time stream; batch consumers only.

## Alternatives Rejected
- Kafka: Overkill for our volume; new infra.
- DynamoDB: Cost at scale; query patterns less natural.

When to write one:

Choosing a database, framework, or architecture pattern.
Deprecating a system or approach.
Any decision that will be questioned in 6 months.

ADR lifecycle:

Proposed → draft; discuss
Accepted → decision made; implement
Deprecated → superseded by new ADR; link to replacement

Tooling:

adr-tools: CLI for creating/numbering ADRs.
Markdown in repo: docs/adr/ folder; link from README.
No fancy tooling needed: Plain Markdown is enough.

INCIDENT POSTMORTEM (TEMPLATE)

Summary
Impact
Timeline
Root cause
Contributing factors
What went well
What didn't
Action items (owners + due dates)

Avoid blame. Optimize learning.

Detailed timeline example:

2024-02-03 14:32  User reports "payments not loading"
2024-02-03 14:35  On-call paged; checks dashboard
2024-02-03 14:38  Payment API p99 latency spike detected
2024-02-03 14:40  DB connection pool exhausted (alerts)
2024-02-03 14:42  Rollback of deployment from 14:00 (new query)
2024-02-03 14:50  Connection pool recovers; latency normal
2024-02-03 14:55  All-clear; incident resolved

5-whys analysis:

Why did payments fail? → API timed out.
Why did API timeout? → DB queries slow.
Why slow? → New N+1 query in deployment.
Why did it ship? → N+1 not caught in code review.
Why not caught? → No query performance test in CI.

Blameless culture guidance:

Use "we" not "they." "We shipped a slow query" not "Dev X broke it."
Focus on process: what allowed the bug to reach production?
Action items should improve systems, not punish people.

Action item tracking:

Owner + due date for each item.
Link to postmortem in ticket.
Track in incident follow-up; close when done.

Postmortem distribution: Share with eng org; optionally with customers (sanitized). Store in incident database or wiki. Reference in future incidents: "Similar to INC-2024-02."

Senior rule:

The goal of a postmortem is to make the next incident less likely, not to assign blame.

LAUNCH CHECKLIST (SAFE SHIP)

feature flag ready
dashboards and alerts exist
rollback plan rehearsed
load test / capacity check
security review complete
on-call aware
release owner named
go/no-go decision maker named
rollback triggers written in advance

Pre-launch phase (1–2 weeks before):

Feature flag created; off by default.
Dashboards: latency, error rate, business metrics.
Alerts: threshold-based; escalation path defined.
Load test: run at 2x expected traffic; fix bottlenecks.
Security review: OWASP checklist; no critical vulns.
Rollback plan: documented; team knows the steps.
Runbook: what to do if X goes wrong.
Release owner: one person accountable for the launch.
Go/no-go decider: explicit name, not "the room decides."
Rollback trigger: written threshold, e.g. "rollback if 5xx > 2% for 10 minutes."

Day-of phase:

On-call aware: who's watching; how to contact.
Release owner starts the rollout and calls pauses.
Staged rollout: 1% → 5% → 25% → 50% → 100%.
Monitor each step; pause if error rate spikes.
Comms: status page, stakeholders.

Post-launch phase:

24–48h: watch for delayed issues.
1 week: remove old code path if stable.
Retro: what went well; what to improve.

Checklist items in detail:

Feature flag: Must be kill-switch; no code path that bypasses it.
Dashboards: Include baseline (pre-launch) for comparison.
Rollback rehearsed: Team has done it in staging; knows the steps.
Rollback trigger: Predefined threshold; do not invent it during the incident.
Load test: At least 2x expected peak; identify bottlenecks.
Security review: No critical/high vulns; authZ verified.
On-call: Pager rotation updated; runbook exists.
Release owner: One accountable person; no ambiguity at launch time.
Go/no-go: One decision maker with clear pass/fail criteria.

Launch decision format:

## Launch Decision
- **Release owner**: [Name]
- **Go/No-Go decider**: [Name]
- **Primary success metric**: [Metric + target]
- **Rollback trigger**: [Threshold]
- **Escalation path**: [Who joins if the rollout stalls]

Senior rule:

Never launch without a rollback plan. If you can't roll back, you're not ready.

HOW TO RUN A DESIGN REVIEW

send doc 24h early
start with constraints + invariants
explicitly list tradeoffs
end with decision + follow-ups

Facilitation tips:

Time-box: 30–45 min. Respect attendees' time.
Assign a moderator: keep discussion on track.
Capture decisions in the doc or a follow-up comment.

Common anti-patterns in reviews:

Scope creep: "While we're at it, let's also..." → Park for later.
Solution-first: Jumping to implementation before problem is clear → Redirect to problem.
Bike-shedding: Arguing over naming for 20 min → Call it; move on.

How to handle disagreements:

Acknowledge: "Good point. Let's capture that as a risk."
Defer: "Out of scope for this doc; separate RFC?"
Decide: "We're going with X. We can revisit in 90 days."
Escalate: If no consensus, involve tech lead or product.

Pre-meeting prep: Author: ensure doc is complete; list open questions. Attendees: read doc before meeting; come with questions. Moderator: prepare agenda; allocate time per section.

Post-meeting: Send summary: decision, action items, owners. Update doc with any changes. Create follow-up tickets.

Senior rule:

A design review without a clear decision at the end is a wasted meeting.

HOW TO WRITE AN RFC THAT GETS ADOPTED

anchor to a real pain
keep scope tight
include rollout path
make adoption incremental

RFC structure:

Title: Clear, descriptive.
Summary: One paragraph. What and why.
Motivation: What pain does this solve? Metrics if possible.
Proposal: Detailed design. Diagrams.
Alternatives: What else did you consider?
Rollout: How to adopt incrementally?
Open questions: What's unresolved?
Owner + approval: Who is accountable? Who signs off?

RFC lifecycle:

Draft → Author is iterating; feedback welcome.
Review → Ready for review; comment period (e.g., 1 week).
Accepted → Approved; ready to implement.
Implemented → Done; link to PRs.

Tips for getting buy-in:

Lead with pain: "We lose 2 hours/week debugging X."
Show, don't tell: Prototype or spike before full RFC.
Incremental adoption: "Phase 1: one team. Phase 2: expand."
Address objections: Preempt "what about Y?" in the doc.

Senior rule:

An RFC that's too big will stall. Ship a small RFC first; extend later.

RUNBOOKS & OPERATIONAL DOCS

Runbook structure:

Title: What incident/operation this covers.
When to use: Trigger (alert, page, manual).
Prerequisites: Access, permissions, tools.
Steps: Numbered; copy-paste if possible.
Decision trees: If X, do Y. If Z, do W.
Escalation: When to call someone else.

Decision tree example:

Alert: Payment API latency > 500ms

├─ Check DB connections
│  ├─ Pool exhausted? → Scale up or kill long-running queries
│  └─ Pool OK? → Check downstream services
│
├─ Check error rate
│  ├─ 5xx spike? → Check recent deploy; consider rollback
│  └─ 5xx OK? → Check external dependencies (Stripe, etc.)
│
└─ Still unclear? → Escalate to on-call lead

Alert-to-action mapping:

Alert	Action	Runbook
High error rate	Check deploy; rollback if recent	runbook-rollback.md
DB latency	Check slow queries; kill if needed	runbook-db.md
Disk full	Clean logs; expand volume	runbook-disk.md

Runbook maintenance: Runbooks rot. Review quarterly. After each incident: "Did the runbook help? What's missing?" Add "last verified" date. Test runbooks in staging.

Runbook format: Use numbered steps. Include exact commands. Add "Expected output" so on-call knows if step succeeded. Add "If this fails" for common failure modes.

Example runbook step:

Step 3: Check DB connection count
$ psql -c "SELECT count(*) FROM pg_stat_activity"
Expected: < 100. If > 150, pool may be exhausted.
If this fails: Ensure psql is installed; check DB credentials in env.

Runbook templates: Create templates for common patterns (rollback, scale up, restart service). Customize per service. Reduces time to write new runbooks.

Senior rule:

Every alert should have a runbook. If you don't know what to do when it fires, the alert is noise.

DOCUMENTATION AS CODE

Docs-as-code practices:

Store docs in the repo (Markdown, not Confluence).
Version with code; PR review for doc changes.
Build from source (e.g., Docusaurus, MkDocs).

Keeping docs current:

Link docs to code: "See api/orders/ for implementation."
Add "last updated" or auto-generate from git.
Deprecate old docs: "Superseded by X" with link.

Review processes:

Doc changes in same PR as code changes when possible.
Dedicated doc review for large docs.
Lint: broken links, spelling, structure.

Tooling: Markdown lint (markdownlint), link checker (lychee, linkchecker), spell check (cspell). Run in CI. Docusaurus, MkDocs, VitePress for static site generation from Markdown.

Ownership: Assign doc owners for critical docs. Review during onboarding: "Read these first." Deprecation notices: when removing a feature, update docs and add "Deprecated" banner.

Senior rule:

Docs that live outside the repo rot. Docs that live in the repo get updated when the code changes.

EXERCISES

Write a design doc for your next feature using the template.
Create an ADR for a recent decision your team made.
Draft a postmortem for a past incident (even if informal).
Build a launch checklist for your next release.
Write a runbook for one of your alerts.

APPENDIX — ARTIFACT QUICK REFERENCE

Artifact	When	Key sections
Design doc	New feature, architecture change	Problem, goals/non-goals, constraints, solution, tradeoffs, failure modes, observability, owner/approval, rollout, rollback, open questions
ADR	Significant decision	Context, decision, consequences
Postmortem	After incident	Timeline, root cause, action items
Launch checklist	Before ship	Flag, dashboards, release owner, go/no-go, rollback trigger, on-call
Runbook	Per alert	Steps, decision tree, escalation
RFC	Cross-team change	Motivation, proposal, rollout, owner/approval, open questions

🏁 END — EXECUTION ARTIFACTS

OUTPUT AT SENIOR LEVEL IS DOCUMENTED DECISION-MAKING​

When To Reach For Which Artifact​

GOLD-STANDARD DESIGN DOC (TEMPLATE)​

ADR (ARCHITECTURE DECISION RECORD)​

INCIDENT POSTMORTEM (TEMPLATE)​

LAUNCH CHECKLIST (SAFE SHIP)​

HOW TO RUN A DESIGN REVIEW​

HOW TO WRITE AN RFC THAT GETS ADOPTED​

RUNBOOKS & OPERATIONAL DOCS​

DOCUMENTATION AS CODE​

EXERCISES​

APPENDIX — ARTIFACT QUICK REFERENCE​

🏁 END — EXECUTION ARTIFACTS

OUTPUT AT SENIOR LEVEL IS DOCUMENTED DECISION-MAKING

When To Reach For Which Artifact

GOLD-STANDARD DESIGN DOC (TEMPLATE)

ADR (ARCHITECTURE DECISION RECORD)

INCIDENT POSTMORTEM (TEMPLATE)

LAUNCH CHECKLIST (SAFE SHIP)

HOW TO RUN A DESIGN REVIEW

HOW TO WRITE AN RFC THAT GETS ADOPTED

RUNBOOKS & OPERATIONAL DOCS

DOCUMENTATION AS CODE

EXERCISES

APPENDIX — ARTIFACT QUICK REFERENCE