Skip to main content

Part VIII (d) — Execution Artifacts (Design Docs, ADRs, Postmortems, Launch)

OUTPUT AT SENIOR LEVEL IS DOCUMENTED DECISION-MAKING

Seniors don't just ship code.

They ship:

  • clarity

  • alignment

  • reversible decisions

  • safe rollouts

Artifacts are how you create leverage.


When To Reach For Which Artifact

Use the smallest artifact that still creates sufficient clarity:

SituationBest Artifact
small local changeticket or PR description
multi-service or risky changedesign doc
durable architecture choiceADR
incident learningpostmortem
risky releaselaunch checklist
repeated operational taskrunbook

Teams become document-heavy when they use the same artifact for every situation. They become chaotic when they use none.


GOLD-STANDARD DESIGN DOC (TEMPLATE)

Include:

  • Problem + user impact

  • Goals / non-goals

  • Constraints

  • Proposed solution (with diagrams)

  • Alternatives + tradeoffs

  • Failure modes

  • Observability plan

  • Rollout + rollback

  • Open questions

  • Owner + approver

When to write a design doc: New feature spanning multiple services. Architecture change. Migration. Anything that affects more than one team or has significant risk. For small, localized changes, a ticket description may suffice.

Full template with example filled-in sections:

# Design Doc: Real-time Collaboration for Document Editor

## 1. Problem & User Impact
Users editing the same document see stale content. Edits overwrite each other.
Impact: 40% of shared-doc sessions report "lost my changes" in support tickets.

## 2. Goals
- Multiple users see each other's edits within 2 seconds
- No lost updates (merge, don't overwrite)
- Works for 50 concurrent editors per doc

## 3. Non-Goals
- Offline editing (not in scope)
- Video/audio (separate initiative)

## 4. Constraints
- Must work with existing auth system
- No new infra for MVP (use existing WebSocket infra)
- Budget: 2 engineers, 6 weeks

## 5. Proposed Solution
[Diagram: Client → WebSocket → Broadcast Service → Clients]
- Use Operational Transform (OT) for conflict resolution
- WebSocket room per document; broadcast deltas
- Server is source of truth; clients apply OT locally

## 6. Alternatives Considered
| Option | Pros | Cons |
|--------|------|------|
| CRDTs | No server | Complex; larger payload |
| Last-write-wins | Simple | Lost updates |
| OT (chosen) | Proven, server control | Server complexity |

## 7. Failure Modes
- WebSocket disconnect → queue deltas; reconnect and sync
- Server crash → clients reconnect; full doc sync on join

## 8. Observability
- Metric: latency p99 for delta delivery
- Alert: >5s disconnect rate
- Dashboard: active editors per doc

## 9. Owner + Approval
- Owner: Collaboration team
- Tech lead approver: [Name]
- PM approver: [Name]

## 10. Rollout
- Feature flag: realtime_collab
- Phase 1: 5% internal, 2 weeks
- Phase 2: 50% beta, 2 weeks
- Phase 3: 100%

## 11. Rollback
- Disable flag → fall back to manual refresh
- No data migration to revert
- Trigger: rollback if disconnect rate > 3x baseline or document corruption is detected

## 12. Open Questions
- OT library choice: ShareDB vs custom?
- How to handle very large docs (100k+ chars)?

Tips on diagrams:

  • Use ASCII or Mermaid for version control.
  • Show data flow: where does data come from, where does it go?
  • Call out failure points (e.g., "if WebSocket drops...").
  • For complex flows: sequence diagrams (who does what when); for architecture: component diagrams (boxes and arrows).

Diagram types: Component diagram (services, DBs), sequence diagram (request flow), state diagram (status transitions), data flow (where data lives). Pick the right one for the question.

What makes a design doc get approved:

  • Clear problem statement with metrics.
  • Explicit tradeoffs; no "we'll figure it out later."
  • Explicit owner and approver.
  • Rollback plan. No rollback = no approval.
  • Bounded scope. "Phase 1" is better than "everything."

Senior rule:

A design doc that doesn't answer "what if it fails?" is incomplete.


ADR (ARCHITECTURE DECISION RECORD)

One page:

  • Context

  • Decision

  • Consequences

  • Status (proposed/accepted/deprecated)

Senior rule:

ADRs prevent teams from re-litigating old decisions endlessly.

Complete ADR example:

# ADR-002: Use PostgreSQL for Event Sourcing

## Status
Accepted (2024-01-15)

## Context
We need to store audit events for compliance. Volume: ~500k events/day.
Options: PostgreSQL append-only table, Kafka, DynamoDB.

## Decision
Use PostgreSQL with an append-only `events` table. Partition by month.
No schema changes to existing rows.

## Consequences
- **Positive:** Single DB; familiar ops; ACID; simple queries.
- **Negative:** Scale limit ~10M events/month per partition; may need migration later.
- **Neutral:** No real-time stream; batch consumers only.

## Alternatives Rejected
- Kafka: Overkill for our volume; new infra.
- DynamoDB: Cost at scale; query patterns less natural.

When to write one:

  • Choosing a database, framework, or architecture pattern.
  • Deprecating a system or approach.
  • Any decision that will be questioned in 6 months.

ADR lifecycle:

  • Proposed → draft; discuss
  • Accepted → decision made; implement
  • Deprecated → superseded by new ADR; link to replacement

Tooling:

  • adr-tools: CLI for creating/numbering ADRs.
  • Markdown in repo: docs/adr/ folder; link from README.
  • No fancy tooling needed: Plain Markdown is enough.

INCIDENT POSTMORTEM (TEMPLATE)

  • Summary

  • Impact

  • Timeline

  • Root cause

  • Contributing factors

  • What went well

  • What didn't

  • Action items (owners + due dates)

Avoid blame. Optimize learning.

Detailed timeline example:

2024-02-03 14:32  User reports "payments not loading"
2024-02-03 14:35 On-call paged; checks dashboard
2024-02-03 14:38 Payment API p99 latency spike detected
2024-02-03 14:40 DB connection pool exhausted (alerts)
2024-02-03 14:42 Rollback of deployment from 14:00 (new query)
2024-02-03 14:50 Connection pool recovers; latency normal
2024-02-03 14:55 All-clear; incident resolved

5-whys analysis:

  • Why did payments fail? → API timed out.
  • Why did API timeout? → DB queries slow.
  • Why slow? → New N+1 query in deployment.
  • Why did it ship? → N+1 not caught in code review.
  • Why not caught? → No query performance test in CI.

Blameless culture guidance:

  • Use "we" not "they." "We shipped a slow query" not "Dev X broke it."
  • Focus on process: what allowed the bug to reach production?
  • Action items should improve systems, not punish people.

Action item tracking:

  • Owner + due date for each item.
  • Link to postmortem in ticket.
  • Track in incident follow-up; close when done.

Postmortem distribution: Share with eng org; optionally with customers (sanitized). Store in incident database or wiki. Reference in future incidents: "Similar to INC-2024-02."

Senior rule:

The goal of a postmortem is to make the next incident less likely, not to assign blame.


LAUNCH CHECKLIST (SAFE SHIP)

  • feature flag ready

  • dashboards and alerts exist

  • rollback plan rehearsed

  • load test / capacity check

  • security review complete

  • on-call aware

  • release owner named

  • go/no-go decision maker named

  • rollback triggers written in advance

Pre-launch phase (1–2 weeks before):

  • Feature flag created; off by default.
  • Dashboards: latency, error rate, business metrics.
  • Alerts: threshold-based; escalation path defined.
  • Load test: run at 2x expected traffic; fix bottlenecks.
  • Security review: OWASP checklist; no critical vulns.
  • Rollback plan: documented; team knows the steps.
  • Runbook: what to do if X goes wrong.
  • Release owner: one person accountable for the launch.
  • Go/no-go decider: explicit name, not "the room decides."
  • Rollback trigger: written threshold, e.g. "rollback if 5xx > 2% for 10 minutes."

Day-of phase:

  • On-call aware: who's watching; how to contact.
  • Release owner starts the rollout and calls pauses.
  • Staged rollout: 1% → 5% → 25% → 50% → 100%.
  • Monitor each step; pause if error rate spikes.
  • Comms: status page, stakeholders.

Post-launch phase:

  • 24–48h: watch for delayed issues.
  • 1 week: remove old code path if stable.
  • Retro: what went well; what to improve.

Checklist items in detail:

  • Feature flag: Must be kill-switch; no code path that bypasses it.
  • Dashboards: Include baseline (pre-launch) for comparison.
  • Rollback rehearsed: Team has done it in staging; knows the steps.
  • Rollback trigger: Predefined threshold; do not invent it during the incident.
  • Load test: At least 2x expected peak; identify bottlenecks.
  • Security review: No critical/high vulns; authZ verified.
  • On-call: Pager rotation updated; runbook exists.
  • Release owner: One accountable person; no ambiguity at launch time.
  • Go/no-go: One decision maker with clear pass/fail criteria.

Launch decision format:

## Launch Decision
- **Release owner**: [Name]
- **Go/No-Go decider**: [Name]
- **Primary success metric**: [Metric + target]
- **Rollback trigger**: [Threshold]
- **Escalation path**: [Who joins if the rollout stalls]

Senior rule:

Never launch without a rollback plan. If you can't roll back, you're not ready.


HOW TO RUN A DESIGN REVIEW

  • send doc 24h early

  • start with constraints + invariants

  • explicitly list tradeoffs

  • end with decision + follow-ups

Facilitation tips:

  • Time-box: 30–45 min. Respect attendees' time.
  • Assign a moderator: keep discussion on track.
  • Capture decisions in the doc or a follow-up comment.

Common anti-patterns in reviews:

  • Scope creep: "While we're at it, let's also..." → Park for later.
  • Solution-first: Jumping to implementation before problem is clear → Redirect to problem.
  • Bike-shedding: Arguing over naming for 20 min → Call it; move on.

How to handle disagreements:

  • Acknowledge: "Good point. Let's capture that as a risk."
  • Defer: "Out of scope for this doc; separate RFC?"
  • Decide: "We're going with X. We can revisit in 90 days."
  • Escalate: If no consensus, involve tech lead or product.

Pre-meeting prep: Author: ensure doc is complete; list open questions. Attendees: read doc before meeting; come with questions. Moderator: prepare agenda; allocate time per section.

Post-meeting: Send summary: decision, action items, owners. Update doc with any changes. Create follow-up tickets.

Senior rule:

A design review without a clear decision at the end is a wasted meeting.


HOW TO WRITE AN RFC THAT GETS ADOPTED

  • anchor to a real pain

  • keep scope tight

  • include rollout path

  • make adoption incremental

RFC structure:

  1. Title: Clear, descriptive.
  2. Summary: One paragraph. What and why.
  3. Motivation: What pain does this solve? Metrics if possible.
  4. Proposal: Detailed design. Diagrams.
  5. Alternatives: What else did you consider?
  6. Rollout: How to adopt incrementally?
  7. Open questions: What's unresolved?
  8. Owner + approval: Who is accountable? Who signs off?

RFC lifecycle:

  • Draft → Author is iterating; feedback welcome.
  • Review → Ready for review; comment period (e.g., 1 week).
  • Accepted → Approved; ready to implement.
  • Implemented → Done; link to PRs.

Tips for getting buy-in:

  • Lead with pain: "We lose 2 hours/week debugging X."
  • Show, don't tell: Prototype or spike before full RFC.
  • Incremental adoption: "Phase 1: one team. Phase 2: expand."
  • Address objections: Preempt "what about Y?" in the doc.

Senior rule:

An RFC that's too big will stall. Ship a small RFC first; extend later.


RUNBOOKS & OPERATIONAL DOCS

Runbook structure:

  • Title: What incident/operation this covers.
  • When to use: Trigger (alert, page, manual).
  • Prerequisites: Access, permissions, tools.
  • Steps: Numbered; copy-paste if possible.
  • Decision trees: If X, do Y. If Z, do W.
  • Escalation: When to call someone else.

Decision tree example:

Alert: Payment API latency > 500ms

├─ Check DB connections
│ ├─ Pool exhausted? → Scale up or kill long-running queries
│ └─ Pool OK? → Check downstream services

├─ Check error rate
│ ├─ 5xx spike? → Check recent deploy; consider rollback
│ └─ 5xx OK? → Check external dependencies (Stripe, etc.)

└─ Still unclear? → Escalate to on-call lead

Alert-to-action mapping:

AlertActionRunbook
High error rateCheck deploy; rollback if recentrunbook-rollback.md
DB latencyCheck slow queries; kill if neededrunbook-db.md
Disk fullClean logs; expand volumerunbook-disk.md

Runbook maintenance: Runbooks rot. Review quarterly. After each incident: "Did the runbook help? What's missing?" Add "last verified" date. Test runbooks in staging.

Runbook format: Use numbered steps. Include exact commands. Add "Expected output" so on-call knows if step succeeded. Add "If this fails" for common failure modes.

Example runbook step:

Step 3: Check DB connection count
$ psql -c "SELECT count(*) FROM pg_stat_activity"
Expected: < 100. If > 150, pool may be exhausted.
If this fails: Ensure psql is installed; check DB credentials in env.

Runbook templates: Create templates for common patterns (rollback, scale up, restart service). Customize per service. Reduces time to write new runbooks.

Senior rule:

Every alert should have a runbook. If you don't know what to do when it fires, the alert is noise.


DOCUMENTATION AS CODE

Docs-as-code practices:

  • Store docs in the repo (Markdown, not Confluence).
  • Version with code; PR review for doc changes.
  • Build from source (e.g., Docusaurus, MkDocs).

Keeping docs current:

  • Link docs to code: "See api/orders/ for implementation."
  • Add "last updated" or auto-generate from git.
  • Deprecate old docs: "Superseded by X" with link.

Review processes:

  • Doc changes in same PR as code changes when possible.
  • Dedicated doc review for large docs.
  • Lint: broken links, spelling, structure.

Tooling: Markdown lint (markdownlint), link checker (lychee, linkchecker), spell check (cspell). Run in CI. Docusaurus, MkDocs, VitePress for static site generation from Markdown.

Ownership: Assign doc owners for critical docs. Review during onboarding: "Read these first." Deprecation notices: when removing a feature, update docs and add "Deprecated" banner.

Senior rule:

Docs that live outside the repo rot. Docs that live in the repo get updated when the code changes.


EXERCISES

  1. Write a design doc for your next feature using the template.

  2. Create an ADR for a recent decision your team made.

  3. Draft a postmortem for a past incident (even if informal).

  4. Build a launch checklist for your next release.

  5. Write a runbook for one of your alerts.


APPENDIX — ARTIFACT QUICK REFERENCE

ArtifactWhenKey sections
Design docNew feature, architecture changeProblem, goals/non-goals, constraints, solution, tradeoffs, failure modes, observability, owner/approval, rollout, rollback, open questions
ADRSignificant decisionContext, decision, consequences
PostmortemAfter incidentTimeline, root cause, action items
Launch checklistBefore shipFlag, dashboards, release owner, go/no-go, rollback trigger, on-call
RunbookPer alertSteps, decision tree, escalation
RFCCross-team changeMotivation, proposal, rollout, owner/approval, open questions

🏁 END — EXECUTION ARTIFACTS