Chapter 11: Non-Functional Specifications
Learning Objectives
By the end of this chapter, you will be able to:
- Define measurable non-functional requirements for the five core quality attributes
- Write performance specifications with latency budgets, throughput targets, and bundle size limits
- Specify security requirements for authentication, encryption, and audit logging
- Define scalability requirements for concurrent users, data volume, and geographic distribution
- Create observability specifications: logs, metrics, and alert thresholds
- Write NFRs for the collaboration platform running project
- Distinguish NFRs from constraints and understand their relationship
- Explain how AI agents use NFRs to make architectural decisions
What Are Non-Functional Requirements?
Functional requirements answer: What does the system do? Non-functional requirements answer: How well does it do it?
A login feature might correctly authenticate users (functional) but take 5 seconds to respond (non-functional failure). A payment system might process transactions correctly (functional) but leak card numbers in logs (non-functional failure). A collaboration platform might allow real-time editing (functional) but support only 10 concurrent users (non-functional failure).
Non-functional requirements define the quality attributes that determine whether a system succeeds in production: performance, security, scalability, reliability, and observability. Without them, implementations can be correct and useless — or correct and dangerous.
The Measurability Principle
The cardinal rule of NFRs: every NFR must be measurable. Vague requirements like "the system should be fast" or "the system must be secure" are untestable. AI cannot implement "fast." It can implement "p95 latency < 200ms."
Bad vs. Good
| Bad (Vague) | Good (Measurable) |
|---|---|
| The system should be fast | p95 API latency < 200ms; p99 < 500ms |
| The system must be secure | All secrets in vault; TLS 1.3; no PII in logs |
| The system should scale | Support 10,000 concurrent users; horizontal scaling |
| The system should be reliable | 99.9% uptime; RPO < 1 hour; RTO < 4 hours |
| We need good observability | Structured logs; 5 core metrics; 3 alert rules |
The SMART Test
Every NFR should be:
- Specific — Clear what is being measured
- Measurable — Quantifiable or verifiable
- Achievable — Realistic for the context
- Relevant — Tied to business or user impact
- Testable — Can be validated in CI or production
Quality Attribute 1: Performance
Performance NFRs define how quickly the system responds and how much load it can handle.
Latency Budgets
Latency is the time from request to response. Specify percentiles, not averages — p50 (median) can look good while p99 is terrible.
## Performance NFRs
### P-1: API Latency
- **p50**: < 100ms for read operations
- **p95**: < 200ms for read operations
- **p99**: < 500ms for read operations
- **p95**: < 500ms for write operations (create, update, delete)
- **Excluded**: Bulk export, report generation, search (separate SLAs)
### P-2: Page Load
- **LCP** (Largest Contentful Paint): < 2.5s on 4G
- **FID** (First Input Delay): < 100ms
- **CLS** (Cumulative Layout Shift): < 0.1
### P-3: Real-Time Updates
- **WebSocket message latency**: < 100ms from server event to client delivery
- **Presence heartbeat**: 30-second interval; 90-second timeout for "away"
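To see why percentiles beat averages, here is a minimal sketch of the nearest-rank percentile method (APM tools use variants of this). The sample data is invented to show how an average hides the tail:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of samples."""
    ordered = sorted(samples)
    # ceil(p/100 * n) as a 1-based rank, via negative floor division
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# 100 requests: most fast, a tail of slow ones, two pathological outliers
latencies_ms = [80] * 89 + [400] * 9 + [3000] * 2

print(percentile(latencies_ms, 50))           # 80   - median looks great
print(percentile(latencies_ms, 95))           # 400  - the tail appears
print(percentile(latencies_ms, 99))           # 3000 - what 1% of users feel
print(sum(latencies_ms) / len(latencies_ms))  # 167.2 - average hides all of it
```

The average (167ms) would pass a naive "under 200ms" check even though 2% of requests take 3 seconds. That is why P-1 specifies p50, p95, and p99 rather than a mean.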
Throughput Targets
Throughput is the number of requests (or operations) per second the system must sustain.
### P-4: Throughput
- **Sustained**: 1,000 requests/second per service instance
- **Peak**: 3,000 requests/second (3x normal) for 5 minutes
- **Burst**: 5,000 requests/second for 30 seconds (e.g., flash sale)
Bundle Size Limits
For frontend applications, bundle size affects load time and user experience.
### P-5: Frontend Bundle
- **Initial JS bundle**: < 200 KB gzipped
- **Total initial load**: < 500 KB gzipped (JS + CSS + critical assets)
- **Lazy-loaded route**: < 50 KB gzipped per route
- **Chunk size cap**: No single chunk > 250 KB (split large dependencies)
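A budget like P-5 is only useful if CI enforces it. A minimal sketch of a gate that gzips a build artifact and compares it to the budget; the artifact names and byte payloads here are hypothetical stand-ins for real build output:

```python
import gzip

# Budgets from P-5, in bytes (gzipped)
BUDGETS = {"initial_js": 200 * 1024, "route_chunk": 50 * 1024}

def gzipped_size(data: bytes) -> int:
    """Size of the payload after gzip, roughly as a CDN would serve it."""
    return len(gzip.compress(data, compresslevel=9))

def check_budget(name: str, data: bytes, budget: int) -> bool:
    """Print a pass/fail line and return whether the artifact fits."""
    size = gzipped_size(data)
    status = "OK" if size <= budget else "OVER BUDGET"
    print(f"{name}: {size} / {budget} bytes gzipped - {status}")
    return size <= budget

# Hypothetical artifact; in CI this would be read from the build output dir
main_bundle = b"console.log('app');" * 500
assert check_budget("main.js", main_bundle, BUDGETS["initial_js"])
```

In a real pipeline the script would iterate over the build directory and exit non-zero on any failure, failing the build before an oversized bundle ships.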
Database and Query Performance
### P-6: Database
- **Query latency**: p95 < 50ms for indexed queries
- **Connection pool**: Max 100 connections per service; queue excess
- **Slow query threshold**: Log and alert on queries > 1 second
Quality Attribute 2: Security
Security NFRs define authentication, authorization, encryption, and audit requirements.
Authentication
## Security NFRs
### SEC-1: Authentication
- **Mechanism**: JWT or session-based; OAuth2 for third-party
- **Token expiry**: Access token 15 minutes; refresh token 7 days
- **Password policy**: Min 12 chars; complexity (upper, lower, digit, special);
bcrypt cost factor 12; no password in logs or errors
- **MFA**: Optional for users; required for admin roles
- **Session invalidation**: On password change, logout, or 30 days idle
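The password policy in SEC-1 is directly testable. A minimal sketch of a validator enforcing the length and complexity rules (the actual hashing step, bcrypt at cost factor 12, is omitted since it needs a third-party library):

```python
import re

def password_ok(pw: str) -> bool:
    """Enforce SEC-1: min 12 chars; upper, lower, digit, and special char."""
    checks = [
        len(pw) >= 12,
        re.search(r"[A-Z]", pw),        # uppercase
        re.search(r"[a-z]", pw),        # lowercase
        re.search(r"\d", pw),           # digit
        re.search(r"[^A-Za-z0-9]", pw), # special character
    ]
    return all(checks)

assert password_ok("Correct-Horse-9")       # meets all five rules
assert not password_ok("short1!A")          # under 12 characters
assert not password_ok("nouppercase9!aaa")  # missing an uppercase letter
```

Note what the validator deliberately does not do: it never logs or echoes the password, consistent with "no password in logs or errors."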
Authorization
### SEC-2: Authorization
- **Model**: Role-based (RBAC); roles: Viewer, Member, Admin, Owner
- **Default**: Deny; explicit grant required
- **Resource-level**: Check permission on every API call; no client-side-only checks
- **Admin actions**: Require re-authentication for sensitive operations
Encryption
### SEC-3: Encryption
- **In transit**: TLS 1.3; HSTS enabled; no mixed content
- **At rest**: AES-256 for sensitive data (passwords, tokens, PII)
- **Secrets**: Stored in vault (e.g., HashiCorp Vault, AWS Secrets Manager);
never in code or config files
- **API keys**: Hashed before storage; only hash compared on validation
Audit Logging
### SEC-4: Audit
- **Events**: Login, logout, password change, role change, data export, admin actions
- **Fields**: userId, action, resource, timestamp, IP, userAgent, result (success/failure)
- **Retention**: 90 days hot; 1 year cold storage
- **Access**: Restricted to security team; immutable (append-only)
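The SEC-4 field list and append-only rule can be sketched as a tiny in-process audit writer. This is an illustration only; production audit logs go to an immutable external store, and the class name and shape here are invented:

```python
import json
import time

# Required fields from SEC-4 (timestamp is filled in automatically)
REQUIRED = ["userId", "action", "resource", "ip", "userAgent", "result"]

class AuditLog:
    """Append-only sketch: records can be added, never edited or removed."""
    def __init__(self):
        self._records = []

    def append(self, **fields):
        missing = [f for f in REQUIRED if f not in fields]
        if missing:
            raise ValueError(f"missing audit fields: {missing}")
        fields.setdefault("timestamp", time.time())
        # serialize immediately so later code cannot mutate the record
        self._records.append(json.dumps(fields))

    def records(self):
        return tuple(self._records)  # read-only view
```

Rejecting incomplete events at write time is the point: an audit trail with missing actor or result fields fails its purpose silently.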
Data Protection
### SEC-5: Data Protection
- **PII**: No PII in logs, errors, or analytics without hashing/masking
- **GDPR**: Support data export and deletion within 30 days
- **Secrets**: Rotate API keys every 90 days; support emergency rotation
Quality Attribute 3: Scalability
Scalability NFRs define how the system grows with load: more users, more data, more geography.
Concurrent Users
## Scalability NFRs
### SCL-1: Concurrent Users
- **Target**: 10,000 concurrent authenticated users
- **Per-tenant**: 500 concurrent users per workspace (enterprise tier)
- **Real-time**: 1,000 concurrent WebSocket connections per server instance
- **Growth**: Design for 3x growth in 12 months without architecture change
Data Volume
### SCL-2: Data Volume
- **Workspaces**: 100,000 workspaces; 1,000 projects per workspace
- **Documents**: 10 million documents; 1 GB max per document
- **Messages**: 100 million messages; 7-year retention
- **Storage**: Support object storage (S3-compatible) for files; CDN for static assets
Geographic Distribution
### SCL-3: Geographic Distribution
- **Primary region**: Single region for MVP; multi-region for enterprise
- **Latency**: p95 < 300ms for users in primary region
- **Data residency**: Support EU-only deployment for GDPR
- **CDN**: Static assets served from edge; < 100ms TTFB from edge
Horizontal Scaling
### SCL-4: Horizontal Scaling
- **Stateless services**: All API and worker services stateless; scale by adding instances
- **Database**: Read replicas for read-heavy workloads; connection pooling
- **Queue**: Message queue for async jobs; at-least-once delivery
- **Caching**: Distributed cache (Redis) for sessions and hot data
Quality Attribute 4: Reliability
Reliability NFRs define uptime, fault tolerance, and recovery.
Availability
## Reliability NFRs
### REL-1: Availability
- **Target**: 99.9% uptime (8.76 hours downtime/year)
- **Measurement**: Per service; exclude planned maintenance (with notice)
- **Degradation**: Graceful degradation (read-only mode) preferred over full outage
Recovery
### REL-2: Recovery
- **RPO** (Recovery Point Objective): < 1 hour — max data loss in disaster
- **RTO** (Recovery Time Objective): < 4 hours — max time to restore service
- **Backup**: Daily full backup; continuous WAL/replication for database
- **Restore**: Tested quarterly; documented runbook
Fault Tolerance
### REL-3: Fault Tolerance
- **Dependency failure**: Circuit breaker for external services; timeout 30s
- **Retry**: Exponential backoff for transient failures; max 3 retries
- **Degradation**: If email service down, queue emails; if search down, show "search unavailable"
- **Health checks**: Liveness and readiness probes; fail fast on critical failure
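The retry rule in REL-3 (exponential backoff, max 3 retries) can be sketched in a few lines. `TransientError` is a hypothetical stand-in for a timeout or 503 from a dependency:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or 503 from an external dependency."""

def with_retries(op, max_retries=3, base_delay=0.5):
    """Per REL-3: up to 3 retries with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_retries:
                raise  # exhausted; let the caller or circuit breaker decide
            # 0.5s, 1s, 2s (plus jitter) before retries 1, 2, 3
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo: an operation that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("dependency timed out")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after 2 retries
```

The jitter matters: without it, many instances retrying in lockstep can hammer a recovering dependency at the same instant (a "thundering herd").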
Idempotency
### REL-4: Idempotency
- **Mutations**: Support idempotency key for create/update operations
- **Duplicate requests**: Same key within 24 hours returns original result
- **Critical for**: Payments, invitations, webhooks
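REL-4's "same key returns the original result" behavior can be sketched with an in-memory store. This is a single-process illustration; a real deployment would use Redis or a database row with an atomic insert so the guarantee holds across instances:

```python
import time

class IdempotencyStore:
    """Per REL-4: the first request with a key executes; duplicates within
    the TTL window get the original result back without re-executing."""
    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._results = {}  # key -> (result, stored_at)

    def execute(self, key, operation):
        entry = self._results.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]  # duplicate: replay the stored result
        result = operation()
        self._results[key] = (result, time.monotonic())
        return result

# Demo: a "create invitation" retried with the same idempotency key
store = IdempotencyStore()
created = []
def create_invite():
    created.append("invite")
    return {"id": len(created)}

a = store.execute("req-123", create_invite)
b = store.execute("req-123", create_invite)
print(a, b, len(created))  # same result both times; operation ran once
```

This is exactly why the NFR lists payments, invitations, and webhooks as critical: a client retrying after a network blip must not charge a card or send an invite twice.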
Quality Attribute 5: Observability
Observability NFRs define what to log, what to measure, and when to alert.
Logging
## Observability NFRs
### OBS-1: Logging
- **Format**: Structured JSON; fields: timestamp, level, service, message, trace_id,
user_id (hashed if PII), request_id, duration_ms, error (if applicable)
- **Levels**: ERROR (always), WARN (always), INFO (request lifecycle, key events),
DEBUG (disabled in prod unless feature-flagged)
- **Exclusions**: No passwords, tokens, or PII in logs
- **Retention**: 7 days hot; 30 days warm; 1 year cold for audit logs
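The OBS-1 rules (structured JSON, hashed user_id, no secrets) can be sketched with the standard library; real services would use a structured logger such as structlog or Pino, but the redaction logic is the same idea. The field names below follow the spec; the redaction list is an assumed minimum:

```python
import hashlib
import json
import time

REDACT_FIELDS = {"password", "token", "email"}  # never logged per OBS-1

def log_event(level, message, **fields):
    """Emit one structured JSON log line. PII fields are redacted and
    user_id is hashed so logs stay joinable without exposing identity."""
    record = {"timestamp": time.time(), "level": level, "message": message}
    for key, value in fields.items():
        if key in REDACT_FIELDS:
            record[key] = "[REDACTED]"
        elif key == "user_id":
            record[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            record[key] = value
    print(json.dumps(record))
    return record

rec = log_event("INFO", "login", user_id="u-42", password="hunter2",
                trace_id="abc123", duration_ms=87)
```

Hashing (rather than dropping) user_id preserves the ability to correlate all events for one user during an incident while keeping the raw identifier out of log storage.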
Metrics
### OBS-2: Metrics
- **Request metrics**: Count, latency (p50, p95, p99), error rate per endpoint
- **Business metrics**: Active users, actions per feature, conversion funnel
- **Infrastructure**: CPU, memory, disk, network per service
- **Custom**: Feature-specific (e.g., cart_operations_total, invitations_sent_total)
- **Cardinality**: Avoid high-cardinality labels (e.g., user_id) on high-volume metrics
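The cardinality rule is worth enforcing in code, not just in review. A minimal counter registry that rejects high-cardinality labels outright; in practice you would use a metrics client like prometheus_client, and the forbidden-label list here is an assumed example:

```python
from collections import Counter

class Metrics:
    """Minimal counter registry. Labels like endpoint or status are fine;
    per-user or per-request labels would explode cardinality."""
    FORBIDDEN_LABELS = {"user_id", "request_id"}

    def __init__(self):
        self.counters = Counter()

    def inc(self, name, **labels):
        bad = self.FORBIDDEN_LABELS & labels.keys()
        if bad:
            raise ValueError(f"high-cardinality label(s) not allowed: {bad}")
        key = (name, tuple(sorted(labels.items())))
        self.counters[key] += 1

m = Metrics()
m.inc("http_requests_total", endpoint="/projects", status="200")
m.inc("http_requests_total", endpoint="/projects", status="200")
m.inc("http_requests_total", endpoint="/tasks", status="500")
```

Each unique label combination becomes its own time series in the metrics backend; a user_id label on a request counter would create one series per user and overwhelm storage.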
Tracing
### OBS-3: Tracing
- **Distributed tracing**: Trace ID propagated across services
- **Spans**: Key operations (HTTP, DB, cache, external API)
- **Sampling**: 1% in prod; 100% in staging; configurable per service
- **Retention**: 7 days
Alerts
### OBS-4: Alerts
- **Error rate**: Alert if error rate > 5% for 5 minutes
- **Latency**: Alert if p95 > 2x baseline for 5 minutes
- **Availability**: Alert if health check fails 3 times in 5 minutes
- **Business**: Alert if payment failure rate > 1% for 10 minutes
- **On-call**: Every alert has runbook and owner
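A rule like "error rate > 5% for 5 minutes" is a threshold evaluated over a sliding window. A sketch of that evaluation; real systems do this in the metrics backend (e.g., a Prometheus alert rule), and the class here is an invented illustration:

```python
from collections import deque

class ErrorRateAlert:
    """OBS-4 sketch: fire when the error rate over the last `window`
    seconds exceeds `threshold`."""
    def __init__(self, threshold=0.05, window=300):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # (timestamp, is_error)

    def record(self, ts, is_error):
        self.events.append((ts, is_error))
        # drop events that have aged out of the window
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()

    def firing(self):
        if not self.events:
            return False
        errors = sum(1 for _, is_err in self.events if is_err)
        return errors / len(self.events) > self.threshold

# Demo: 100 requests in the window, 6 of them errors (6% > 5%)
alert = ErrorRateAlert()
for i in range(94):
    alert.record(float(i), False)
for i in range(94, 100):
    alert.record(float(i), True)
print(alert.firing())  # True
```

The "for 5 minutes" clause is what prevents paging on a single bad request: the rate must stay elevated across the whole window before anyone is woken up.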
Tutorial: NFRs for the Collaboration Platform
The running project is a real-time collaboration platform — a multi-user project management tool with chat, task boards, and notifications. Let's write comprehensive NFRs for it.
Context
- Users: Teams of 2–50; some enterprises up to 500
- Features: Projects, tasks, chat, real-time presence, notifications
- Deployment: Cloud (AWS/GCP); single region for MVP
Performance NFRs
# Collaboration Platform — Non-Functional Requirements
## Performance
### P-1: API Latency
- **Read operations** (GET): p50 < 100ms, p95 < 200ms, p99 < 500ms
- **Write operations** (POST, PATCH, DELETE): p95 < 300ms, p99 < 1s
- **Real-time** (WebSocket): Message delivery < 100ms p95
- **Excluded**: Bulk export, full-text search (separate SLA: p95 < 5s)
### P-2: Frontend
- **LCP**: < 2.5s on 4G
- **TTI** (Time to Interactive): < 4s
- **Bundle**: Initial < 300 KB gzipped; route chunks < 100 KB
### P-3: Real-Time Collaboration
- **Presence**: Heartbeat every 15s; away after 45s
- **Cursor sync**: < 150ms from move to broadcast
- **Typing indicator**: < 200ms
- **Conflict resolution**: Last-write-wins with vector clocks; merge for text
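The presence rule in P-3 (heartbeat every 15s, away after 45s) translates into a small amount of server-side state. A sketch, with the class name and timestamps invented for illustration:

```python
import time

AWAY_AFTER = 45  # seconds without a heartbeat, per tutorial P-3

class Presence:
    """Track the last heartbeat per user; derive status on demand."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, user_id, now=None):
        self.last_seen[user_id] = time.monotonic() if now is None else now

    def status(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(user_id)
        if seen is None:
            return "offline"
        return "online" if now - seen < AWAY_AFTER else "away"

# Demo with explicit timestamps (seconds)
p = Presence()
p.heartbeat("alice", now=0.0)
print(p.status("alice", now=10.0))  # online: last heartbeat 10s ago
print(p.status("alice", now=50.0))  # away: 50s > 45s timeout
```

The 15s/45s ratio gives each client three missed heartbeats of slack before it is marked away, so one dropped packet does not flap presence state.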
Security NFRs
## Security
### SEC-1: Authentication
- JWT with 15-min access, 7-day refresh
- Password: min 12 chars, bcrypt cost 12
- OAuth2 (Google, GitHub) for SSO
- Session invalidation on password change and logout
### SEC-2: Authorization
- Roles: Viewer, Member, Admin, Owner per workspace
- Project-level: Inherit workspace role; override for private projects
- Chat: Members can read/write; Viewers read-only
### SEC-3: Encryption
- TLS 1.3 for all traffic
- AES-256 for at-rest: user data, messages, file metadata
- Secrets in AWS Secrets Manager / GCP Secret Manager
### SEC-4: Audit
- Log: Login, logout, invite, remove member, role change, project create/delete
- Retention: 90 days
- No message content in audit logs (privacy)
Scalability NFRs
## Scalability
### SCL-1: Concurrent Users
- **Target**: 5,000 concurrent users (MVP)
- **Per workspace**: 100 concurrent (real-time)
- **WebSocket**: 500 connections per server instance
- **Growth**: 2x in 12 months
### SCL-2: Data Volume
- **Workspaces**: 50,000
- **Projects**: 500,000; 100 per workspace
- **Tasks**: 10 million
- **Messages**: 50 million; 2-year retention
- **Files**: 1 TB total; 100 MB per file
### SCL-3: Horizontal Scaling
- Stateless API and WebSocket servers
- Redis for presence and pub/sub
- PostgreSQL with read replicas for read-heavy queries
- S3/GCS for file storage
Reliability NFRs
## Reliability
### REL-1: Availability
- 99.5% uptime (MVP); 99.9% (enterprise)
- Graceful degradation: If real-time fails, fall back to polling
### REL-2: Recovery
- RPO < 1 hour; RTO < 4 hours
- Daily backups; point-in-time recovery for database
### REL-3: Fault Tolerance
- Circuit breaker for external services (email, storage)
- Retry with backoff for transient failures
- Queue notifications; retry on failure
Observability NFRs
## Observability
### OBS-1: Logs
- Structured JSON; trace_id on all request logs
- Log: Request (method, path, status, duration), errors (stack trace), key events (invite, create project)
- No PII in logs
### OBS-2: Metrics
- `http_requests_total`, `http_request_duration_seconds`
- `websocket_connections_active`, `websocket_messages_total`
- `projects_created_total`, `tasks_created_total`, `messages_sent_total`
- `notification_delivery_total`, `notification_failure_total`
### OBS-3: Alerts
- Error rate > 5% for 5 min
- p95 latency > 500ms for 5 min
- WebSocket connection failures > 10% for 5 min
- Database connection pool exhausted
NFRs vs. Constraints
NFRs and constraints are related but distinct:
| Aspect | NFR | Constraint |
|---|---|---|
| Focus | How well | What not to do |
| Example | "p95 latency < 200ms" | "Must not use synchronous external calls in request path" |
| Purpose | Define quality bar | Prevent bad patterns |
| AI usage | Drives architecture (caching, async) | Prevents specific implementations |
Relationship
Constraints often emerge from NFRs. If the NFR says "p95 < 200ms," a constraint might be "No N+1 queries" or "No synchronous external API calls in hot path." The NFR defines the goal; the constraint enforces a pattern that achieves it.
## NFR
P-1: API latency p95 < 200ms
## Derived Constraint
C-1: No synchronous calls to external services (email, analytics) in request path.
Use async/queue.
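Constraint C-1 in code terms: the request handler enqueues the work and returns; the slow call to the external provider happens elsewhere. A minimal sketch using an in-process queue (a real system would use a durable message queue; the function names here are hypothetical):

```python
import queue

email_queue = queue.Queue()

def handle_signup(user_email):
    """Request path per C-1: enqueue the welcome email and return at once.
    No synchronous call to the email provider happens here."""
    email_queue.put({"to": user_email, "template": "welcome"})
    return {"status": "created"}  # responds without waiting on email

def drain_one():
    """Worker side (normally a background process): send one queued email,
    with its own retries and circuit breaker, off the request path."""
    job = email_queue.get_nowait()
    # call the email provider here
    return job

resp = handle_signup("dev@example.com")
print(resp, email_queue.qsize())
```

This is the NFR-to-constraint chain in miniature: the p95 target made the synchronous email call unacceptable, and the constraint encodes the pattern that keeps the target achievable.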
When to Use Which
- NFR: When you need to specify a measurable target. "The system must support X."
- Constraint: When you need to prohibit a specific approach. "The system must not do Y."
How AI Uses NFRs
AI agents use NFRs to make architectural and implementation decisions. Understanding this helps you write NFRs that produce the right code.
Performance NFRs → Implementation Choices
| NFR | AI Implementation Choice |
|---|---|
| p95 < 200ms | Add caching, optimize queries, use async |
| Bundle < 200 KB | Code splitting, tree shaking, lazy load |
| WebSocket < 100ms | Minimize serialization; consider binary protocol |
Security NFRs → Implementation Choices
| NFR | AI Implementation Choice |
|---|---|
| JWT 15 min | Set exp claim; short-lived tokens |
| bcrypt cost 12 | Use cost factor 12 in bcrypt hash |
| No PII in logs | Sanitize log payloads; redact fields |
Scalability NFRs → Implementation Choices
| NFR | AI Implementation Choice |
|---|---|
| Stateless services | No in-memory session; use Redis |
| 500 WebSocket/instance | Connection limit; load balancer |
| Horizontal scaling | No local file storage; use object store |
Observability NFRs → Implementation Choices
| NFR | AI Implementation Choice |
|---|---|
| Structured logs | Use structured logger (e.g., Pino, structlog) |
| trace_id | Add middleware to propagate trace ID |
| Metrics per endpoint | Add metrics middleware; instrument handlers |
The NFR Checklist for AI
When you provide NFRs to AI, include:
- Quantitative targets — Numbers, not adjectives
- Scope — What is in/out of scope (e.g., "excluded: bulk export")
- Priority — Which NFRs are must-have vs. nice-to-have
- Trade-offs — If NFRs conflict (e.g., latency vs. consistency), which wins?
Writing Effective NFRs: A Template
Use this template for each quality attribute:
## [Quality Attribute]
### [Attribute]-1: [Name]
- **Target**: [Measurable target]
- **Scope**: [What it applies to]
- **Measurement**: [How to verify]
- **Priority**: Must-have / Nice-to-have
### [Attribute]-2: ...
Example: Filled Template
## Performance
### P-1: API Read Latency
- **Target**: p95 < 200ms for GET endpoints
- **Scope**: All read APIs; excluded: search, export, report
- **Measurement**: APM tool (Datadog, New Relic); percentile from production
- **Priority**: Must-have
### P-2: Frontend Bundle
- **Target**: Initial load < 300 KB gzipped
- **Scope**: Main app bundle; excluded: vendor chunks for admin-only features
- **Measurement**: Build output; Lighthouse
- **Priority**: Must-have
Try With AI
Prompt 1: NFR Generation
"I'm building a [describe system]. Generate non-functional requirements for: Performance (latency, throughput, bundle size), Security (auth, encryption, audit), Scalability (concurrent users, data volume), Reliability (availability, recovery), and Observability (logs, metrics, alerts). Make every requirement measurable. Use the format from this chapter."
Prompt 2: Vague to Measurable
"I have these vague NFRs: [paste]. Rewrite each as a measurable requirement. For each, specify: the metric, the target value, the measurement method, and the priority. Use industry benchmarks where appropriate."
Prompt 3: NFR-Driven Architecture
"Given these NFRs: [paste]. What architectural decisions do they imply? List: (1) technology choices, (2) patterns to use, (3) patterns to avoid, (4) infrastructure requirements. Explain the reasoning for each."
Prompt 4: Constraint Derivation
"From these NFRs: [paste]. Derive 5-10 implementation constraints that would help achieve them. Format as 'Must not X' or 'Must Y'. Explain how each constraint supports the NFR."
Practice Exercises
Exercise 1: NFR Audit
Take an existing project (or the collaboration platform spec). List all implied NFRs that are currently undocumented. For each, write a measurable NFR using this chapter's format. Identify gaps (e.g., no security NFRs, no scalability targets).
Expected outcome: A complete NFR document. You will discover that most projects have implicit NFRs that were never written down.
Exercise 2: Trade-Off Analysis
Choose two NFRs that can conflict: e.g., "p95 < 100ms" vs. "Strong consistency." Research how systems resolve this (e.g., eventual consistency, read replicas). Write a short section for your spec: "When NFRs conflict, [attribute] takes precedence because [reason]."
Expected outcome: A documented trade-off policy. You will understand that NFRs are not always simultaneously achievable.
Exercise 3: NFR Validation
For the collaboration platform NFRs in this chapter, design a validation plan: How would you verify each NFR? What tools? What tests? Create a checklist: NFR → Validation method → Pass criteria.
Expected outcome: A validation matrix. You will see that measurable NFRs enable automated validation (load tests, security scans, monitoring).
Key Takeaways
- Non-functional requirements define how well the system performs: performance, security, scalability, reliability, observability. They are as critical as functional requirements.
- Measurability is mandatory. "Fast" is not an NFR. "p95 < 200ms" is. Every NFR must be specific, measurable, and testable.
- Five quality attributes cover production readiness: Performance (latency, throughput, bundle), Security (auth, encryption, audit), Scalability (users, data, geography), Reliability (uptime, recovery, fault tolerance), Observability (logs, metrics, alerts).
- NFRs drive architecture. AI uses NFRs to choose caching, async patterns, horizontal scaling, and instrumentation. Vague NFRs produce generic implementations.
- NFRs and constraints are related: NFRs define targets; constraints prohibit patterns that would violate them. Constraints often derive from NFRs.
- Document trade-offs when NFRs conflict. "Latency over consistency" or "Security over convenience" — make the priority explicit.
Chapter Quiz
- Why must non-functional requirements be measurable? Give an example of a vague NFR and its measurable equivalent.
- What are the five core quality attributes for NFRs, and what does each address?
- Write a performance NFR for API latency that specifies p50, p95, and p99. Why use percentiles instead of average?
- What is the difference between RPO and RTO? How do they relate to backup and recovery strategies?
- For a collaboration platform with real-time chat, what observability metrics would you define? List at least 5.
- How does the "No PII in logs" security NFR affect implementation? What would an AI need to do to comply?
- Explain the relationship between NFRs and constraints. When would you derive a constraint from an NFR?
- How do scalability NFRs (e.g., "10,000 concurrent users") influence the choice of session storage (in-memory vs. Redis)?