Chapter 29: AI Governance and CI/CD Integration
Learning Objectives
By the end of this chapter, you will be able to:
- Define AI governance and explain why it is essential for AI-generated code
- Implement governance rules: mandatory code review, security scanning, risk-tiered test coverage thresholds
- Add LLM-specific security controls (prompt injection, tool abuse, data exfiltration)
- Design and implement a spec-validated CI/CD pipeline with six stages
- Configure spec validation gates that block deployment on spec violations
- Define AI output review policies: when human review is required vs. automated approval
- Address compliance: audit trails, traceability from spec to deployed code
- Build a complete spec-validated CI/CD pipeline using GitHub Actions
- Apply governance frameworks for different team sizes (startup, mid-size, enterprise)
What Is AI Governance?
AI governance is the set of policies, processes, and controls that ensure AI-generated code remains safe, compliant, and high-quality. When AI writes code, you lose the implicit trust that comes from human authorship. Governance restores that trust through verification.
Without governance, AI-generated code can introduce:
- Security vulnerabilities — Hardcoded secrets, SQL injection, insecure dependencies
- Compliance violations — Missing audit trails, data handling that violates regulations
- Quality regressions — Untested code paths, broken contracts, performance degradation
- Spec drift — Code that diverges from specifications without detection
AI governance answers: How do we ensure that what AI produces is fit for production?
Core Principles
- Never trust, always verify — AI output is treated as untrusted until it passes all gates.
- Specification as contract — The spec is the source of truth; code must conform.
- Automated gates over manual review — Automate what can be automated; reserve human review for high-risk changes.
- Traceability — Every deployed artifact links back to its specification and generation context.
- Auditability — Decisions (approve, reject, override) are logged and reviewable.
Governance Rules
Governance rules are the concrete policies that enforce quality. They define what must pass before code reaches production.
Mandatory Code Review
Rule: All AI-generated code must be reviewed by a human before merge.
Rationale: AI can produce plausible but incorrect code. Human review catches logic errors, design mismatches, and context-specific issues that automated tests miss.
Implementation:
- Require at least one approval on pull requests
- For AI-generated PRs: require review from someone who understands the spec
- Option: Require two approvals for changes touching security-critical paths (auth, payments, data handling)
Exception: Automated approval when all gates pass and change is low-risk (e.g., typo fixes, dependency updates within policy). Define "low-risk" explicitly.
Security Scanning
Rule: All code must pass security scans before deployment.
Scans:
- SAST (Static Application Security Testing) — Scans source code for vulnerabilities (e.g., Semgrep, CodeQL, SonarQube)
- DAST (Dynamic Application Security Testing) — Scans running application (e.g., OWASP ZAP)
- Dependency audit — Scans dependencies for known CVEs (e.g., npm audit, Snyk, Dependabot)
Threshold: Zero high/critical vulnerabilities. Medium/low may be allowed with documented exceptions and remediation plans.
Example policy:
security:
sast: required
dast: required-on-release
dependency-audit: required
fail-on: [critical, high]
allow-medium-with: documented-exception
LLM-Specific Security Controls
Traditional AppSec scanning is necessary but insufficient for agentic systems. Add explicit controls for:
- Prompt injection: Untrusted content (docs, tickets, webpages) attempting to override system instructions.
- Tool misuse / excessive agency: Agent calling tools outside task scope.
- Data exfiltration: Sensitive data sent to external tools, logs, or model context.
- Untrusted retrieval poisoning: Malicious or stale context loaded through search/retrieval.
Policy examples:
llm-security:
prompt-injection-tests: required
tool-allowlist: required
network-egress-policy: restricted
pii-redaction-before-model: required
retrieval-source-trust-levels: required
Map these controls to threat models such as OWASP LLM Top 10 and enforce them as CI gates where possible.
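One of these controls, the tool allowlist, can be enforced mechanically in CI. The sketch below fails when an agent's tool-call log contains a tool outside the allowlist; the JSON-lines log format, file name, and tool names are assumptions for illustration, not a standard:

```shell
#!/bin/sh
# Sketch: fail a CI gate if the agent invoked any tool outside the allowlist.
# The log format below is hypothetical; adapt it to your agent framework.
cat > agent-tool-log.jsonl <<'EOF'
{"tool": "read_file", "args": {"path": "src/index.ts"}}
{"tool": "run_tests", "args": {}}
{"tool": "http_request", "args": {"url": "https://example.com"}}
EOF

ALLOWLIST='read_file write_file run_tests'

VIOLATIONS=0
while read -r line; do
  tool=$(printf '%s' "$line" | jq -r '.tool')
  case " $ALLOWLIST " in
    *" $tool "*) ;;  # tool is on the allowlist
    *) echo "DISALLOWED TOOL: $tool"; VIOLATIONS=$((VIOLATIONS + 1)) ;;
  esac
done < agent-tool-log.jsonl

echo "violations=$VIOLATIONS"   # in CI: [ $VIOLATIONS -eq 0 ] || exit 1
```

The same shape works for the other controls: emit a structured log from the agent run, then assert policy over it as a pipeline step.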
Test Coverage Thresholds
Rule: Coverage thresholds should be risk-tiered. Use 90%+ for high-risk paths; lower thresholds may be acceptable for low-risk changes with strong contract/integration tests.
Rationale: AI-generated code is more likely to have untested edge cases. High coverage reduces risk.
Implementation:
- Measure coverage for files touched in the PR
- Block merge if coverage drops below the threshold for the touched risk tier
- Alternative: Measure specification coverage — percentage of spec requirements with passing tests (see Chapter 25)
Example:
coverage:
low-risk-minimum: 75
medium-risk-minimum: 85
high-risk-minimum: 90
scope: changed-files
metric: line-coverage # or spec-coverage
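The changed-files scope can be enforced with a short script. This sketch assumes the Jest/Vitest `coverage-summary.json` format; the fixture data and hardcoded file list stand in for a real coverage run and `git diff --name-only`:

```shell
#!/bin/sh
# Sketch: per-file coverage gate for changed files.
# Fixture in Jest/Vitest coverage-summary.json shape.
cat > coverage-summary.json <<'EOF'
{
  "total": {"lines": {"pct": 91.2}},
  "src/auth/login.ts": {"lines": {"pct": 95.0}},
  "src/bookmarks/list.ts": {"lines": {"pct": 72.5}}
}
EOF

# In CI this would be: CHANGED_FILES=$(git diff --name-only origin/main...HEAD)
CHANGED_FILES="src/auth/login.ts src/bookmarks/list.ts"
THRESHOLD=90

FAILED=0
for f in $CHANGED_FILES; do
  pct=$(jq -r --arg f "$f" '.[$f].lines.pct // empty' coverage-summary.json)
  [ -n "$pct" ] || continue   # file has no coverage entry (e.g., config)
  if awk -v p="$pct" -v t="$THRESHOLD" 'BEGIN{exit !(p < t)}'; then
    echo "BELOW THRESHOLD: $f ($pct% < $THRESHOLD%)"
    FAILED=1
  fi
done
echo "failed=$FAILED"   # in CI: [ $FAILED -eq 0 ] || exit 1
```

For risk tiering, the threshold would be selected per file by matching it against the high-risk path patterns before the comparison.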
Spec Validation
Rule: Code must conform to the specification. Spec violations block deployment.
Implementation:
- Contract tests validate API against OpenAPI spec
- Integration tests validate against acceptance criteria
- Drift detection: compare generated code structure to spec (e.g., required endpoints, fields)
LLM-as-a-Judge in CI/CD
Rule: Automate the qualitative review of PRs using an LLM configured as a "Review Agent."
Rationale: While linters and tests catch binary pass/fail conditions, an LLM can grade the intent of the code against the specification rubric (from Chapter 9).
Implementation:
- Add a CI step that triggers a Review Agent on PR creation.
- The agent reads spec.md and the PR diff.
- It automatically grades the PR and leaves comments on lines that violate Constraints or fail to meet Acceptance Criteria.
- If the LLM-as-a-Judge scores the PR below a threshold, the CI pipeline fails.
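The threshold check itself is ordinary CI scripting once the Review Agent emits a structured verdict. This sketch assumes a hypothetical verdict schema (a 0–10 score plus per-line comments) and elides the agent call that would produce it:

```shell
#!/bin/sh
# Sketch: gate on an LLM judge's verdict. The verdict format is an assumption;
# a real pipeline would produce review-verdict.json by invoking the Review Agent.
cat > review-verdict.json <<'EOF'
{
  "score": 6,
  "comments": [
    {"file": "src/bookmarks/create.ts", "line": 42,
     "issue": "violates constraint C-3: missing input validation"}
  ]
}
EOF

THRESHOLD=8
SCORE=$(jq -r '.score' review-verdict.json)

# Surface the judge's per-line comments in the CI log.
jq -r '.comments[] | "\(.file):\(.line) \(.issue)"' review-verdict.json

if [ "$SCORE" -lt "$THRESHOLD" ]; then
  echo "REVIEW GATE FAILED: score $SCORE < $THRESHOLD"
  GATE=fail   # in CI: exit 1
else
  GATE=pass
fi
echo "gate=$GATE"
```

Keeping the judge's output machine-readable is what makes it gateable; free-text reviews can still be posted as PR comments alongside.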
The Spec-Validated CI/CD Pipeline
The spec-validated pipeline extends traditional CI/CD with specification-centric gates. Every stage validates against the spec or its derived artifacts.
Pipeline Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ SPEC-VALIDATED CI/CD PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ 1. Specification Validation → Spec complete, consistent, lint-clean │
│ 2. Code Generation → AI generates from spec (or human writes) │
│ 3. Test Execution → Contract, integration, e2e, property │
│ 4. Security Scanning → SAST, DAST, dependency audit │
│ 5. Performance Validation → Latency budgets, bundle size │
│ 6. Deployment → If all gates pass, deploy with spec tag │
└─────────────────────────────────────────────────────────────────────────────┘
Stage 1: Specification Validation
Purpose: Ensure the specification is complete, consistent, and ready for implementation.
Checks:
- Spec linting (structure, required sections, link validity)
- Completeness: all acceptance criteria have testable assertions
- Consistency: no conflicting requirements
- Contract validity: OpenAPI/AsyncAPI specs are valid
Tools: Custom spec linters, OpenAPI validators, markdown linting
Gate: Fail if spec is incomplete or invalid. Block code generation until spec passes.
Stage 2: Code Generation
Purpose: Produce code from the specification.
Process:
- AI generates code from spec (or human implements)
- Generated code is committed to a branch
- PR is opened for review
Governance: Code generation may happen outside CI (e.g., in IDE or agent). CI receives the result and validates it.
Stage 3: Test Execution
Purpose: Verify implementation satisfies the specification.
Test types (from Chapter 25–27):
- Contract tests — API responses match OpenAPI schema
- Integration tests — Acceptance criteria pass
- E2E tests — User journeys complete
- Property tests — Invariants hold for generated inputs
Gate: All tests must pass. No flaky test bypass without documented exception.
Stage 4: Security Scanning
Purpose: Detect vulnerabilities before deployment.
Scans:
- SAST on source code
- Dependency audit (npm audit, Snyk, etc.)
- DAST on staging (if available)
Gate: No critical/high vulnerabilities. Medium requires exception.
Stage 5: Performance Validation
Purpose: Ensure performance budgets are met.
Checks:
- Latency budgets (e.g., P95 < 200ms for API endpoints)
- Bundle size (e.g., main bundle < 500KB)
- Lighthouse scores (accessibility, performance)
Gate: Fail if budgets exceeded. Allow override with approval for justified cases.
Stage 6: Deployment
Purpose: Deploy only when all gates pass.
Process:
- Tag release with spec version (e.g., v1.2.3-spec-005-bookmarks)
- Deploy to staging, then production
- Record traceability: deployment → spec → commit
Spec Validation Gates
Spec validation gates are the mechanisms that block deployment when code diverges from the specification.
Gate Types
| Gate | What It Checks | Failure Action |
|---|---|---|
| Spec lint | Spec structure, completeness | Block PR; fix spec |
| Contract test | API matches OpenAPI | Block merge |
| Integration test | Acceptance criteria pass | Block merge |
| Coverage | Tests cover spec requirements | Block merge if below threshold |
| LLM security gate | Prompt injection resilience, tool allowlists, redaction policies | Block merge on policy violation |
| Drift detection | Code structure matches spec | Alert or block |
Drift Detection
Drift occurs when code diverges from the spec without the spec being updated. Detection strategies:
- Contract tests — If API behavior changes, contract tests fail. Update spec or fix code.
- Schema comparison — Compare generated types to spec schema; flag mismatches.
- Requirement traceability — Each test links to a requirement. If code changes break a test, the requirement is violated.
- Manual review — Reviewer checks that implementation matches spec.
Automated drift detection (advanced): Parse spec for declared endpoints, fields, behaviors. Compare to implementation. Flag additions or removals not in spec.
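A minimal version of this comparison can be done with shell tools. The sketch below extracts path keys from an OpenAPI file with a naive sed parse (a robust version would use a real YAML parser) and flags spec paths that have no matching route string; the fixture files stand in for specs/ and src/:

```shell
#!/bin/sh
# Sketch: flag OpenAPI paths with no matching route in the implementation.
cat > openapi.yaml <<'EOF'
paths:
  /bookmarks:
    get: {}
    post: {}
  /bookmarks/{id}:
    delete: {}
EOF
cat > routes.ts <<'EOF'
app.get("/bookmarks", listBookmarks);
app.post("/bookmarks", createBookmark);
EOF

# Naive parse: two-space-indented keys under `paths:` are the declared paths.
DRIFT=0
for path in $(sed -n 's/^  \(\/[^:]*\):$/\1/p' openapi.yaml); do
  if ! grep -q "\"$path\"" routes.ts; then
    echo "DRIFT: $path declared in spec but not implemented"
    DRIFT=$((DRIFT + 1))
  fi
done
echo "drift=$DRIFT"   # in CI: [ $DRIFT -eq 0 ] || exit 1
```

A symmetric pass (routes present in code but absent from the spec) catches undeclared additions, the other half of drift.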
AI Output Review Policies
When is human review required? When can automated approval suffice?
Policy Matrix
| Change Type | Risk | Human Review | Automated Approval |
|---|---|---|---|
| Typo fix, comment | Low | Optional | Yes, if tests pass |
| New feature (from spec) | Medium | Required | No |
| Bug fix (small scope) | Medium | Required | No (or 1 approval) |
| Security-related | High | Required (2 approvals) | Never |
| Dependency update | Medium | Required for major | Minor/patch: automated if no CVEs |
| Refactor (no behavior change) | Low | Optional | Yes, if tests pass |
High-Risk Paths
Define paths that always require human review:
- Authentication, authorization
- Payment processing
- Data export, deletion, PII handling
- Configuration that affects production behavior
Example:
high-risk-paths:
- "**/auth/**"
- "**/payments/**"
- "**/user-data/**"
- "**/config/production*"
review-required: always
min-approvals: 2
Automated Approval Criteria
Automated approval (e.g., Dependabot auto-merge) may be allowed when:
- All tests pass
- No security findings
- Change is in allowlist (e.g., patch version bumps)
- No high-risk paths touched
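These criteria can be combined into a small eligibility script. This sketch tests only the high-risk-path criterion; the hardcoded file list stands in for `git diff --name-only` output, and the patterns mirror the example policy above:

```shell
#!/bin/sh
# Sketch: decide auto-merge eligibility from the changed-file list.
# In CI: CHANGED_FILES=$(git diff --name-only origin/main...HEAD)
CHANGED_FILES="src/payments/charge.ts docs/README.md"

ELIGIBLE=yes
for f in $CHANGED_FILES; do
  case "$f" in
    */auth/*|*/payments/*|*/user-data/*|*/config/production*)
      echo "HIGH-RISK PATH TOUCHED: $f"
      ELIGIBLE=no
      ;;
  esac
done
echo "auto-approve=$ELIGIBLE"   # 'no' routes the PR to human review
```

The remaining criteria (tests, security findings, allowlist) are already CI job outcomes, so the final decision is just a conjunction of job statuses plus this path check.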
Compliance Considerations
Enterprises often face regulatory requirements. SDD can support compliance through traceability and audit trails.
Audit Trails
What to log:
- Specification versions and changes
- Code generation events (when, from which spec, which model)
- Review decisions (approved, rejected, by whom)
- Deployment events (what was deployed, when, from which spec)
- Test results (pass/fail, coverage)
Retention: Per regulatory requirements (e.g., 7 years for financial).
Format: Structured logs (JSON) for querying. Store in SIEM or audit system.
Traceability
Spec → Code → Deployment:
- Each deployment links to: commit SHA, spec version, PR, reviewer
- Each spec links to: requirements, acceptance criteria, tests
- Each test links to: requirement ID
Compliance value: Auditors can trace any production behavior back to a requirement and verify it was specified and tested.
Example Traceability Record
{
"deployment_id": "dep-2024-03-15-001",
"timestamp": "2024-03-15T14:32:00Z",
"spec_version": "005-bookmarks-v1.2",
"spec_path": "specs/005-bookmarks/spec.md",
"commit": "a1b2c3d",
"pr": "123",
"reviewer": "jane@example.com",
"tests_passed": true,
"security_scan": "passed",
"coverage": 94
}
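Stored as JSON lines, records like this can be queried directly during an audit. A sketch using jq, with field names taken from the example record above and an assumed deployments.jsonl log file:

```shell
#!/bin/sh
# Sketch: query an append-only log of deployment records (JSON lines).
cat > deployments.jsonl <<'EOF'
{"deployment_id":"dep-2024-03-14-002","spec_version":"004-tags-v1.0","coverage":88}
{"deployment_id":"dep-2024-03-15-001","spec_version":"005-bookmarks-v1.2","coverage":94}
EOF

# Audit question 1: which deployments shipped a given spec?
jq -c 'select(.spec_version | startswith("005-bookmarks"))' deployments.jsonl

# Audit question 2: which deployments shipped below the coverage bar?
LOW=$(jq -c 'select(.coverage < 90) | .deployment_id' deployments.jsonl)
echo "below-bar: $LOW"
```

The same records loaded into a SIEM support the retention and query requirements described above without any extra schema work.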
Tutorial: Build a Spec-Validated CI/CD Pipeline with GitHub Actions
This tutorial walks you through building a complete spec-validated CI/CD pipeline using GitHub Actions. You will implement: spec linting, contract tests, security scan, performance budget, and deployment with spec version tagging.
Prerequisites
- A GitHub repository with:
  - Specifications in specs/
  - OpenAPI contract in specs/005-bookmarks/contracts/openapi.yaml
  - Source code (e.g., Node.js/TypeScript API)
  - Tests (contract, integration)
- GitHub Actions enabled
Step 1: Create the Workflow File
Create .github/workflows/spec-validated-pipeline.yml:
name: Spec-Validated Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main, develop]
env:
NODE_VERSION: '20'
jobs:
# Stage 1: Specification Validation
spec-validation:
name: Spec Validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install spec linter
run: npm install -g @anthropic/spec-linter 2>/dev/null || echo "Using custom script"
- name: Lint specifications
run: |
echo "## Spec Lint Results" >> $GITHUB_STEP_SUMMARY
for spec in specs/**/spec.md; do
if [ -f "$spec" ]; then
echo "Linting $spec"
# Custom: check for required sections
grep -q "## Acceptance Criteria" "$spec" && echo "✓ $spec has AC" || (echo "✗ $spec missing AC" && exit 1)
grep -q "## Requirements" "$spec" && echo "✓ $spec has Requirements" || (echo "✗ $spec missing Requirements" && exit 1)
fi
done
- name: Validate OpenAPI contracts
run: |
npx @redocly/cli lint specs/**/contracts/*.yaml specs/**/contracts/*.json 2>/dev/null || true
# Alternative: use openapi-typescript or spectral
for contract in specs/**/contracts/*.yaml; do
if [ -f "$contract" ]; then
echo "Validating $contract"
npx @apidevtools/swagger-cli validate "$contract" || exit 1
fi
done
Step 2: Add Contract Test Job
# Stage 2 & 3: Build and Test
contract-tests:
name: Contract Tests
needs: spec-validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run contract tests
run: npm run test:contract
env:
CI: true
- name: Run integration tests
run: npm run test:integration
- name: Run all tests with coverage
run: npm run test:coverage
- name: Check coverage threshold
run: |
COVERAGE=$(cat coverage/coverage-summary.json | jq '.total.lines.pct')
if (( $(echo "$COVERAGE < 90" | bc -l) )); then
echo "Coverage $COVERAGE% is below 90% threshold"
exit 1
fi
echo "Coverage $COVERAGE% meets threshold"
Step 3: Add Security Scan Job
# Stage 4: Security Scanning
security-scan:
name: Security Scan
needs: spec-validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run npm audit
run: npm audit --audit-level=high
continue-on-error: false
- name: Run Semgrep (SAST)
uses: returntocorp/semgrep-action@v1
with:
config: >-
p/security-audit
p/typescript
env:
SEMGREP_RULES: auto
- name: Run Snyk (optional)
uses: snyk/actions/node@master
continue-on-error: true
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
args: --severity-threshold=high
Step 4: Add Performance Budget Job
# Stage 5: Performance Validation
performance-budget:
name: Performance Budget
needs: [contract-tests, security-scan]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'
- name: Build
run: npm run build
- name: Check bundle size
run: |
# For frontend: check bundle size
if [ -d "dist" ]; then
SIZE=$(du -sb dist | cut -f1)
MAX_SIZE=524288 # 512KB
if [ $SIZE -gt $MAX_SIZE ]; then
echo "Bundle size $SIZE exceeds $MAX_SIZE bytes"
exit 1
fi
fi
- name: API latency check (if applicable)
run: |
# Start server, run latency test
npm run start &
sleep 5
# Use curl or k6 for latency
LATENCY=$(curl -o /dev/null -s -w '%{time_total}' http://localhost:3000/health || echo "1")
echo "Latency: ${LATENCY}s"
# Fail if > 200ms
if (( $(echo "$LATENCY > 0.2" | bc -l) )); then
echo "Latency exceeds 200ms budget"
exit 1
fi
Step 5: Add Deployment Job with Spec Version Tagging
# Stage 6: Deployment
deploy:
name: Deploy
needs: [contract-tests, security-scan, performance-budget]
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Get spec version
id: spec
run: |
SPEC_VERSION=$(git describe --tags --always)
echo "version=$SPEC_VERSION" >> $GITHUB_OUTPUT
echo "Spec version: $SPEC_VERSION"
- name: Create deployment record
run: |
echo '{"deployment":"'${{ github.run_id }}'","spec":"'${{ steps.spec.outputs.version }}'","commit":"'${{ github.sha }}'"}' > deployment-record.json
- name: Deploy to staging
run: |
# Your deployment command, e.g.:
# npm run deploy:staging
echo "Deploying to staging with spec tag ${{ steps.spec.outputs.version }}"
- name: Deploy to production (manual approval in production)
if: false # Enable for production with environment protection
run: echo "Production deployment would run here"
Step 6: Add Spec Linter Script
Create scripts/spec-lint.sh for more robust spec validation:
#!/bin/bash
set -e
ERRORS=0
for spec in specs/*/spec.md; do
[ -f "$spec" ] || continue
echo "Linting $spec..."
# Required sections
for section in "## Acceptance Criteria" "## Requirements" "## Overview"; do
if ! grep -q "$section" "$spec"; then
echo " ERROR: Missing $section"
ERRORS=$((ERRORS + 1))
fi
done
# Check for at least one requirement
if ! grep -Eq "FR-|AC-|NFR-" "$spec"; then
echo " ERROR: No requirements found (FR-, AC-, NFR-)"
ERRORS=$((ERRORS + 1))
fi
done
[ $ERRORS -eq 0 ] || exit 1
Step 7: Configure Package.json Scripts
Ensure your package.json has:
{
"scripts": {
"test:contract": "vitest run tests/contract",
"test:integration": "vitest run tests/integration",
"test:coverage": "vitest run --coverage",
"spec:lint": "bash scripts/spec-lint.sh"
}
}
Step 8: Add Branch Protection
In GitHub: Settings → Branches → Add rule for main:
- Require status checks: spec-validation, contract-tests, security-scan, performance-budget
- Require pull request before merging
- Require 1 approval (or 2 for high-risk paths)
Governance Frameworks by Team Size
Governance should scale with team size and risk. One size does not fit all.
Startup (1–10 engineers)
Focus: Speed with safety. Minimal overhead.
Governance:
- Spec lint + contract tests in CI
- 1 mandatory review (can be async)
- npm audit (basic dependency check)
- Coverage: 80% (or skip for MVP)
- No formal audit trail; use git history
Pipeline: 3 stages — spec validation, test, deploy
Mid-Size (10–50 engineers)
Focus: Consistency and quality. Some compliance.
Governance:
- Full 6-stage pipeline
- 1–2 approvals depending on path
- SAST + dependency audit
- Coverage: 90%
- Basic traceability: deployment → spec → commit in release notes
Pipeline: All 6 stages
Enterprise (50+ engineers)
Focus: Compliance, auditability, risk management.
Governance:
- Full pipeline + DAST, penetration testing
- 2 approvals for security-critical paths
- Coverage: 90%+; spec coverage tracked
- Full audit trail: SIEM integration, retention policy
- Governance Engineer role
- Regular compliance reviews
Pipeline: All 6 stages + compliance gates, manual approval for production
Try With AI
Prompt 1: Governance Policy Design
"Our team is adopting SDD. We have 15 engineers and handle user PII. Design an AI output review policy: when is human review required vs. automated approval? Include a table of change types and risk levels. Suggest high-risk paths that always need review."
Prompt 2: Pipeline Gate Configuration
"I have a GitHub Actions workflow. Add a spec validation stage that: (1) lints all markdown specs in specs/ for required sections (Overview, Requirements, Acceptance Criteria), (2) validates OpenAPI files in specs/**/contracts/. Show the YAML and any scripts needed."
Prompt 3: Traceability Implementation
"We need audit trail for compliance. Design a JSON schema for deployment records that links: deployment ID, timestamp, spec version, spec path, commit SHA, PR number, reviewer, test results, security scan result. Include how to store and query these records."
Prompt 4: Drift Detection Strategy
"How can we detect when code diverges from the specification without the spec being updated? List 3–4 strategies (e.g., contract tests, schema comparison). For each, explain what it catches and what it might miss."
Practice Exercises
Exercise 1: Add a Spec Validation Gate
Take an existing project with specs. Add a GitHub Actions job (or equivalent) that:
- Lints specs for required sections
- Validates OpenAPI contracts
- Fails the pipeline if either fails
Run it on a PR and verify it blocks when you introduce an invalid spec.
Expected outcome: A working spec validation gate in CI.
Exercise 2: Define Your Review Policy
For your team (or a hypothetical team), create an AI output review policy document. Include:
- Change types and risk levels
- When human review is required
- High-risk paths
- Automated approval criteria
Expected outcome: A 1–2 page policy document.
Exercise 3: Implement a Coverage Gate
Add a coverage threshold check to your CI pipeline. If coverage drops below 90% (or 80% for existing projects) for changed files, fail the build. Use your test runner's coverage output (e.g., Vitest, Jest, pytest-cov).
Expected outcome: CI fails when coverage drops below threshold.
Key Takeaways
- AI governance ensures AI-generated code is safe, compliant, and high-quality through policies, gates, and verification. Never trust, always verify.
- Governance rules include mandatory code review, security scanning (SAST, DAST, dependency audit), risk-tiered test coverage thresholds, and spec validation. Define when human review is required vs. automated approval.
- The spec-validated CI/CD pipeline has six stages: specification validation, code generation, test execution, security scanning, performance validation, and deployment. Each stage can block deployment.
- Spec validation gates block deployment when code diverges from the spec. Use contract tests, integration tests, coverage thresholds, and drift detection.
- Compliance requires audit trails and traceability: deployment → spec → commit. Log review decisions, generation events, and test results. Retain per regulatory requirements.
- Governance scales with team size: startups need minimal overhead; enterprises need full auditability and compliance. Adapt the framework to your context.
Chapter Quiz
- What is AI governance, and why is it essential when using AI-generated code?
- List the six stages of the spec-validated CI/CD pipeline. What does each stage validate?
- When should human review be required for AI-generated code? When might automated approval be acceptable?
- What is drift detection? Name two strategies for detecting when code diverges from the specification.
- What compliance considerations apply to SDD? What should an audit trail include?
- How would you configure a coverage gate to block merge when coverage drops below 90% for changed files?
- Compare governance frameworks for a 5-person startup vs. a 100-person enterprise. What differs?
- Explain the traceability chain from specification to deployed code. Why is it valuable for compliance?
Back to: Part X Overview | Next: Chapter 30 — Metrics and Engineering Roles