CASE STUDY — Part VII: Production RAG Assistant (Eval, Drift, Safety)
SCENARIO
You're building a support assistant that answers questions from internal docs.
It must:
- cite sources
- avoid hallucinations
- handle doc drift
- be measurable
Why RAG is hard in production: RAG (Retrieval-Augmented Generation) combines retrieval (fetch relevant chunks) with generation (LLM produces answer). Each layer can fail: retrieval fetches wrong chunks, chunks are stale, the model ignores them and hallucinates, or the model refuses valid questions. In dev, a few hand-picked queries look great. In production, long-tail questions, adversarial inputs, and constantly changing docs expose weaknesses. You need eval, safety, drift detection, and phased rollout—or you ship a system that confidently gives wrong answers.
ARCHITECTURE
RAG pipeline:
- Ingest: Docs → chunk → embed → index (vector store).
- Query: User question → embed → retrieve top-k chunks.
- Generate: Compose prompt with chunks + question → LLM → answer + citations.
Architecture diagram:
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│    Docs     │─────►│   Chunker    │─────►│  Embedder   │
│ (Markdown,  │      │  (sentence,  │      │  (model X)  │
│    PDF)     │      │  paragraph)  │      │             │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                                                  ▼
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│    User     │─────►│  Retriever   │◄─────│   Vector    │
│  Question   │      │   (top-k)    │      │    Index    │
└─────────────┘      └──────┬───────┘      └─────────────┘
                            │
                            ▼
                  ┌──────────────────┐
                  │ Prompt + Chunks  │────► LLM ────► Answer + Citations
                  └──────────────────┘
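A minimal query-time sketch of this pipeline, assuming chunks are already embedded and indexed in memory; embed_fn (text → vector) and llm_fn (prompt → answer) are placeholder callables for whatever embedding and LLM provider you use:

# Query-time RAG sketch (embed_fn / llm_fn are placeholders)
import numpy as np

def retrieve_chunks(question, chunk_texts, chunk_vectors, embed_fn, k=5):
    # Embed the question and rank indexed chunks by cosine similarity.
    q = np.asarray(embed_fn(question), dtype=float)
    m = np.asarray(chunk_vectors, dtype=float)            # shape: (num_chunks, dim)
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    return [chunk_texts[i] for i in top]

def generate_answer(question, chunk_texts, chunk_vectors, embed_fn, llm_fn, k=5):
    # Compose a grounded prompt from the retrieved chunks and call the LLM.
    chunks = retrieve_chunks(question, chunk_texts, chunk_vectors, embed_fn, k)
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using ONLY the context below and cite sources as [n]. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_fn(prompt), chunks

A real deployment would use the vector store's own search API instead of brute-force cosine similarity, but the shape of the flow is the same.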
Chunking strategies: Sentence-based (small, precise), paragraph-based (more context), or semantic (split on topic boundaries). Tradeoff: smaller chunks = better retrieval precision, larger = more context per chunk. Start with 256–512 tokens; tune based on retrieval quality.
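One way to land in the 256–512 token range is greedy paragraph packing; a rough sketch that uses whitespace word count as a stand-in for real tokenization (swap in your tokenizer for accurate budgets):

# Paragraph-based chunker sketch (word count approximates tokens)
def chunk_paragraphs(text, max_tokens=400):
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())                   # crude token estimate
        if current and size + n > max_tokens:   # budget exceeded: close the chunk
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)                    # oversized paragraphs become their own chunk
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks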
Embedding model selection: Use the same model for index and query. Mixing models (e.g., index with model A, query with model B) degrades retrieval. Track embedding model version—when you upgrade, re-embed everything.
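A small sketch of that version tracking, assuming each indexed chunk carries a metadata dict (field and version names here are illustrative):

# Track the embedding model version with every stored vector
EMBEDDING_MODEL_VERSION = "embed-v2"    # illustrative version tag

def index_chunk(store, chunk_id, text, vector):
    # Store the vector together with the model version that produced it.
    store[chunk_id] = {
        "text": text,
        "vector": vector,
        "embedding_model": EMBEDDING_MODEL_VERSION,
    }

def check_index_compatibility(store):
    # Fail fast if any chunk was embedded with a different model version.
    stale = [cid for cid, c in store.items()
             if c["embedding_model"] != EMBEDDING_MODEL_VERSION]
    if stale:
        raise RuntimeError(
            f"{len(stale)} chunks need re-embedding with {EMBEDDING_MODEL_VERSION}")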
EVALUATION (BEFORE SHIPPING)
Create a golden set: 50–200 representative questions with expected key points. Include edge cases: out-of-scope questions, ambiguous questions, questions that should trigger "I don't know."
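A golden set can be a simple list checked into the repo; one possible shape that matches the eval loop below (questions and key points are made-up examples):

# Golden set: (question, expected key points); out-of-scope questions tracked separately
golden_set = [
    ("How do I reset my SSO password?",
     ["link to the SSO reset page", "mention the 24h token expiry"]),
    ("What is our refund policy for annual plans?",
     ["pro-rated refund", "cite the billing policy doc"]),
    ("What's the weather in Paris today?", []),           # should trigger a refusal
]
out_of_scope = {"What's the weather in Paris today?"}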
Metrics table:
| Metric | Definition | How to measure |
|---|---|---|
| Groundedness | Is every claim in the answer supported by retrieved chunks? | LLM-as-judge or rule-based: check if cited chunks contain the claim. |
| Faithfulness | Does the answer stay within the chunks? No external knowledge. | Same as groundedness; focus on "no unsupported claims." |
| Relevance | Does the answer address the question? | LLM-as-judge: score 1–5. |
| Retrieval quality | Did we fetch the right chunks? | Human eval: for each Q, are top-3 chunks relevant? |
| Refusal accuracy | For out-of-scope Q, does it say "I don't know" vs hallucinate? | Binary: correct refusal or not. |
Eval pipeline: Run golden set weekly (or on every deploy). Store scores in a dashboard. Regression: if groundedness drops > 5%, block deploy or investigate.
# Example eval loop (pseudocode; retrieve, generate, check_groundedness,
# llm_judge, and out_of_scope come from your pipeline)
results = []
for question, expected_key_points in golden_set:
    chunks = retrieve(question, k=5)                      # top-k retrieval
    answer = generate(question, chunks)                   # prompt + chunks -> LLM
    groundedness = check_groundedness(answer, chunks)     # claims supported by chunks?
    relevance = llm_judge(question, answer)               # 1-5 relevance score
    refusal_ok = (question in out_of_scope) == ("I don't know" in answer)
    results.append((question, groundedness, relevance, refusal_ok))
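A sketch of the regression gate, reading a stored baseline and treating the 5% threshold as 5 points on a 0–1 groundedness scale (file name and format are placeholders):

# Regression gate: block deploy if groundedness drops more than 5 points
import json, sys

GROUNDEDNESS_DROP_THRESHOLD = 0.05

def regression_gate(current_scores, baseline_path="eval_baseline.json"):
    current = sum(current_scores) / len(current_scores)
    with open(baseline_path) as f:
        baseline = json.load(f)["groundedness"]            # placeholder baseline format
    if current < baseline - GROUNDEDNESS_DROP_THRESHOLD:
        print(f"Groundedness regression: {baseline:.2f} -> {current:.2f}")
        sys.exit(1)                                         # non-zero exit blocks the CI deploy step
    print(f"Groundedness OK: {current:.2f} (baseline {baseline:.2f})")

Call it at the end of the eval run, e.g. regression_gate([g for _, g, _, _ in results]).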
SAFETY + GUARDRAILS
| Pattern | What it does | Example |
|---|---|---|
| Restrict tools/actions | Assistant can only call approved APIs. No arbitrary code execution. | Allow only search_docs, create_ticket. Block execute_shell. |
| Block prompt injection | Detect and reject inputs that try to override system prompt. | "Ignore previous instructions and..." → refuse. |
| Strip untrusted HTML | User pastes HTML; sanitize before including in prompt. | Remove <script>, onclick, etc. |
| Enforce "cite or refuse" | Every factual claim must have a citation. If no chunk supports it, refuse. | In prompt: "Only use information from the provided context. If unsure, say 'I don't know.'" |
| PII detection | Don't leak user PII in logs or to the model. | Redact SSN, credit cards before logging. |
| Output filtering | Post-process model output. Block toxic or off-topic content. | Keyword blocklist, classifier for harmful content. |
Input sanitization example: Truncate inputs to max length (e.g., 4K chars). Reject inputs with high ratio of special characters (potential injection). Escape or remove control characters.
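A minimal sanitizer along those lines (length limit and special-character threshold are illustrative; this is not a complete injection defense):

# Input sanitization sketch: truncate, strip control chars, reject suspicious inputs
import re

MAX_INPUT_CHARS = 4000

def sanitize_input(text):
    text = text[:MAX_INPUT_CHARS]
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)   # control chars (keep \t, \n, \r)
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if text and special / len(text) > 0.4:                      # high special-char ratio
        raise ValueError("Input rejected: possible injection payload")
    return text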
Output filtering: Run answer through a lightweight classifier before returning. If confidence of "harmful" > threshold, return a generic refusal.
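Sketch of that post-processing step, assuming harm_classifier is a placeholder callable returning a 0–1 harm probability from whatever moderation model you use:

# Output filter sketch: refuse instead of returning flagged answers
HARM_THRESHOLD = 0.8
REFUSAL_MESSAGE = "I can't help with that. Please contact support directly."

def filter_output(answer, harm_classifier):
    if harm_classifier(answer) > HARM_THRESHOLD:
        return REFUSAL_MESSAGE
    return answer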
DRIFT + MONITORING
Monitor (a minimal tracking sketch follows this list):
- Retrieval miss rate: % of queries where the top-k chunks contain no relevant chunk. High = chunking or embedding issue.
- Citation coverage: % of answers with at least one citation. Low = model ignoring context.
- User feedback: "Was this helpful?" thumbs up/down. Track by question type.
- Doc freshness lag: time from doc update to re-embed and re-index. Alert if > 24h for critical docs.
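An in-process tracker for the first two metrics; in production you would emit these counters to your metrics backend instead, and the citation check here is deliberately crude:

# Rolling counters for retrieval miss rate and citation coverage
from dataclasses import dataclass

@dataclass
class RagMetrics:
    queries: int = 0
    retrieval_misses: int = 0
    answers_with_citation: int = 0

    def record(self, retrieved_relevant: bool, answer: str):
        self.queries += 1
        if not retrieved_relevant:
            self.retrieval_misses += 1
        if "[" in answer and "]" in answer:        # crude citation marker check, e.g. "[2]"
            self.answers_with_citation += 1

    def miss_rate(self) -> float:
        return self.retrieval_misses / max(self.queries, 1)

    def citation_coverage(self) -> float:
        return self.answers_with_citation / max(self.queries, 1)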
Re-embed on:
- Doc changes (new, updated, deleted).
- Embedding model changes (version upgrade).
Drift detection specifics: If retrieval quality degrades over time, possible causes: (1) docs changed, chunks no longer match queries; (2) query distribution shifted (new question types); (3) embedding model behavior changed. Track embedding model version in metadata; when you upgrade, run full eval and compare.
ROLLOUT
Phased rollout with gates:
| Phase | Audience | Gate to next phase |
|---|---|---|
| 1 | Internal team (10 users) | No critical bugs, groundedness > 90% on golden set |
| 2 | All internal (100+ users) | Feedback positive, retrieval miss rate < 10% |
| 3 | Beta customers (feature flag) | Citation coverage > 95%, no safety incidents |
| 4 | General availability | 2 weeks in beta with no rollback |
Add "report issue" on every answer: Let users flag wrong or harmful answers. Feed into eval set and incident response.
Feature flag + canary: Ship behind flag. Canary 5% of traffic to new model or retrieval config. Compare metrics (latency, feedback) before full rollout.
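A sketch of deterministic canary routing so each user consistently sees the same variant (flag name and percentage are placeholders):

# Hash-based canary bucket: same user always lands in the same variant
import hashlib

CANARY_PERCENT = 5    # route 5% of traffic to the new model/retrieval config

def use_canary(user_id: str, flag_enabled: bool) -> bool:
    if not flag_enabled:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_PERCENT

Compare latency and feedback between the canary and control buckets before widening the percentage.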
EXERCISE
Design a hallucination incident runbook:
- Detection: How do you know a hallucination happened? (User report? Automated groundedness check? Spot audit?)
- Mitigation: What do you do immediately? (Disable the feature? Add the question to a blocklist? Revert the model?)
- Root cause: Was it retrieval (wrong chunks), prompt (model ignoring context), or model (inherent tendency to hallucinate)? How do you triage?
- Prevention: What guardrails or evals would have caught this? Update the golden set? Add an output filter?
- Communication: Who do you notify? How do you document it for the post-mortem?
Additional eval scenarios: Add to your golden set: (1) questions with multiple valid answers—does the model pick one and cite? (2) questions that require combining 2+ chunks—does retrieval get both? (3) adversarial questions designed to elicit refusals—does the model resist?
Latency and cost: RAG adds retrieval + embedding + generation. Track p50/p99 latency. If retrieval is slow, consider caching frequent queries or using a faster embedding model. Monitor token usage—long contexts cost more.
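A cheap sketch of both ideas, assuming repeated questions tolerate cached answers; answer_pipeline is a placeholder for your RAG call:

# Cache frequent queries and track latency percentiles
import time
from functools import lru_cache
from statistics import quantiles

latencies_ms = []    # in production, emit to a metrics backend instead of a list

@lru_cache(maxsize=1024)
def cached_answer(normalized_question: str) -> str:
    return answer_pipeline(normalized_question)     # placeholder for the RAG call

def timed_answer(question: str) -> str:
    start = time.perf_counter()
    result = cached_answer(question.strip().lower())
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return result

def p50_p99():
    cuts = quantiles(latencies_ms, n=100)           # needs at least 2 samples
    return cuts[49], cuts[98]                       # p50, p99 in milliseconds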