CASE STUDY — Part VII: Production RAG Assistant (Eval, Drift, Safety)
SCENARIO
You're building a support assistant that answers questions from internal docs.
It must:
- cite sources
- avoid hallucinations
- handle doc drift
- be measurable
Why RAG is hard in production: RAG (Retrieval-Augmented Generation) combines retrieval (fetch relevant chunks) with generation (LLM produces answer). Each layer can fail: retrieval fetches wrong chunks, chunks are stale, the model ignores them and hallucinates, or the model refuses valid questions. In dev, a few hand-picked queries look great. In production, long-tail questions, adversarial inputs, and constantly changing docs expose weaknesses. You need eval, safety, drift detection, and phased rollout—or you ship a system that confidently gives wrong answers.
ARCHITECTURE
RAG pipeline:
- Ingest: Docs → chunk → embed → index (vector store).
- Query: User question → embed → retrieve top-k chunks.
- Generate: Compose prompt with chunks + question → LLM → answer + citations.
Architecture diagram:
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│    Docs     │─────►│   Chunker    │─────►│  Embedder   │
│ (Markdown,  │      │  (sentence,  │      │  (model X)  │
│    PDF)     │      │  paragraph)  │      │             │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                                                  ▼
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│    User     │─────►│  Retriever   │◄─────│   Vector    │
│  Question   │      │   (top-k)    │      │    Index    │
└─────────────┘      └──────┬───────┘      └─────────────┘
                            │
                            ▼
                  ┌──────────────────┐
                  │ Prompt + Chunks  │────► LLM ────► Answer + Citations
                  └──────────────────┘
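A minimal query-time sketch of this pipeline, assuming chunks are already embedded and indexed in memory; embed_fn (text → vector) and llm_fn (prompt → answer) are placeholder callables for whatever embedding and LLM provider you use:

# Query-time RAG sketch (embed_fn / llm_fn are placeholders)
import numpy as np

def retrieve_chunks(question, chunk_texts, chunk_vectors, embed_fn, k=5):
    # Embed the question and rank indexed chunks by cosine similarity.
    q = np.asarray(embed_fn(question), dtype=float)
    m = np.asarray(chunk_vectors, dtype=float)            # shape: (num_chunks, dim)
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    return [chunk_texts[i] for i in top]

def generate_answer(question, chunk_texts, chunk_vectors, embed_fn, llm_fn, k=5):
    # Compose a grounded prompt from the retrieved chunks and call the LLM.
    chunks = retrieve_chunks(question, chunk_texts, chunk_vectors, embed_fn, k)
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using ONLY the context below and cite sources as [n]. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_fn(prompt), chunks

A real deployment would use the vector store's own search API instead of brute-force cosine similarity, but the shape of the flow is the same.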
Chunking strategies: Sentence-based (small, precise), paragraph-based (more context), or semantic (split on topic boundaries). Tradeoff: smaller chunks = better retrieval precision, larger = more context per chunk. Start with 256–512 tokens; tune based on retrieval quality.
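One way to land in the 256–512 token range is greedy paragraph packing; a rough sketch that uses whitespace word count as a stand-in for real tokenization (swap in your tokenizer for accurate budgets):

# Paragraph-based chunker sketch (word count approximates tokens)
def chunk_paragraphs(text, max_tokens=400):
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())                   # crude token estimate
        if current and size + n > max_tokens:   # budget exceeded: close the chunk
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)                    # oversized paragraphs become their own chunk
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks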
Embedding model selection: Use the same model for index and query. Mixing models (e.g., index with model A, query with model B) degrades retrieval. Track embedding model version—when you upgrade, re-embed everything.
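A small sketch of that version tracking, assuming each indexed chunk carries a metadata dict (field and version names here are illustrative):

# Track the embedding model version with every stored vector
EMBEDDING_MODEL_VERSION = "embed-v2"    # illustrative version tag

def index_chunk(store, chunk_id, text, vector):
    # Store the vector together with the model version that produced it.
    store[chunk_id] = {
        "text": text,
        "vector": vector,
        "embedding_model": EMBEDDING_MODEL_VERSION,
    }

def check_index_compatibility(store):
    # Fail fast if any chunk was embedded with a different model version.
    stale = [cid for cid, c in store.items()
             if c["embedding_model"] != EMBEDDING_MODEL_VERSION]
    if stale:
        raise RuntimeError(
            f"{len(stale)} chunks need re-embedding with {EMBEDDING_MODEL_VERSION}")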
EVALUATION (BEFORE SHIPPING)
Create a golden set: 50–200 representative questions with expected key points. Include edge cases: out-of-scope questions, ambiguous questions, questions that should trigger "I don't know."
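A golden set can be a simple list checked into the repo; one possible shape that matches the eval loop below (questions and key points are made-up examples):

# Golden set: (question, expected key points); out-of-scope questions tracked separately
golden_set = [
    ("How do I reset my SSO password?",
     ["link to the SSO reset page", "mention the 24h token expiry"]),
    ("What is our refund policy for annual plans?",
     ["pro-rated refund", "cite the billing policy doc"]),
    ("What's the weather in Paris today?", []),           # should trigger a refusal
]
out_of_scope = {"What's the weather in Paris today?"}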
Metrics table:
| Metric | Definition | How to measure |
|---|---|---|
| Groundedness | Is every claim in the answer supported by retrieved chunks? | LLM-as-judge or rule-based: check if cited chunks contain the claim. |
| Faithfulness | Does the answer stay within the chunks? No external knowledge. | Same as groundedness; focus on "no unsupported claims." |
| Relevance | Does the answer address the question? | LLM-as-judge: score 1–5. |
| Retrieval quality | Did we fetch the right chunks? | Human eval: for each Q, are top-3 chunks relevant? |
| Refusal accuracy | For out-of-scope Q, does it say "I don't know" vs hallucinate? | Binary: correct refusal or not. |
Eval pipeline: Run golden set weekly (or on every deploy). Store scores in a dashboard. Regression: if groundedness drops > 5%, block deploy or investigate.
# Example eval loop (pseudocode; retrieve, generate, check_groundedness,
# llm_judge, and out_of_scope come from your pipeline)
results = []
for question, expected_key_points in golden_set:
    chunks = retrieve(question, k=5)                      # top-k retrieval
    answer = generate(question, chunks)                   # prompt + chunks -> LLM
    groundedness = check_groundedness(answer, chunks)     # claims supported by chunks?
    relevance = llm_judge(question, answer)               # 1-5 relevance score
    refusal_ok = (question in out_of_scope) == ("I don't know" in answer)
    results.append((question, groundedness, relevance, refusal_ok))
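A sketch of the regression gate, reading a stored baseline and treating the 5% threshold as 5 points on a 0–1 groundedness scale (file name and format are placeholders):

# Regression gate: block deploy if groundedness drops more than 5 points
import json, sys

GROUNDEDNESS_DROP_THRESHOLD = 0.05

def regression_gate(current_scores, baseline_path="eval_baseline.json"):
    current = sum(current_scores) / len(current_scores)
    with open(baseline_path) as f:
        baseline = json.load(f)["groundedness"]            # placeholder baseline format
    if current < baseline - GROUNDEDNESS_DROP_THRESHOLD:
        print(f"Groundedness regression: {baseline:.2f} -> {current:.2f}")
        sys.exit(1)                                         # non-zero exit blocks the CI deploy step
    print(f"Groundedness OK: {current:.2f} (baseline {baseline:.2f})")

Call it at the end of the eval run, e.g. regression_gate([g for _, g, _, _ in results]).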
SAFETY + GUARDRAILS
| Pattern | What it does | Example |
|---|---|---|
| Restrict tools/actions | Assistant can only call approved APIs. No arbitrary code execution. | Allow only search_docs, create_ticket. Block execute_shell. |
| Block prompt injection | Detect and reject inputs that try to override system prompt. | "Ignore previous instructions and..." → refuse. |
| Strip untrusted HTML | User pastes HTML; sanitize before including in prompt. | Remove <script>, onclick, etc. |
| Enforce "cite or refuse" | Every factual claim must have a citation. If no chunk supports it, refuse. | In prompt: "Only use information from the provided context. If unsure, say 'I don't know.'" |
| PII detection | Don't leak user PII in logs or to the model. | Redact SSN, credit cards before logging. |
| Output filtering | Post-process model output. Block toxic or off-topic content. | Keyword blocklist, classifier for harmful content. |
Input sanitization example: Truncate inputs to max length (e.g., 4K chars). Reject inputs with high ratio of special characters (potential injection). Escape or remove control characters.
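A minimal sanitizer along those lines (length limit and special-character threshold are illustrative; this is not a complete injection defense):

# Input sanitization sketch: truncate, strip control chars, reject suspicious inputs
import re

MAX_INPUT_CHARS = 4000

def sanitize_input(text):
    text = text[:MAX_INPUT_CHARS]
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)   # control chars (keep \t, \n, \r)
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if text and special / len(text) > 0.4:                      # high special-char ratio
        raise ValueError("Input rejected: possible injection payload")
    return text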
Output filtering: Run answer through a lightweight classifier before returning. If confidence of "harmful" > threshold, return a generic refusal.
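Sketch of that post-processing step, assuming harm_classifier is a placeholder callable returning a 0–1 harm probability from whatever moderation model you use:

# Output filter sketch: refuse instead of returning flagged answers
HARM_THRESHOLD = 0.8
REFUSAL_MESSAGE = "I can't help with that. Please contact support directly."

def filter_output(answer, harm_classifier):
    if harm_classifier(answer) > HARM_THRESHOLD:
        return REFUSAL_MESSAGE
    return answer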
DRIFT + MONITORING
Monitor (a minimal tracking sketch follows this list):
- Retrieval miss rate: % of queries where the top-k chunks contain no relevant chunk. High = chunking or embedding issue.
- Citation coverage: % of answers with at least one citation. Low = model ignoring context.
- User feedback: "Was this helpful?" thumbs up/down. Track by question type.
- Doc freshness lag: time from doc update to re-embed and re-index. Alert if > 24h for critical docs.
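An in-process tracker for the first two metrics; in production you would emit these counters to your metrics backend instead, and the citation check here is deliberately crude:

# Rolling counters for retrieval miss rate and citation coverage
from dataclasses import dataclass

@dataclass
class RagMetrics:
    queries: int = 0
    retrieval_misses: int = 0
    answers_with_citation: int = 0

    def record(self, retrieved_relevant: bool, answer: str):
        self.queries += 1
        if not retrieved_relevant:
            self.retrieval_misses += 1
        if "[" in answer and "]" in answer:        # crude citation marker check, e.g. "[2]"
            self.answers_with_citation += 1

    def miss_rate(self) -> float:
        return self.retrieval_misses / max(self.queries, 1)

    def citation_coverage(self) -> float:
        return self.answers_with_citation / max(self.queries, 1)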
Re-embed on:
- Doc changes (new, updated, deleted).
- Embedding model changes (version upgrade).
Drift detection specifics: If retrieval quality degrades over time, possible causes: (1) docs changed, chunks no longer match queries; (2) query distribution shifted (new question types); (3) embedding model behavior changed. Track embedding model version in metadata; when you upgrade, run full eval and compare.
ROLLOUT
Phased rollout with gates:
| Phase | Audience | Gate to next phase |
|---|---|---|
| 1 | Internal team (10 users) | No critical bugs, groundedness > 90% on golden set |
| 2 | All internal (100+ users) | Feedback positive, retrieval miss rate < 10% |
| 3 | Beta customers (feature flag) | Citation coverage > 95%, no safety incidents |
| 4 | General availability | 2 weeks in beta with no rollback |
Add "report issue" on every answer: Let users flag wrong or harmful answers. Feed into eval set and incident response.
Feature flag + canary: Ship behind flag. Canary 5% of traffic to new model or retrieval config. Compare metrics (latency, feedback) before full rollout.
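A sketch of deterministic canary routing so each user consistently sees the same variant (flag name and percentage are placeholders):

# Hash-based canary bucket: same user always lands in the same variant
import hashlib

CANARY_PERCENT = 5    # route 5% of traffic to the new model/retrieval config

def use_canary(user_id: str, flag_enabled: bool) -> bool:
    if not flag_enabled:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_PERCENT

Compare latency and feedback between the canary and control buckets before widening the percentage.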
EXERCISE
Design a hallucination incident runbook:
- Detection: How do you know a hallucination happened? (User report? Automated groundedness check? Spot audit?)
- Mitigation: What do you do immediately? (Disable the feature? Add the question to a blocklist? Revert the model?)
- Root cause: Was it retrieval (wrong chunks), prompt (model ignoring context), or model (inherent tendency to hallucinate)? How do you triage?
- Prevention: What guardrails or evals would have caught this? Update the golden set? Add an output filter?
- Communication: Who do you notify? How do you document it for the post-mortem?
Additional eval scenarios: Add to your golden set: (1) questions with multiple valid answers—does the model pick one and cite? (2) questions that require combining 2+ chunks—does retrieval get both? (3) adversarial questions designed to elicit refusals—does the model resist?
Latency and cost: RAG adds retrieval + embedding + generation. Track p50/p99 latency. If retrieval is slow, consider caching frequent queries or using a faster embedding model. Monitor token usage—long contexts cost more.
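A cheap sketch of both ideas, assuming repeated questions tolerate cached answers; answer_pipeline is a placeholder for your RAG call:

# Cache frequent queries and track latency percentiles
import time
from functools import lru_cache
from statistics import quantiles

latencies_ms = []    # in production, emit to a metrics backend instead of a list

@lru_cache(maxsize=1024)
def cached_answer(normalized_question: str) -> str:
    return answer_pipeline(normalized_question)     # placeholder for the RAG call

def timed_answer(question: str) -> str:
    start = time.perf_counter()
    result = cached_answer(question.strip().lower())
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return result

def p50_p99():
    cuts = quantiles(latencies_ms, n=100)           # needs at least 2 samples
    return cuts[49], cuts[98]                       # p50, p99 in milliseconds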