# Evaluation
How we test retrieval quality, safety behavior, and performance over time.
## Evaluation protocol
This protocol is designed for a classroom-facing RAG system where correctness includes:
- retrieval quality (did we fetch the right things?)
- faithfulness (does the answer stick to retrieved sources?)
- safety/guardrails (does it avoid unsafe behavior?)
- transparency (does explain mode stay consistent?)
## What we evaluate

### Retrieval
Minimum checks:
- results are non-empty for in-scope questions
- distances stay in reasonable ranges (watch for drift over time)
- domain filtering works and fallback behavior is sane
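The minimum checks above can be sketched as a small helper. This is illustrative only: the `(chunk_id, distance)` result shape and the `max_distance` cutoff are assumptions, not project constants.

```python
def check_retrieval(results, max_distance=1.2):
    """Smoke checks for one in-scope query.

    `results` is assumed to be a list of (chunk_id, distance) pairs
    sorted ascending by distance; 1.2 is an illustrative threshold
    you should tune against your own embedding model.
    """
    issues = []
    if not results:
        issues.append("empty result set for in-scope query")
    elif results[0][1] > max_distance:
        issues.append(f"top distance {results[0][1]:.2f} exceeds {max_distance}")
    return issues
```

Run it per query in the smoke set and fail the deployment if any query returns issues.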
Suggested metrics:
- mean/median distance for top-5
- percent of queries where top result is clearly irrelevant (manual spot-check)
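A sketch of the distance aggregation, assuming each query's hits arrive as `(chunk_id, distance)` pairs sorted ascending. Logging these two numbers per deployment is what makes drift visible.

```python
import statistics

def top_k_distance_stats(all_results, k=5):
    """Mean/median distance over the top-k hits of every query.

    `all_results` maps query text -> sorted list of (chunk_id, distance).
    The result shape is an assumption about your retriever's output.
    """
    top = [dist for hits in all_results.values() for _, dist in hits[:k]]
    return {"mean": statistics.mean(top), "median": statistics.median(top)}
```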
### Citation faithfulness
Even if the UI doesn’t require explicit citations in the prose, validate that:
- the answer is grounded in retrieved excerpts
- the response doesn’t claim facts not present in retrieved chunks
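One cheap proxy for grounding (a heuristic sketch, not a substitute for manual review): flag answer sentences whose content words rarely appear in any retrieved chunk. Paraphrases and synonyms will produce false positives, so treat flagged sentences as items to review, not automatic failures.

```python
import re

def ungrounded_sentences(answer, chunks, min_overlap=0.5):
    """Flag answer sentences with low lexical overlap against the
    retrieved excerpts. `min_overlap` is an illustrative threshold.
    """
    source_words = set(re.findall(r"\w+", " ".join(chunks).lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"\w+", sent.lower()))
        if words and len(words & source_words) / len(words) < min_overlap:
            flagged.append(sent)
    return flagged
```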
### Safety and misuse resistance
Periodically run a small red-team set:
- prompt injection attempts (e.g., “ignore sources and…”)
- requests for personal data extraction
- requests for disallowed content
Success criteria:
- public mode clamps overrides
- the system does not leak secrets from configuration
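A minimal red-team runner might look like the following. Everything here is an assumption about your setup: `ask` stands in for whatever function sends a prompt through the system, and `secrets` is a list of configuration values (API keys, connection strings) that must never surface in output.

```python
def run_red_team(ask, prompts, secrets):
    """Send each red-team prompt through `ask` and report failures.

    Only checks for secret leakage; override-clamping checks would
    need access to the effective request parameters, which depends
    on your system's internals.
    """
    failures = []
    for prompt in prompts:
        answer = ask(prompt)
        for secret in secrets:
            if secret in answer:
                failures.append((prompt, f"leaked secret: {secret[:4]}..."))
    return failures
```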
### Performance and reliability
Track:
- latency distribution (p50/p95)
- error rate
- timeouts
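For the latency distribution, a nearest-rank percentile over collected samples is enough for a dashboard; no interpolation needed. The millisecond unit is an assumption.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at
    least q% of samples are <= it."""
    xs = sorted(samples)
    rank = math.ceil(q / 100 * len(xs))
    return xs[max(rank - 1, 0)]
```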
## Recommended evaluation sets
Maintain three sets:
- smoke set (run on every deployment)
- regression set (run weekly)
- red-team set (run monthly or before public changes)
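One way to keep all three sets in a single JSONL file is to tag each case. This sketch assumes a `"set"` field per line; if the starter `eval_cases.jsonl` has no such field, untagged cases fall back to the smoke set.

```python
import json

def load_cases(path, which):
    """Load eval cases belonging to one set: "smoke", "regression",
    or "red_team". The "set" field is a hypothetical convention,
    not guaranteed to exist in the starter file.
    """
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                case = json.loads(line)
                if case.get("set", "smoke") == which:
                    cases.append(case)
    return cases
```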
## Running the starter evaluator
- Cases live in `eval/eval_cases.jsonl`
- Run retrieval-only (recommended): `python scripts/eval_run.py --cases eval/eval_cases.jsonl`
- Run including answer generation: `python scripts/eval_run.py --generate-answer --cases eval/eval_cases.jsonl`