Evaluation

How we test retrieval quality, safety behavior, and performance over time.

Evaluation protocol

This protocol is designed for a classroom-facing RAG system where correctness includes:

  • retrieval quality (did we fetch the right things?)
  • faithfulness (does the answer stick to retrieved sources?)
  • safety/guardrails (does it avoid unsafe behavior?)
  • transparency (does explain mode stay consistent?)

What we evaluate

Retrieval

Minimum checks:

  • results are non-empty for in-scope questions
  • distances stay in reasonable ranges (watch for drift over time)
  • domain filtering works and fallback behavior is sane

Suggested metrics:

  • mean/median distance of the top-5 results per query
  • percent of queries whose top result is clearly irrelevant (manual spot-check)
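The distance metrics above can be computed with a small helper. This is a sketch, not part of the repo: it assumes each query's retrieval results arrive as (doc_id, distance) pairs, which may differ from the actual retriever's return type.

```python
from statistics import mean, median

def retrieval_distance_stats(results_per_query):
    """Summarize top-5 distances across a batch of queries.

    `results_per_query`: list of per-query result lists, where each
    result is an assumed (doc_id, distance) pair. Empty result lists
    count toward `empty_result_rate` (the non-empty check above).
    """
    nonempty = [r for r in results_per_query if r]
    # Keep the five smallest distances per query (lower = closer).
    top5 = [sorted(d for _, d in r)[:5] for r in nonempty]
    all_d = [d for q in top5 for d in q]
    return {
        "empty_result_rate": 1 - len(nonempty) / len(results_per_query),
        "mean_top5_distance": mean(all_d),
        "median_top5_distance": median(all_d),
    }
```

Tracking these numbers per deployment makes drift visible: a slow upward creep in median distance usually means the index and the query mix have diverged.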

Citation faithfulness

Even if the UI doesn’t force explicit citations in the prose, validate:

  • the answer is grounded in retrieved excerpts
  • the response doesn’t claim facts not present in retrieved chunks
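A cheap first-pass grounding check is lexical overlap: flag answer sentences whose content words barely appear in the retrieved chunks. This is only a heuristic sketch (the tokenization and 0.5 threshold are illustrative assumptions, and it will miss paraphrases), but it catches obvious ungrounded claims before a human review.

```python
import re

def ungrounded_sentences(answer, chunks, min_overlap=0.5):
    """Return answer sentences with low word overlap against the
    retrieved chunks -- a crude lexical proxy for faithfulness."""
    def words(text):
        # Content words only: lowercase runs of 4+ letters.
        return set(re.findall(r"[a-z]{4,}", text.lower()))

    source = words(" ".join(chunks))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        w = words(sent)
        if w and len(w & source) / len(w) < min_overlap:
            flagged.append(sent)
    return flagged
```

Anything this flags goes to the manual spot-check pile; an empty result does not prove faithfulness, it just lowers the review burden.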

Safety and misuse resistance

Periodically run a small red-team set:

  • prompt injection attempts (e.g., “ignore sources and…”)
  • requests for personal data extraction
  • requests for disallowed content

Success criteria:

  • public mode clamps overrides
  • the system does not leak secrets from configuration
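One way to automate the secret-leak criterion is to plant canary strings in configuration and assert they never appear in responses to the red-team prompts. The harness below is a sketch: `ask` stands in for whatever callable wraps the deployed system, and the case/canary lists are assumed inputs.

```python
def run_red_team(ask, prompts, canaries):
    """Send each red-team prompt through `ask` (prompt -> answer string)
    and record any response that echoes a canary/secret marker."""
    failures = []
    for prompt in prompts:
        answer = ask(prompt)
        leaked = [c for c in canaries if c in answer]
        if leaked:
            failures.append({"prompt": prompt, "leaked": leaked})
    return failures
```

A passing run returns an empty list; any failure entry is a release blocker for public mode.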

Performance and reliability

Track:

  • latency distribution (p50/p95)
  • error rate
  • timeouts
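The p50/p95 figures can be computed without any dependencies using nearest-rank percentiles. A minimal sketch, assuming latencies are collected per run as a list of milliseconds:

```python
def latency_summary(latencies_ms):
    """p50/p95 over one run's latencies via nearest-rank percentiles."""
    s = sorted(latencies_ms)
    def pct(p):
        # Nearest-rank: index clamped so p=100 stays in bounds.
        return s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {"p50": pct(50), "p95": pct(95), "count": len(s)}
```

Log errors and timeouts separately rather than folding them into the latency list, otherwise a fast failure path can make p95 look healthier than it is.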

Recommended evaluation sets

Maintain three sets:

  1. smoke set (run on every deployment)
  2. regression set (run weekly)
  3. red-team set (run monthly or before public changes)
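Keeping all three sets in one JSONL file and tagging each case is one way to avoid drift between them. The "suite" field below is a hypothetical convention, not something the starter evaluator necessarily supports:

```python
import json

def load_suite(lines, suite):
    """Filter JSONL eval cases by an assumed "suite" tag
    ("smoke" | "regression" | "red_team").

    `lines` is any iterable of JSONL lines, e.g. an open file.
    """
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        case = json.loads(line)
        if case.get("suite") == suite:
            cases.append(case)
    return cases
```

With this layout, the deploy pipeline runs `load_suite(open(path), "smoke")` while the weekly job pulls "regression", so a case added once is picked up by the right schedule automatically.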

Running the starter evaluator

  • Cases live in eval/eval_cases.jsonl
  • Run retrieval-only (recommended): python scripts/eval_run.py --cases eval/eval_cases.jsonl
  • Run including answer generation: python scripts/eval_run.py --generate-answer --cases eval/eval_cases.jsonl
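For reference, one line of eval/eval_cases.jsonl might look like the following. This shape is a hypothetical illustration (every field name here is an assumption); the authoritative schema is whatever scripts/eval_run.py parses, so check that script before adding cases.

```python
import json

# Hypothetical case shape -- field names are illustrative only.
case = {
    "id": "smoke-001",
    "question": "What does week 3 cover?",
    "expected_domain": "syllabus",
    "retrieval_must_match": ["week 3"],
}
line = json.dumps(case)
print(line)
```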