Evaluation

How we test retrieval quality, safety behavior, and performance over time.

Evaluation protocol

This protocol is designed for a classroom-facing RAG system where correctness includes:

  • retrieval quality (did we fetch the right things?)
  • faithfulness (does the answer stick to retrieved sources?)
  • safety/guardrails (does it avoid unsafe behavior?)
  • transparency (does explain mode stay consistent?)

What we evaluate

Retrieval

Minimum checks:

  • results are non-empty for in-scope questions
  • distances stay in reasonable ranges (watch for drift over time)
  • domain filtering works and fallback behavior is sane

Suggested metrics:

  • mean/median distance of the top-5 results per query
  • percent of queries whose top result is clearly irrelevant (manual spot-check)
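The distance metrics above can be computed with a small helper. This is a sketch, not part of the repo: it assumes each query's retrieval results arrive as (doc_id, distance) pairs, which may differ from the actual retriever's return type.

```python
from statistics import mean, median

def retrieval_distance_stats(results_per_query):
    """Summarize top-5 distances across a batch of queries.

    `results_per_query`: list of per-query result lists, where each
    result is an assumed (doc_id, distance) pair. Empty result lists
    count toward `empty_result_rate` (the non-empty check above).
    """
    nonempty = [r for r in results_per_query if r]
    # Keep the five smallest distances per query (lower = closer).
    top5 = [sorted(d for _, d in r)[:5] for r in nonempty]
    all_d = [d for q in top5 for d in q]
    return {
        "empty_result_rate": 1 - len(nonempty) / len(results_per_query),
        "mean_top5_distance": mean(all_d),
        "median_top5_distance": median(all_d),
    }
```

Tracking these numbers per deployment makes drift visible: a slow upward creep in median distance usually means the index and the query mix have diverged.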

Citation faithfulness

Even if the UI doesn’t force explicit citations in the prose, validate:

  • the answer is grounded in retrieved excerpts
  • the response doesn’t claim facts not present in retrieved chunks
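A cheap first-pass grounding check is lexical overlap: flag answer sentences whose content words barely appear in the retrieved chunks. This is only a heuristic sketch (the tokenization and 0.5 threshold are illustrative assumptions, and it will miss paraphrases), but it catches obvious ungrounded claims before a human review.

```python
import re

def ungrounded_sentences(answer, chunks, min_overlap=0.5):
    """Return answer sentences with low word overlap against the
    retrieved chunks -- a crude lexical proxy for faithfulness."""
    def words(text):
        # Content words only: lowercase runs of 4+ letters.
        return set(re.findall(r"[a-z]{4,}", text.lower()))

    source = words(" ".join(chunks))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        w = words(sent)
        if w and len(w & source) / len(w) < min_overlap:
            flagged.append(sent)
    return flagged
```

Anything this flags goes to the manual spot-check pile; an empty result does not prove faithfulness, it just lowers the review burden.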

Safety and misuse resistance

Periodically run a small red-team set:

  • prompt injection attempts (e.g., “ignore sources and…”)
  • requests for personal data extraction
  • requests for disallowed content

Success criteria:

  • public mode clamps overrides
  • the system does not leak secrets from configuration
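One way to automate the secret-leak criterion is to plant canary strings in configuration and assert they never appear in responses to the red-team prompts. The harness below is a sketch: `ask` stands in for whatever callable wraps the deployed system, and the case/canary lists are assumed inputs.

```python
def run_red_team(ask, prompts, canaries):
    """Send each red-team prompt through `ask` (prompt -> answer string)
    and record any response that echoes a canary/secret marker."""
    failures = []
    for prompt in prompts:
        answer = ask(prompt)
        leaked = [c for c in canaries if c in answer]
        if leaked:
            failures.append({"prompt": prompt, "leaked": leaked})
    return failures
```

A passing run returns an empty list; any failure entry is a release blocker for public mode.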

Performance and reliability

Track:

  • latency distribution (p50/p95)
  • error rate
  • timeouts
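The p50/p95 figures can be computed without any dependencies using nearest-rank percentiles. A minimal sketch, assuming latencies are collected per run as a list of milliseconds:

```python
def latency_summary(latencies_ms):
    """p50/p95 over one run's latencies via nearest-rank percentiles."""
    s = sorted(latencies_ms)
    def pct(p):
        # Nearest-rank: index clamped so p=100 stays in bounds.
        return s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {"p50": pct(50), "p95": pct(95), "count": len(s)}
```

Log errors and timeouts separately rather than folding them into the latency list, otherwise a fast failure path can make p95 look healthier than it is.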

Recommended evaluation sets

Maintain three sets:

  1. smoke set (run on every deployment)
  2. regression set (run weekly)
  3. red-team set (run monthly or before public changes)
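Keeping all three sets in one JSONL file and tagging each case is one way to avoid drift between them. The "suite" field below is a hypothetical convention, not something the starter evaluator necessarily supports:

```python
import json

def load_suite(lines, suite):
    """Filter JSONL eval cases by an assumed "suite" tag
    ("smoke" | "regression" | "red_team").

    `lines` is any iterable of JSONL lines, e.g. an open file.
    """
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        case = json.loads(line)
        if case.get("suite") == suite:
            cases.append(case)
    return cases
```

With this layout, the deploy pipeline runs `load_suite(open(path), "smoke")` while the weekly job pulls "regression", so a case added once is picked up by the right schedule automatically.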

Running the starter evaluator

  • Cases live in eval/eval_cases.jsonl
  • Run retrieval-only (recommended): python scripts/eval_run.py --cases eval/eval_cases.jsonl
  • Run including answer generation: python scripts/eval_run.py --generate-answer --cases eval/eval_cases.jsonl
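For reference, one line of eval/eval_cases.jsonl might look like the following. This shape is a hypothetical illustration (every field name here is an assumption); the authoritative schema is whatever scripts/eval_run.py parses, so check that script before adding cases.

```python
import json

# Hypothetical case shape -- field names are illustrative only.
case = {
    "id": "smoke-001",
    "question": "What does week 3 cover?",
    "expected_domain": "syllabus",
    "retrieval_must_match": ["week 3"],
}
line = json.dumps(case)
print(line)
```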