Testing protocol
How we test retrieval quality, grounding, and ethics/compliance risks over time.
This page answers: “How do we test what works and what doesn’t?”
For a sources-first RAG system, we test three things:
- Retrieval quality: did we fetch the right sources?
- Grounding quality: are answers supported by the retrieved sources?
- Safety/ethics compliance: do we avoid privacy/copyright pitfalls?
Build a small, high-quality evaluation set
Create a table/spreadsheet (or DB table later) with:
- `query`
- `expected_sources` (URLs or document IDs)
- `domain` (policy/standards/academic)
- `jurisdiction` and `language`
- `difficulty`
- `notes` (why this query matters)
Start with 50–200 queries (enough to measure drift, but small enough to review).
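The schema above can be sketched as plain Python records; the field values here are illustrative placeholders, not real eval entries.

```python
# Minimal sketch of one evaluation-set row; column names mirror the
# schema above, values are invented examples for illustration only.
EVAL_ROW_FIELDS = {
    "query",
    "expected_sources",
    "domain",
    "jurisdiction",
    "language",
    "difficulty",
    "notes",
}

EVAL_SET = [
    {
        "query": "What does the policy say about data retention periods?",
        "expected_sources": ["doc-0142", "doc-0587"],  # document IDs or URLs
        "domain": "policy",
        "jurisdiction": "EU",
        "language": "en",
        "difficulty": "easy",
        "notes": "Canonical lookup; retrieval failure here signals regression.",
    },
]

# Lightweight sanity check that every row carries all expected columns.
for row in EVAL_SET:
    assert set(row) == EVAL_ROW_FIELDS, f"missing or extra fields: {row.keys()}"
```

Keeping rows as simple dicts (or a CSV with the same columns) makes the set easy to review by hand before it ever reaches a database table.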
Retrieval metrics (objective)
For each query, run retrieval and compute:
- Recall@k: does an expected source appear in the top‑k?
- nDCG@k: are the best sources ranked higher?
Slice results by domain/jurisdiction/language to detect systematic gaps.
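Both metrics are a few lines of code. This is a minimal sketch assuming binary relevance (a retrieved document either is or is not in `expected_sources`); graded-relevance nDCG would weight gains differently.

```python
import math

def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """1.0 if any expected source appears in the top-k results, else 0.0."""
    return float(any(doc in expected for doc in retrieved[:k]))

def ndcg_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: expected sources get gain 1, others 0."""
    # DCG: discount each hit by log2(rank + 1), ranks starting at 1.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(retrieved[:k])
        if doc in expected
    )
    # Ideal DCG: all expected sources ranked at the top.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(expected), k)))
    return dcg / ideal if ideal else 0.0
```

For example, `recall_at_k(["a", "b", "c"], {"b"}, k=2)` is `1.0`, while an expected source ranked second instead of first drops nDCG@k below 1.0, which is exactly the ranking signal Recall@k misses.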
Grounding metrics (RAG-specific)
Human rubric (quick to apply):
- Citation precision: do cited passages actually support the claim?
- Quote correctness: are quotes accurate and not out of context?
- Uncertainty handling: does the assistant say “unknown” when sources don’t support an answer?
Minimum expectation: any strong factual claim should be backed by at least one retrieved citation.
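The rubric stays human-applied, but its results are easy to aggregate. A minimal sketch, assuming each criterion is scored pass/fail (1/0) per answer; the `RubricScore` name and pass-all aggregation rule are illustrative choices, not a fixed standard.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    # Each criterion is scored by a human reviewer: 1 = pass, 0 = fail.
    citation_precision: int    # cited passages actually support the claim
    quote_correctness: int     # quotes accurate and in context
    uncertainty_handling: int  # says "unknown" when sources don't support

def grounding_rate(scores: list[RubricScore]) -> float:
    """Share of answers that pass every rubric criterion."""
    if not scores:
        return 0.0
    passed = sum(
        s.citation_precision and s.quote_correctness and s.uncertainty_handling
        for s in scores
    )
    return passed / len(scores)
```

Tracking this rate per corpus slice (like the retrieval metrics) shows whether grounding degrades in specific domains or languages.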
Rights, privacy, and injection tests (ethics-first)
We also run “red team” tests that specifically probe:
- Copyright leakage: does the system reproduce long verbatim passages?
  - Expectation: short excerpts with citations; never full-document reproduction.
- PII exposure: if any PII is in the corpus, can the system be prompted to reveal it?
  - Expectation: avoid ingesting PII in the first place, redact what slips through, and refuse when prompted for it.
- Prompt injection in sources: can a malicious PDF or web page override system instructions?
  - Expectation: enforce the instruction hierarchy and treat retrieved text as untrusted input.
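The copyright-leakage probe can be automated with a verbatim-overlap check. A minimal sketch: find the longest word-for-word run a generated answer shares with a source document and flag it against a threshold. The 50-word limit is an illustrative project setting, not a legal standard.

```python
def longest_verbatim_run(answer: str, source: str) -> int:
    """Length in words of the longest word sequence from `source`
    reproduced verbatim in `answer` (simple O(n*m) dynamic program)."""
    a, s = answer.split(), source.split()
    best = 0
    prev = [0] * (len(s) + 1)  # length of common run ending at (i-1, j-1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            if a[i - 1] == s[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Illustrative threshold: runs longer than this count as leakage.
VERBATIM_LIMIT_WORDS = 50

def leaks_copyright(answer: str, source: str) -> bool:
    return longest_verbatim_run(answer, source) > VERBATIM_LIMIT_WORDS
```

Running this over every (answer, retrieved-source) pair in the red-team suite turns "avoid long verbatim passages" into a measurable pass/fail check.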
What “works” means operationally
Track these dashboard KPIs per corpus slice:
- % of PDFs with `ethical_review_status != unreviewed`
- % of PDFs with rights evidence recorded
- % of corpus in `ALLOW_FULLTEXT` vs `METADATA_ONLY`
- Retrieval quality by slice (policy vs standards vs academic)
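The corpus-health KPIs can be computed from per-document metadata. A minimal sketch; the field names (`ethical_review_status`, `rights_evidence`, `access_mode`) are assumptions about the catalog schema, not a fixed API.

```python
def kpi_summary(pdfs: list[dict]) -> dict[str, float]:
    """Dashboard KPIs over a list of per-PDF metadata records.
    Assumed keys: ethical_review_status, rights_evidence,
    access_mode ('ALLOW_FULLTEXT' or 'METADATA_ONLY')."""
    n = len(pdfs)
    if n == 0:
        return {"pct_reviewed": 0.0, "pct_rights_evidence": 0.0, "pct_fulltext": 0.0}
    return {
        "pct_reviewed": sum(
            p["ethical_review_status"] != "unreviewed" for p in pdfs
        ) / n,
        "pct_rights_evidence": sum(bool(p.get("rights_evidence")) for p in pdfs) / n,
        "pct_fulltext": sum(p["access_mode"] == "ALLOW_FULLTEXT" for p in pdfs) / n,
    }
```

Computing the same summary per corpus slice (group records by domain first) makes the rights/quality trade-off in the warning below visible on one dashboard.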
If quality improves but rights evidence declines, treat that as a project risk (not a success).