Testing protocol

How we test retrieval quality, grounding, and ethics/compliance risks over time.

This page answers: “How do we test what works and what doesn’t?”

For a sources-first RAG system, we test three things:

  1. Retrieval quality: did we fetch the right sources?
  2. Grounding quality: are answers supported by the retrieved sources?
  3. Safety/ethics compliance: do we avoid privacy/copyright pitfalls?

Build a small, high-quality evaluation set

Create a table or spreadsheet (later, a database table) with these columns:

  • query
  • expected_sources (URLs or document IDs)
  • domain (policy/standards/academic)
  • jurisdiction and language
  • difficulty
  • notes (why this query matters)

Start with 50–200 queries (enough to measure drift, but small enough to review).
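The schema above can be sketched as a small Python dataclass. Everything here is illustrative: the field names mirror the bullet list, and the example query and URL are placeholders, not entries from any real evaluation set.

```python
# A minimal sketch of the evaluation-set schema described above.
# All values below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    expected_sources: list  # URLs or document IDs
    domain: str             # "policy", "standards", or "academic"
    jurisdiction: str
    language: str
    difficulty: str         # e.g. "easy", "medium", "hard"
    notes: str = ""         # why this query matters

eval_set = [
    EvalQuery(
        query="What does the GDPR say about data minimisation?",
        expected_sources=["https://eur-lex.europa.eu/eli/reg/2016/679/oj"],
        domain="policy",
        jurisdiction="EU",
        language="en",
        difficulty="easy",
        notes="Canonical regulation lookup; retrieval should be trivial.",
    ),
]
```

Keeping the schema this small makes it easy to maintain the set in a spreadsheet first and migrate to a database table later without changing the evaluation code.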

Retrieval metrics (objective)

For each query, run retrieval and compute:

  • Recall@k: does an expected source appear in the top‑k?
  • nDCG@k: are the best sources ranked higher?

Slice results by domain/jurisdiction/language to detect systematic gaps.
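Both metrics are standard and take only a few lines to implement. A sketch, assuming retrieval returns an ordered list of document IDs and that graded relevance judgements (source ID → relevance grade) are available for nDCG:

```python
import math

def recall_at_k(expected, retrieved, k):
    """1.0 if any expected source appears in the top-k results, else 0.0."""
    return 1.0 if set(expected) & set(retrieved[:k]) else 0.0

def ndcg_at_k(relevance, retrieved, k):
    """Normalised DCG: rewards ranking the best sources higher.

    relevance: dict mapping source ID -> graded relevance (0 = irrelevant).
    """
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these per slice (domain, jurisdiction, language) rather than only globally is what surfaces the systematic gaps.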

Grounding metrics (RAG-specific)

Human rubric (quick to apply):

  • Citation precision: do cited passages actually support the claim?
  • Quote correctness: are quotes accurate and not out of context?
  • Uncertainty handling: does the assistant say “unknown” when sources don’t support an answer?

Minimum expectation: any strong factual claim should be backed by at least one retrieved citation.
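Rubric judgements are cheap to aggregate once collected. A sketch, assuming each judged answer gets a 0/1 label per criterion (the criterion keys below are assumed names matching the rubric):

```python
# Aggregate human rubric judgements into a pass rate per criterion.
# Criterion names are assumptions mirroring the rubric above.
from collections import defaultdict

CRITERIA = ("citation_precision", "quote_correctness", "uncertainty_handling")

def rubric_pass_rates(judgements):
    """judgements: list of dicts with a 0/1 value for each criterion."""
    totals = defaultdict(int)
    for j in judgements:
        for c in CRITERIA:
            totals[c] += j[c]
    n = len(judgements)
    return {c: totals[c] / n for c in CRITERIA}
```

Reporting each criterion separately (rather than one blended score) shows whether failures come from weak citations, misquoting, or overconfident answers.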

Rights, privacy, and injection tests (ethics-first)

We also run “red team” tests that specifically probe:

  • Copyright leakage: does the system reproduce long verbatim passages?

    • Expectation: short excerpts with citations; never reproduce full documents.
  • PII exposure: if any PII is in the corpus, can the system be prompted to reveal it?

    • Expectation: avoid ingesting PII in the first place; redact what remains; refuse when prompted to reveal it.
  • Prompt injection in sources: can a malicious PDF/web page override system instructions?

    • Expectation: enforce an instruction hierarchy and treat all retrieved text as untrusted input.
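The copyright-leakage check can be automated with a simple n-gram overlap heuristic: find the longest word sequence an answer shares verbatim with a source, and flag answers that exceed a policy threshold. A sketch (whitespace tokenisation only; a production check would also normalise punctuation and case):

```python
def longest_verbatim_run(answer: str, source: str) -> int:
    """Length, in words, of the longest word sequence from `answer`
    that appears verbatim in `source`. A copyright-leakage signal:
    flag answers whose longest run exceeds a policy threshold."""
    a, s = answer.split(), source.split()
    longest, n = 0, 1
    while n <= len(a):
        source_ngrams = {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
        if any(tuple(a[i:i + n]) in source_ngrams
               for i in range(len(a) - n + 1)):
            longest = n
            n += 1
        else:
            break
    return longest
```

For example, `longest_verbatim_run("the quick brown fox jumps", "a quick brown fox ran")` is 3: the answer reuses "quick brown fox" verbatim but nothing longer.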

What “works” means operationally

Track these dashboard KPIs per corpus slice:

  • % of PDFs with ethical_review_status != unreviewed
  • % of PDFs with rights evidence recorded
  • % of corpus in ALLOW_FULLTEXT vs METADATA_ONLY
  • Retrieval quality by slice (policy vs standards vs academic)

If quality improves but rights evidence declines, treat that as a project risk (not a success).
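The KPIs above can be computed directly from per-document metadata. A sketch, assuming each document record carries `ethical_review_status`, `rights_evidence`, and an `access_tier` field holding `ALLOW_FULLTEXT` or `METADATA_ONLY` (these field names are assumptions matching the KPI list, not a fixed schema):

```python
# Compute dashboard KPIs from per-document metadata.
# Field names are assumptions mirroring the KPI list above.
def corpus_kpis(docs):
    n = len(docs)
    return {
        "pct_reviewed": sum(
            d["ethical_review_status"] != "unreviewed" for d in docs) / n,
        "pct_rights_evidence": sum(
            bool(d.get("rights_evidence")) for d in docs) / n,
        "pct_allow_fulltext": sum(
            d["access_tier"] == "ALLOW_FULLTEXT" for d in docs) / n,
    }
```

Running this per corpus slice, alongside the retrieval metrics, makes the failure mode in the last paragraph visible: a slice where retrieval quality rises while `pct_rights_evidence` falls should trip a review, not a celebration.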