Testing protocol

How we test retrieval quality, grounding, and ethics/compliance risks over time.

This page answers: “How do we test what works and what doesn’t?”

For a sources-first RAG system, we test three things:

  1. Retrieval quality: did we fetch the right sources?
  2. Grounding quality: are answers supported by the retrieved sources?
  3. Safety/ethics compliance: do we avoid privacy/copyright pitfalls?

Build a small, high-quality evaluation set

Create a table or spreadsheet (later, a database table) with these columns:

  • query
  • expected_sources (URLs or document IDs)
  • domain (policy/standards/academic)
  • jurisdiction and language
  • difficulty
  • notes (why this query matters)

Start with 50–200 queries (enough to measure drift, but small enough to review).
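The schema above can be sketched as a small Python dataclass. Everything here is illustrative: the field names mirror the bullet list, and the example query and URL are placeholders, not entries from any real evaluation set.

```python
# A minimal sketch of the evaluation-set schema described above.
# All values below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    expected_sources: list  # URLs or document IDs
    domain: str             # "policy", "standards", or "academic"
    jurisdiction: str
    language: str
    difficulty: str         # e.g. "easy", "medium", "hard"
    notes: str = ""         # why this query matters

eval_set = [
    EvalQuery(
        query="What does the GDPR say about data minimisation?",
        expected_sources=["https://eur-lex.europa.eu/eli/reg/2016/679/oj"],
        domain="policy",
        jurisdiction="EU",
        language="en",
        difficulty="easy",
        notes="Canonical regulation lookup; retrieval should be trivial.",
    ),
]
```

Keeping the schema this small makes it easy to maintain the set in a spreadsheet first and migrate to a database table later without changing the evaluation code.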

Retrieval metrics (objective)

For each query, run retrieval and compute:

  • Recall@k: does an expected source appear in the top‑k?
  • nDCG@k: are the best sources ranked higher?

Slice results by domain/jurisdiction/language to detect systematic gaps.
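Both metrics are standard and take only a few lines to implement. A sketch, assuming retrieval returns an ordered list of document IDs and that graded relevance judgements (source ID → relevance grade) are available for nDCG:

```python
import math

def recall_at_k(expected, retrieved, k):
    """1.0 if any expected source appears in the top-k results, else 0.0."""
    return 1.0 if set(expected) & set(retrieved[:k]) else 0.0

def ndcg_at_k(relevance, retrieved, k):
    """Normalised DCG: rewards ranking the best sources higher.

    relevance: dict mapping source ID -> graded relevance (0 = irrelevant).
    """
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these per slice (domain, jurisdiction, language) rather than only globally is what surfaces the systematic gaps.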

Grounding metrics (RAG-specific)

Human rubric (quick to apply):

  • Citation precision: do cited passages actually support the claim?
  • Quote correctness: are quotes accurate and not out of context?
  • Uncertainty handling: does the assistant say “unknown” when sources don’t support an answer?

Minimum expectation: any strong factual claim should be backed by at least one retrieved citation.
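Rubric judgements are cheap to aggregate once collected. A sketch, assuming each judged answer gets a 0/1 label per criterion (the criterion keys below are assumed names matching the rubric):

```python
# Aggregate human rubric judgements into a pass rate per criterion.
# Criterion names are assumptions mirroring the rubric above.
from collections import defaultdict

CRITERIA = ("citation_precision", "quote_correctness", "uncertainty_handling")

def rubric_pass_rates(judgements):
    """judgements: list of dicts with a 0/1 value for each criterion."""
    totals = defaultdict(int)
    for j in judgements:
        for c in CRITERIA:
            totals[c] += j[c]
    n = len(judgements)
    return {c: totals[c] / n for c in CRITERIA}
```

Reporting each criterion separately (rather than one blended score) shows whether failures come from weak citations, misquoting, or overconfident answers.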

Rights, privacy, and injection tests (ethics-first)

We also run “red team” tests that specifically probe:

  • Copyright leakage: does the system reproduce long verbatim passages?

    • Expectation: short excerpts with citations; never reproduce full documents.
  • PII exposure: if any PII is in the corpus, can the system be prompted to reveal it?

    • Expectation: avoid ingesting PII in the first place; redact what remains; refuse when prompted to reveal it.
  • Prompt injection in sources: can a malicious PDF/web page override system instructions?

    • Expectation: enforce an instruction hierarchy and treat all retrieved text as untrusted input.
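The copyright-leakage check can be automated with a simple n-gram overlap heuristic: find the longest word sequence an answer shares verbatim with a source, and flag answers that exceed a policy threshold. A sketch (whitespace tokenisation only; a production check would also normalise punctuation and case):

```python
def longest_verbatim_run(answer: str, source: str) -> int:
    """Length, in words, of the longest word sequence from `answer`
    that appears verbatim in `source`. A copyright-leakage signal:
    flag answers whose longest run exceeds a policy threshold."""
    a, s = answer.split(), source.split()
    longest, n = 0, 1
    while n <= len(a):
        source_ngrams = {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
        if any(tuple(a[i:i + n]) in source_ngrams
               for i in range(len(a) - n + 1)):
            longest = n
            n += 1
        else:
            break
    return longest
```

For example, `longest_verbatim_run("the quick brown fox jumps", "a quick brown fox ran")` is 3: the answer reuses "quick brown fox" verbatim but nothing longer.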

What “works” means operationally

Track these dashboard KPIs per corpus slice:

  • % of PDFs with ethical_review_status != unreviewed
  • % of PDFs with rights evidence recorded
  • % of corpus in ALLOW_FULLTEXT vs METADATA_ONLY
  • Retrieval quality by slice (policy vs standards vs academic)

If quality improves but rights evidence declines, treat that as a project risk (not a success).
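The KPIs above can be computed directly from per-document metadata. A sketch, assuming each document record carries `ethical_review_status`, `rights_evidence`, and an `access_tier` field holding `ALLOW_FULLTEXT` or `METADATA_ONLY` (these field names are assumptions matching the KPI list, not a fixed schema):

```python
# Compute dashboard KPIs from per-document metadata.
# Field names are assumptions mirroring the KPI list above.
def corpus_kpis(docs):
    n = len(docs)
    return {
        "pct_reviewed": sum(
            d["ethical_review_status"] != "unreviewed" for d in docs) / n,
        "pct_rights_evidence": sum(
            bool(d.get("rights_evidence")) for d in docs) / n,
        "pct_allow_fulltext": sum(
            d["access_tier"] == "ALLOW_FULLTEXT" for d in docs) / n,
    }
```

Running this per corpus slice, alongside the retrieval metrics, makes the failure mode in the last paragraph visible: a slice where retrieval quality rises while `pct_rights_evidence` falls should trip a review, not a celebration.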