Ethical data & evaluation sets
We’re building a sources-first RAG system, so data ethics here is mostly about what we ingest, what we store, and what we allow the assistant to retrieve and cite.
One important clarification: “training dataset” can mean different things in our context:
- Retrieval corpus: the curated documents we retrieve from (PDFs → extracted text → chunks → embeddings).
- Evaluation set: test cases (queries + expected sources/answers) used to measure quality.
- Fine-tuning data: only if/when we later train or fine-tune a model.
The guidance below applies to the first two (and extends to fine-tuning later).
What counts as “ethical data” in practice
Provenance & consent expectations (where it came from)
Prefer sources that are clearly intended for public reading/redistribution (for example: official policy PDFs, standards with permissible use, open-access academic papers). Avoid scraping behind logins/paywalls or bypassing access controls.
Lawfulness & licensing (what permission we can evidence)
Document the license and terms of use for every source—even if it is publicly accessible. Treat “no explicit license” as higher risk, not “free to use.”
Privacy by design (what we avoid collecting)
Minimize personal data. Avoid collecting sensitive or identifiable information unless there is a strong justification and safeguards. Add PII detection/redaction for any web/text ingestion.
Fairness & representation (coverage gaps are measurable)
Track what perspectives/regions/languages are over- or under-represented. Don’t let the corpus silently become “mostly US / English / industry.” Explicitly measure coverage gaps.
Harm minimization (what we exclude or constrain)
Exclude or flag content that could enable harm (for example: doxxing instructions, targeted harassment). Create rules for sensitive topics (medical, legal, minors) and ensure the assistant responds safely.
Transparency & accountability (audit trail)
Use dataset documentation patterns (e.g., “datasheets”-style documentation) and keep an auditable trail:
- What we ingested
- Why we ingested it
- Where it came from
- Under what terms
- What we removed and why
Our policy + curation workflow (lightweight, but real)
What our policy document defines
- Scope: what we collect (AI ethics, policy, standards, academic), and what we explicitly don’t.
- Source eligibility: allowed domains/types; disallowed sources (paywalled, ToS-prohibited scraping, personal data dumps, unclear provenance).
- Licensing rules: accepted licenses/terms; required attribution; when to store full text vs metadata-only.
- Privacy: PII scanning, redaction standards, retention/deletion, incident response.
- Quality + bias controls: required metadata (region, language, date, doc type), dedupe, and “balanced coverage” targets.
- Governance: who reviews, escalation path, and a change log.
The workflow we follow
- Intake: add a candidate source with metadata (URL/file, publisher, date, license/ToS notes, why it matters).
- Automated checks: file safety scan; PII scan; language detection; dedupe; license/robots/ToS checklist.
- Human review: approve / approve-with-constraints (e.g., “metadata-only”) / reject + reason.
- Tagging: jurisdiction, theme (privacy, fairness, labor, governance), doc type (law, guideline, paper), sensitivity level.
- Publish: ingest into the corpus with an immutable version ID; keep an audit record of who approved and why.
Why we maintain “different sets”
We maintain multiple curated collections so we can compare retrieval behavior and keep licensing/privacy expectations consistent:
- Policy-only
- Academic-only
- Standards-only
- Region-specific
Testing protocol: what works vs what doesn’t
Think of testing at three levels—retrieval quality, answer quality, and safety/compliance.
Retrieval tests (offline, repeatable)
- Build a gold set of queries with “must-cite” sources.
- Measure Recall@k / nDCG@k for whether the right documents appear in top-k.
- Track coverage metrics (performance by region/topic/doc type), not just overall.
Answer + citation tests (RAG-specific)
- Citation precision: do cited passages actually support the claim?
- Groundedness: answers should be traceable to retrieved text; penalize unsupported claims.
- Use a small human rubric: helpfulness, faithfulness, uncertainty handling, appropriate refusals.
Ethics/safety/compliance tests
- Privacy: ensure prompts can’t elicit personal data; test for PII regurgitation if any was ingested.
- Bias & representation: probe for stereotyping/omissions; compare across your curated collections.
- Copyright/licensing: ensure responses don’t reproduce long copyrighted passages; enforce excerpt limits + attribution.
- Security: prompt-injection tests on retrieved content (does a malicious page hijack the model’s behavior?).