Ethical data literature & class prompts
What ethical data means in a sources-first RAG project, plus classroom prompts for evidence-based curation.
This project is a sources-first RAG system. In practice, the “training dataset” we’re building right now is:
- the retrieval corpus (documents you store and retrieve), and
- the evaluation set (queries + expected sources).
If we ever add fine-tuning later, we’ll hold it to the same standards.
What we took from the literature (and common practice)
These themes show up over and over in practitioner guidance and dataset documentation work:
- Provenance is non-negotiable: you should be able to answer “where did this come from, when did we get it, and what version is it?”
- Permission beats accessibility: “publicly accessible” does not automatically mean “permitted to store, redistribute, or embed.”
- Data minimization: don’t collect sensitive or personal data unless the project truly needs it.
- Respect service boundaries: no paywall circumvention; comply with ToS/robots where applicable; rate limit.
- Auditability and removal: you need a deletion path—if a source is later found to be restricted, you can remove it and regenerate derived artifacts.
- Representation and bias: monitor what perspectives dominate (region/language/institution) and treat imbalance as a measurable risk.
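The "auditability and removal" theme above can be sketched as a small routine that deletes a source and its derived artifacts while leaving an audit trail. The directory layout, function name, and log file here are illustrative assumptions, not part of the project.

```python
import json
import shutil
from pathlib import Path


def remove_source(doc_id: str, corpus_dir: Path, index_dir: Path, reason: str) -> dict:
    """Remove a restricted source plus its derived artifacts, and log why.

    Assumes a hypothetical layout: raw documents under corpus_dir/<doc_id>/
    and derived chunks/embeddings under index_dir/<doc_id>/.
    """
    removed = []
    for root in (corpus_dir, index_dir):
        target = root / doc_id
        if target.exists():
            shutil.rmtree(target)          # delete raw copy or derived artifacts
            removed.append(str(target))
    audit = {"doc_id": doc_id, "reason": reason, "removed_paths": removed}
    with (corpus_dir / "removals.log").open("a") as log:
        log.write(json.dumps(audit) + "\n")  # append-only audit record
    return audit
```

After a removal like this, any indexes built from the deleted artifacts would be regenerated so no embedding of the restricted text survives.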
Our working definition: “ethical data”
For our pipeline, a document is “ethical to include” when it is:
- Traceable: we have the source URL (or a stable ID), the retrieval time, and a cryptographic hash.
- Permissioned: we have license/permission evidence that supports the chosen ingest policy.
- Non-sensitive by default: the content does not contain personal/sensitive data, or it has been minimized/redacted.
- Non-circumvented: collected without bypassing access controls.
- Reviewable: a human can inspect and approve/reject with a clear reason.
A reality check from our own ingestion
- Local ingestion (e.g., course folders) gives strong provenance (file path + hash), but it does not automatically establish reuse rights.
- So we treat those items as unreviewed until we record evidence, then decide ALLOW_FULLTEXT vs METADATA_ONLY.
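A minimal sketch of that local-ingestion behavior: record the path and hash for provenance, but start every item as unreviewed with a metadata-only policy. The policy strings mirror the ones used above; the record shape is an assumption.

```python
import hashlib
from pathlib import Path


def ingest_local_file(path: Path) -> dict:
    """Strong provenance (path + sha256) does not establish reuse rights,
    so every local file enters the pipeline as unreviewed and metadata-only."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "source": str(path.resolve()),       # stable local provenance
        "sha256": digest,                    # content hash for auditability
        "review_status": "unreviewed",
        "ingest_policy": "METADATA_ONLY",    # upgraded to ALLOW_FULLTEXT only after review
    }
```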
Class prompts (what we ask students to decide)
We use prompts like these to involve the class in defining and auditing “ethical data”:
- Permission question: What evidence is sufficient to mark a source as “approved for full-text”? (license page link, publisher policy, government open license statement, written permission)
- Boundary question: When should a source be METADATA_ONLY even if it is relevant? (unclear license, ambiguous redistribution rights)
- Sensitivity question: What categories of content should be blocked regardless of relevance? (PII-heavy docs, minors, medical records, doxxing)
- Representation question: How do we ensure the corpus is not dominated by one region/language? What minimum coverage targets should we set?
- Accountability question: If a user reports a source as copyrighted, what is the response timeline and removal workflow?
Rule of thumb we actually use
If you cannot point to permission evidence quickly, treat the document as unverified and keep it metadata-only until reviewed.