Ethical data literature & class prompts
What ethical data means in a sources-first RAG project, plus classroom prompts for evidence-based curation.
This project is a sources-first RAG system. In practice, the “training dataset” we’re building right now is:
- the retrieval corpus (documents you store and retrieve), and
- the evaluation set (queries + expected sources).
If we ever add fine-tuning later, we’ll hold it to the same standards.
What we took from the literature (and common practice)
These themes show up over and over in practitioner guidance and dataset documentation work:
- Provenance is non-negotiable: you should be able to answer “where did this come from, when did we get it, and what version is it?”
- Permission beats accessibility: “publicly accessible” does not automatically mean “permitted to store, redistribute, or embed.”
- Data minimization: don’t collect sensitive or personal data unless the project truly needs it.
- Respect service boundaries: no paywall circumvention; comply with ToS/robots where applicable; rate limit.
- Auditability and removal: you need a deletion path—if a source is later found to be restricted, you can remove it and regenerate derived artifacts.
- Representation and bias: monitor what perspectives dominate (region/language/institution) and treat imbalance as a measurable risk.
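The "auditability and removal" theme above can be sketched as a small routine that deletes a source and its derived artifacts while leaving an audit trail. The directory layout, function name, and log file here are illustrative assumptions, not part of the project.

```python
import json
import shutil
from pathlib import Path


def remove_source(doc_id: str, corpus_dir: Path, index_dir: Path, reason: str) -> dict:
    """Remove a restricted source plus its derived artifacts, and log why.

    Assumes a hypothetical layout: raw documents under corpus_dir/<doc_id>/
    and derived chunks/embeddings under index_dir/<doc_id>/.
    """
    removed = []
    for root in (corpus_dir, index_dir):
        target = root / doc_id
        if target.exists():
            shutil.rmtree(target)          # delete raw copy or derived artifacts
            removed.append(str(target))
    audit = {"doc_id": doc_id, "reason": reason, "removed_paths": removed}
    with (corpus_dir / "removals.log").open("a") as log:
        log.write(json.dumps(audit) + "\n")  # append-only audit record
    return audit
```

After a removal like this, any indexes built from the deleted artifacts would be regenerated so no embedding of the restricted text survives.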
Our working definition: “ethical data”
For our pipeline, a document is “ethical to include” when it is:
- Traceable: we have the source URL (or a stable ID), the retrieval time, and a cryptographic hash.
- Permissioned: we have license/permission evidence that supports the chosen ingest policy.
- Non-sensitive by default: the content does not contain personal/sensitive data, or it has been minimized/redacted.
- Non-circumvented: collected without bypassing access controls.
- Reviewable: a human can inspect and approve/reject with a clear reason.
A reality check from our own ingestion
- Local ingestion (e.g., course folders) gives strong provenance (file path + hash), but it does not automatically establish reuse rights.
- So we treat those items as unreviewed until we record evidence, then decide ALLOW_FULLTEXT vs METADATA_ONLY.
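A minimal sketch of that local-ingestion behavior: record the path and hash for provenance, but start every item as unreviewed with a metadata-only policy. The policy strings mirror the ones used above; the record shape is an assumption.

```python
import hashlib
from pathlib import Path


def ingest_local_file(path: Path) -> dict:
    """Strong provenance (path + sha256) does not establish reuse rights,
    so every local file enters the pipeline as unreviewed and metadata-only."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "source": str(path.resolve()),       # stable local provenance
        "sha256": digest,                    # content hash for auditability
        "review_status": "unreviewed",
        "ingest_policy": "METADATA_ONLY",    # upgraded to ALLOW_FULLTEXT only after review
    }
```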
Class prompts (what we ask students to decide)
We use prompts like these to involve the class in defining and auditing “ethical data”:
- Permission question: What evidence is sufficient to mark a source as “approved for full-text”? (license page link, publisher policy, government open license statement, written permission)
- Boundary question: When should a source be METADATA_ONLY even if it is relevant? (unclear license, ambiguous redistribution rights)
- Sensitivity question: What categories of content should be blocked regardless of relevance? (PII-heavy docs, minors, medical records, doxxing)
- Representation question: How do we ensure the corpus is not dominated by one region/language? What minimum coverage targets should we set?
- Accountability question: If a user reports a source as copyrighted, what is the response timeline and removal workflow?
Rule of thumb we actually use
If you cannot point to permission evidence quickly, treat the document as unverified and keep it metadata-only until reviewed.