Data sources policy

Scope

This policy governs how we collect, store, and use source materials that feed AIethicsChat’s retrieval corpus.

It applies to:

Collecting these materials does not mean we are training a foundation model; the intended use is retrieval with citation-backed answers.

Provenance-first: every item must have a source URL or stable identifier.
Permission-aware: only include content we have the right to store and serve as excerpts.
Minimize harm: avoid collecting sensitive personal data; treat vulnerable populations with care.
Traceability: every chunk should trace back to a source item and license label.
Right to removal: provide a clear takedown process.
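
The provenance and traceability rules above could be checked mechanically before indexing. The following is a minimal sketch, not our actual schema: the `SourceItem` and `Chunk` records, their field names, and both helper functions are hypothetical and exist only to illustrate the checks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SourceItem:
    """Hypothetical record for one ingested source item."""
    item_id: str
    source_url: Optional[str]   # source URL, if one exists
    stable_id: Optional[str]    # any other stable identifier (assumed field)
    license_label: str


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval chunk that points back to its source item."""
    chunk_id: str
    item_id: str
    text: str


def provenance_problems(item: SourceItem) -> List[str]:
    """List the provenance rules that this item fails (empty list means OK)."""
    problems = []
    if not (item.source_url or item.stable_id):
        problems.append("missing source URL or stable identifier")
    if not item.license_label.strip():
        problems.append("missing license label")
    return problems


def untraceable_chunks(chunks: List[Chunk], items: Dict[str, SourceItem]) -> List[str]:
    """Return IDs of chunks whose source item is unknown or fails a provenance rule."""
    bad = []
    for chunk in chunks:
        item = items.get(chunk.item_id)
        if item is None or provenance_problems(item):
            bad.append(chunk.chunk_id)
    return bad
```

A check like this would typically run as a gate before a batch is indexed, so that any chunk it reports is held back rather than served.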

Prefer sources that are:

public domain
government/public-sector publications that are clearly public (or explicitly licensed)
open-access academic materials with clear reuse terms
Creative Commons licensed content (respecting restrictions)
content with explicit written permission to ingest

Do not ingest:

pirated copies of books/articles
paywalled materials unless we have explicit rights to store and use them
materials containing personally identifiable information (PII) that isn’t essential to the educational goal
content whose license prohibits redistribution or derivative works (unless explicit permission covers this)
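
The allow and deny rules above can be approximated by a small screening function. This is a sketch under stated assumptions: the label strings, the two label sets, and the `ingest_decision` function are illustrative and not an authoritative mapping of licenses to outcomes. Anything the function does not recognize is sent to manual review rather than guessed at.

```python
# Illustrative label sets (assumptions): adjust to whatever labels the corpus uses.
OPEN_LABELS = {"public-domain", "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}
# Labels whose terms prohibit derivative works or redistribution without permission.
RESTRICTED_LABELS = {"CC-BY-ND-4.0", "CC-BY-NC-ND-4.0", "all-rights-reserved"}


def ingest_decision(license_label: str, has_written_permission: bool = False) -> str:
    """Return "allow", "deny", or "review" for a candidate source item."""
    # Explicit written permission to ingest covers otherwise restricted content.
    if has_written_permission:
        return "allow"
    if license_label in OPEN_LABELS:
        return "allow"
    if license_label in RESTRICTED_LABELS:
        return "deny"
    # Unknown or ambiguous terms: a person decides, not the script.
    return "review"
```

Returning "review" by default keeps the check conservative: a missing or misspelled label never results in silent ingestion.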

For each source item we ingest, record at least:

Avoid ingesting phone numbers, email addresses, home addresses, or similar identifiers.
If a source is valuable but contains incidental PII, prefer redaction at the transcript/chunk stage.
Avoid ingesting confidential student work unless there is written consent and access is restricted.
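
Redaction at the transcript/chunk stage might look like the sketch below. The patterns, the placeholder strings, and the `MIN_PHONE_DIGITS` threshold are assumptions chosen for illustration. They are deliberately rough, will miss some identifiers, and do not attempt home addresses, so redacted text still needs human review.

```python
import re

# Rough patterns (assumptions), not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")

# Candidates with fewer digits than this are left alone, which avoids
# redacting things like year ranges ("2019-2020").
MIN_PHONE_DIGITS = 9


def _redact_phone(match: re.Match) -> str:
    """Replace a phone-like match only if it contains enough digits."""
    candidate = match.group()
    digits = sum(ch.isdigit() for ch in candidate)
    return "[REDACTED PHONE]" if digits >= MIN_PHONE_DIGITS else candidate


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    text = PHONE_RE.sub(_redact_phone, text)
    return text


print(redact_pii("Contact jane@example.org or +1 555-010-9999."))
# prints: Contact [REDACTED EMAIL] or [REDACTED PHONE].
```

Placeholders are used instead of deletion so that a reader of the excerpt can see that something was removed and where.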

If a rights-holder requests removal: