Data sources policy

How we decide what sources are allowed into the retrieval corpus, and what metadata is required for auditability.

This is the policy we follow when we add sources to the AIethicsChat retrieval corpus.

It covers:

The retrieval corpus (documents → extracted text → chunks/embeddings).
The metadata we keep for audit (title, URL, license notes, timestamps, hashes).
The derived artifacts we generate (chunks, embeddings, indexes).

We’re not training a foundation model here. The point is retrieval + citation-backed answers.

What we’re optimizing for

Provenance we can explain (where did this file come from, and which exact version is it?).
Permissions we can evidence (what rights basis supports how we store/use it?).
Low privacy risk (avoid PII; minimize what we store).
A workflow we can repeat and audit.

Rules we don’t bend

Provenance-first: every item has a source URL/stable ID, timestamps, and a hash.
Permission-aware: “publicly accessible” is not the same as “OK to store and reuse.”
Minimize harm: avoid sensitive personal data unless we have a strong reason and safeguards.
Traceability: every chunk should trace back to a source item and a rights label.
Right to removal: if a rights-holder asks, we can remove the item and regenerate the derived data without it.
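
The traceability rule can be checked mechanically. The sketch below is illustrative only: it assumes chunks are stored as a mapping from chunk ID to source ID and sources as a mapping from source ID to a metadata dict with a `license` key. The function name `untraceable_chunks` and both data shapes are hypothetical, not something this policy prescribes.

```python
def untraceable_chunks(chunks: dict, sources: dict) -> list:
    """Return IDs of chunks that lack a known source or a rights label.

    chunks:  chunk_id -> source_id            (assumed shape)
    sources: source_id -> {"license": "..."}  (assumed shape)
    """
    bad = []
    for chunk_id, source_id in chunks.items():
        source = sources.get(source_id)
        # A chunk fails if its source is unknown or has an empty rights label.
        if source is None or not str(source.get("license", "")).strip():
            bad.append(chunk_id)
    return sorted(bad)
```

Running a check like this after each ingestion batch would surface orphaned chunks before they reach the index.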

What we’ll ingest

Allowed sources are typically:

Public-domain materials.
Government/public-sector publications that are clearly public (or explicitly licensed).
Open-access academic materials with clear reuse terms.
Creative Commons licensed content (CC0, CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA), respecting each license’s conditions (attribution, share-alike, non-commercial use).
Content we have explicit written permission to ingest.

What we won’t ingest (or we restrict)

Do not ingest:

Pirated copies of books/articles.
Paywalled materials unless we have explicit rights to store and use them.
Materials containing private personal data (PII) that is not essential to the educational goal.
Content whose license prohibits redistribution or derivative works (unless explicit permission says otherwise).
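
The allow and deny lists above could be turned into a simple ingestion gate. This is a sketch under stated assumptions: the label spellings, the three outcomes ("allow", "reject", "review"), and the choice to send unrecognized licenses to manual review are ours for illustration. Treating NoDerivatives ("ND") licenses as rejected follows the rule about licenses that prohibit derivative works; confirm that reading with whoever owns rights decisions.

```python
import re

# Normalized labels for license families this policy lists as allowed.
ALLOWED = {"public-domain", "cc0", "cc-by", "cc-by-sa", "cc-by-nc", "cc-by-nc-sa"}


def normalize_license(label: str) -> str:
    """Lowercase, unify separators, and drop a trailing version number."""
    s = label.strip().lower().replace("\u2011", "-")  # non-breaking hyphen
    s = re.sub(r"[\s-]+\d+(\.\d+)*$", "", s)          # "cc by-sa 4.0" -> "cc by-sa"
    return re.sub(r"[\s_]+", "-", s)                  # "cc by-sa" -> "cc-by-sa"


def ingestion_decision(label: str, has_written_permission: bool = False) -> str:
    """Return "allow", "reject", or "review" for a license label."""
    if has_written_permission:
        return "allow"  # explicit written permission overrides the license
    key = normalize_license(label)
    if key in ALLOWED:
        return "allow"
    if "-nd" in key or key == "all-rights-reserved":
        return "reject"  # no derivatives or no reuse rights
    return "review"      # assumption: unknown terms go to a human
```

For example, `ingestion_decision("CC BY-SA 4.0")` returns `"allow"`, while `ingestion_decision("CC BY-ND 4.0")` returns `"reject"` unless written permission is on file.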

Minimum metadata we record per source

For every source we ingest, we record:

title
source_type (e.g., paper, standard, policy report, video transcript)
origin (publisher/author/channel/organization)
url (or stable identifier)
license (as specific as possible, including version)
retrieved_at (UTC)
hash (of the retrieved content, to pin the exact version)
curation_notes (why included, intended learning value)
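
A minimal sketch of how such a record could be represented and validated in Python. The names `SourceRecord` and `make_record` are hypothetical. The `content_sha256` field reflects the provenance rule that every item has a hash; the field name and the choice of SHA-256 are assumptions, since this policy does not name an algorithm.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    title: str
    source_type: str
    origin: str
    url: str
    license: str
    retrieved_at: str      # ISO 8601 timestamp, always UTC
    curation_notes: str
    content_sha256: str    # assumed field name and algorithm


def make_record(content: bytes, *, title: str, source_type: str, origin: str,
                url: str, license: str, curation_notes: str,
                retrieved_at: Optional[datetime] = None) -> SourceRecord:
    """Build a record, refusing blank fields and naive timestamps."""
    when = retrieved_at or datetime.now(timezone.utc)
    if when.tzinfo is None:
        raise ValueError("retrieved_at must be timezone-aware")
    text_fields = {
        "title": title, "source_type": source_type, "origin": origin,
        "url": url, "license": license, "curation_notes": curation_notes,
    }
    missing = [name for name, value in text_fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing required metadata: {', '.join(missing)}")
    return SourceRecord(
        **text_fields,
        retrieved_at=when.astimezone(timezone.utc).isoformat(),
        content_sha256=hashlib.sha256(content).hexdigest(),
    )
```

Rejecting incomplete records at creation time keeps gaps out of the audit trail instead of leaving them to be found during a later re-audit.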

Privacy and sensitive data

Avoid ingesting content that includes personal phone numbers, email addresses, home addresses, or other personal identifiers.
If a source is valuable but contains incidental PII, prefer redaction at the transcript/chunk stage.
Avoid ingesting confidential student work unless written consent exists and access is restricted.
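
Redaction at the chunk stage might look like the sketch below. The patterns are deliberately rough assumptions: they catch common email and phone formats, will miss others, and can misfire on long digit runs such as timestamps. They do not handle home addresses at all. Treat this as a first pass before human review, not a guarantee. The function name and placeholder tokens are hypothetical.

```python
import re

# Rough patterns (assumptions): common email shapes and phone-like digit runs.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_chunk(text: str):
    """Replace emails and phone-like numbers; return (text, counts)."""
    text, n_email = EMAIL.subn("[REDACTED_EMAIL]", text)
    text, n_phone = PHONE.subn("[REDACTED_PHONE]", text)
    return text, {"email": n_email, "phone": n_phone}
```

Returning the counts alongside the text lets the ingestion log record how much was redacted from each chunk without storing the redacted values themselves.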

Representation and coverage

Aim for topic diversity (technical, philosophical, governance, sociotechnical, global perspectives).
Track and periodically review coverage gaps (region, language, discipline, stakeholder groups).
Maintain a quality tiering scheme where “quality” includes both credibility and pedagogical relevance.

Takedown / removal

Maintain an internal list of source URLs/IDs and who ingested them.
If a rights-holder requests removal, remove the item and its derived artifacts (chunks/embeddings) promptly.
Record the request and action taken (date, reason, scope of deletion).
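
The removal steps above can be sketched against a toy in-memory corpus. Everything here is an illustrative assumption: the `Corpus` structure, the `remove_source` helper, and the log fields. A real deployment would also delete the matching vectors from whatever embedding store and index are in use, keyed by the removed chunk IDs.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Corpus:
    sources: dict = field(default_factory=dict)      # source_id -> metadata
    chunks: dict = field(default_factory=dict)       # chunk_id -> source_id
    removal_log: list = field(default_factory=list)  # audit entries


def remove_source(corpus: Corpus, source_id: str, reason: str,
                  on: Optional[date] = None) -> dict:
    """Delete a source and its chunks, then log date, reason, and scope."""
    if source_id not in corpus.sources:
        raise KeyError(f"unknown source: {source_id}")
    doomed = sorted(cid for cid, sid in corpus.chunks.items() if sid == source_id)
    for chunk_id in doomed:
        del corpus.chunks[chunk_id]   # embeddings would be deleted by these IDs too
    del corpus.sources[source_id]
    entry = {
        "date": (on or date.today()).isoformat(),
        "reason": reason,
        "source_id": source_id,
        "chunks_deleted": doomed,
    }
    corpus.removal_log.append(entry)
    return entry
```

Logging the deleted chunk IDs records the scope of the deletion while keeping none of the removed content.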

How often we re-check

Re-audit newly added sources monthly during active ingestion.
Re-audit the full corpus at least once per term/semester.