Policies
Policies that govern data sourcing, privacy, and takedown/removal requests.
Data sources policy
Scope
This policy governs how we collect, store, and use source materials that feed AIethicsChat’s retrieval corpus.
It applies to:
- the retrieval corpus (documents, chunks, transcripts) stored in Postgres
- metadata stored about sources (title, URL, license, notes)
- derived artifacts we generate (transcripts, chunks, embeddings)
This corpus is not used to train a foundation model. Its intended use is retrieval and citation-backed answers.
Goals
- Make provenance and permissions explicit.
- Minimize copyright, privacy, and security risks.
- Support classroom/academic use with strong citation norms.
- Maintain a defensible, repeatable curation workflow.
Core principles
- Provenance-first: every item must have a source URL or stable identifier.
- Permission-aware: only include content we have the right to store and serve as excerpts.
- Minimize harm: avoid collecting sensitive personal data; take particular care with material about vulnerable populations.
- Traceability: every chunk should trace back to a source item and license label.
- Right to removal: provide a clear takedown process.
Allowed sources
Prefer sources that are:
- public domain
- government/public-sector publications that are clearly public (or explicitly licensed)
- open-access academic materials with clear reuse terms
- Creative Commons licensed content (respecting restrictions)
- content with explicit written permission to ingest
Disallowed or restricted sources
Do not ingest:
- pirated copies of books/articles
- paywalled materials unless we have explicit rights to store and use them
- materials containing private personal data (PII) that isn’t essential to the educational goal
- content whose license prohibits redistribution or derivative works (unless explicit permission covers this)
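As an illustration, the allowed and disallowed categories above could be applied at ingestion time as a simple check on the recorded license label. This is a minimal sketch, not part of the policy: the function name `check_license`, the label strings, and the three outcomes are assumptions, and any real list would need to be maintained by curators against the actual license texts.

```python
from typing import Optional

# Hypothetical label sets for illustration; curators own the real lists.
ALLOWED = {"public-domain", "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}
# Usable only if the stated restrictions are respected, so a human decides.
NEEDS_REVIEW = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0"}
# No-derivatives terms conflict with chunking unless permission covers it.
NO_DERIVATIVES = {"CC-BY-ND-4.0", "CC-BY-NC-ND-4.0"}


def check_license(label: Optional[str], has_written_permission: bool = False) -> str:
    """Return 'allow', 'review', or 'reject' for a source's license label."""
    if has_written_permission:
        # Explicit written permission to ingest overrides the label.
        return "allow"
    if not label:
        # No recorded license means permissions are not explicit.
        return "reject"
    if label in ALLOWED:
        return "allow"
    if label in NEEDS_REVIEW:
        return "review"
    # Covers NO_DERIVATIVES and any label not on a known list.
    return "reject"
```

Unknown labels are rejected by default in this sketch, so a source only enters the corpus after someone has recorded a recognized license or written permission.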
Minimum required metadata per source
For each source item we ingest, record at least:
- title
- source type (paper, standard, policy report, video transcript, etc.)
- origin (publisher/author/channel/organization)
- URL (or stable identifier)
- license (as specific as possible, including version)
- retrieved_at (UTC)
- curation notes (why included, intended learning value)
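The required fields above could be captured in a small record type with a completeness check. The sketch below is one possible shape, assuming Python 3.9+; the names `SourceRecord` and `missing_fields` are hypothetical and do not describe the project's actual schema.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta, timezone


@dataclass
class SourceRecord:
    title: str
    source_type: str        # e.g. "paper", "standard", "policy report"
    origin: str             # publisher / author / channel / organization
    url: str                # URL or other stable identifier
    license: str            # as specific as possible, including version
    retrieved_at: datetime  # expected to be timezone-aware UTC
    curation_notes: str     # why included, intended learning value


def missing_fields(record: SourceRecord) -> list[str]:
    """Return the names of fields that are blank or not usable as recorded."""
    problems = []
    for f in fields(record):
        value = getattr(record, f.name)
        # Text fields must contain something other than whitespace.
        if isinstance(value, str) and not value.strip():
            problems.append(f.name)
    # The timestamp must carry an explicit UTC offset of zero.
    ts = record.retrieved_at
    if ts.tzinfo is None or ts.utcoffset() != timedelta(0):
        problems.append("retrieved_at")
    return problems


example = SourceRecord(
    title="Example report",
    source_type="policy report",
    origin="Example Organization",
    url="https://example.org/report",
    license="CC-BY-4.0",
    retrieved_at=datetime.now(timezone.utc),
    curation_notes="",
)
print(missing_fields(example))  # the blank curation_notes field is reported
```

An ingestion step could refuse any record for which `missing_fields` returns a non-empty list.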
Privacy & sensitive data
- Avoid ingesting phone numbers, email addresses, home addresses, or similar identifiers.
- If a source is valuable but contains incidental PII, prefer redaction at the transcript/chunk stage.
- Avoid ingesting confidential student work unless there is written consent and access is restricted.
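Redaction at the transcript/chunk stage might look like the following sketch. The patterns and the `redact_pii` name are assumptions for illustration; they are deliberately simple, will miss some identifiers (home addresses, for example, are not handled), and can over-match digit runs such as year ranges, so they complement human review rather than replace it.

```python
import re

# Simple illustrative patterns, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, at least seven digits/separators, then a closing digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    text = PHONE_RE.sub("[REDACTED PHONE]", text)
    return text


sample = "Contact jane.doe@example.org or +1 (555) 010-9999 for details."
print(redact_pii(sample))
# Contact [REDACTED EMAIL] or [REDACTED PHONE] for details.
```

Running redaction before chunking and embedding keeps the identifiers out of every derived artifact, not only the stored transcript.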
Takedown / removal
If a rights-holder requests removal:
- remove the item and its derived artifacts (transcripts, chunks, embeddings) promptly
- record the request and action taken (date, reason, scope of deletion)
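The two steps above can be sketched as a single transactional operation. This example uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere; the table and column names (`sources`, `chunks`, `embeddings`, `takedown_log`) and the `remove_source` function are hypothetical, and transcripts are omitted for brevity but would need the same treatment.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical, simplified schema; the real corpus schema may differ.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, title TEXT, license TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, source_id INTEGER, text TEXT);
CREATE TABLE embeddings (id INTEGER PRIMARY KEY, chunk_id INTEGER, vector BLOB);
CREATE TABLE takedown_log (
    id INTEGER PRIMARY KEY, source_id INTEGER, actioned_at TEXT,
    reason TEXT, scope TEXT
);
"""


def remove_source(conn: sqlite3.Connection, source_id: int, reason: str) -> dict:
    """Delete a source and its derived artifacts, then log what was removed."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.cursor()
        # Remove embeddings first, since they reference the chunks.
        cur.execute(
            "DELETE FROM embeddings WHERE chunk_id IN "
            "(SELECT id FROM chunks WHERE source_id = ?)",
            (source_id,),
        )
        n_embeddings = cur.rowcount
        cur.execute("DELETE FROM chunks WHERE source_id = ?", (source_id,))
        n_chunks = cur.rowcount
        cur.execute("DELETE FROM sources WHERE id = ?", (source_id,))
        n_sources = cur.rowcount
        # Record date, reason, and scope of deletion.
        scope = f"sources={n_sources}, chunks={n_chunks}, embeddings={n_embeddings}"
        cur.execute(
            "INSERT INTO takedown_log (source_id, actioned_at, reason, scope) "
            "VALUES (?, ?, ?, ?)",
            (source_id, datetime.now(timezone.utc).isoformat(), reason, scope),
        )
    return {"sources": n_sources, "chunks": n_chunks, "embeddings": n_embeddings}
```

Doing the deletes and the log entry in one transaction means a failure partway through leaves neither orphaned embeddings nor an unlogged removal.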
Review cadence
- Audit newly added sources monthly during active ingestion.
- Re-audit the full corpus at least once per term/semester.