Policies

Policies that govern data sourcing, privacy, and takedown/removal requests.

Data sources policy

Scope

This policy governs how we collect, store, and use source materials that feed AIethicsChat’s retrieval corpus.

It applies to:

  • the retrieval corpus (documents, chunks, transcripts) stored in Postgres
  • metadata stored about sources (title, URL, license, notes)
  • derived artifacts we generate (transcripts, chunks, embeddings)

We are not training a foundation model on this corpus. The intended use is retrieval and citation-backed answers.

Goals

  • Make provenance and permissions explicit.
  • Minimize copyright, privacy, and security risks.
  • Support classroom/academic use with strong citation norms.
  • Maintain a defensible, repeatable curation workflow.

Core principles

  1. Provenance-first: every item must have a source URL or stable identifier.
  2. Permission-aware: only include content we have the right to store and serve as excerpts.
  3. Minimize harm: avoid collecting sensitive personal data; treat vulnerable populations with care.
  4. Traceability: every chunk should trace back to a source item and license label.
  5. Right to removal: provide a clear takedown process.

Allowed sources

Prefer sources that are:

  • public domain
  • government/public-sector publications that are clearly released for public use (or explicitly licensed)
  • open-access academic materials with clear reuse terms
  • Creative Commons licensed content (respecting restrictions)
  • content with explicit written permission to ingest

Disallowed or restricted sources

Do not ingest:

  • pirated copies of books/articles
  • paywalled materials unless we have explicit rights to store and use them
  • materials containing personally identifiable information (PII) that isn’t essential to the educational goal
  • content whose license prohibits redistribution or derivative works (unless explicit permission covers this)
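
The allowed and disallowed lists above could be approximated by a coarse pre-filter at ingestion time. The sketch below is only an illustration: the function name may_ingest, the license label spellings, and the decision to reject any unrecognized label are all assumptions, and a human reviewer would still confirm each decision.

```python
def may_ingest(license_label: str, has_written_permission: bool = False) -> bool:
    """Coarse pre-filter; label spellings here are hypothetical, not a standard."""
    # Explicit written permission overrides the license label.
    if has_written_permission:
        return True

    # Normalize case and whitespace so "CC  BY 4.0" and "cc by 4.0" compare equal.
    label = " ".join(license_label.lower().split())

    if label in {"public domain", "cc0"}:
        return True

    if label.startswith("cc by"):
        # NoDerivatives variants prohibit derivative works, which the policy
        # disallows unless explicit permission covers it.
        return "-nd" not in label

    # Anything unrecognized (including paywalled or "all rights reserved"
    # material) is rejected by default and left for manual review.
    return False
```

Rejecting unknown labels by default keeps the filter conservative: a source with unclear terms is held back until someone records a specific license or a written permission.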

Minimum required metadata per source

For each source item we ingest, record at least:

  • title
  • source type (paper, standard, policy report, video transcript, etc.)
  • origin (publisher/author/channel/organization)
  • URL (or stable identifier)
  • license (as specific as possible, including version)
  • retrieved_at (UTC)
  • curation notes (why included, intended learning value)
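
One way to enforce the minimum metadata is to validate it when a record is created. The following is a minimal sketch, assuming a hypothetical SourceRecord class whose field names mirror the list above; it is not the project's actual schema.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical record shape; field names mirror the metadata list."""
    title: str
    source_type: str       # e.g. "paper", "standard", "video transcript"
    origin: str            # publisher / author / channel / organization
    url: str               # URL or other stable identifier
    license: str           # as specific as possible, e.g. "CC BY 4.0"
    retrieved_at: datetime
    curation_notes: str    # why included, intended learning value

    def __post_init__(self) -> None:
        # Reject blank values in any required text field.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                raise ValueError(f"missing required metadata: {f.name}")
        # retrieved_at must be timezone-aware with a zero UTC offset.
        offset = self.retrieved_at.utcoffset()
        if offset is None or offset.total_seconds() != 0:
            raise ValueError("retrieved_at must be a UTC timestamp")


record = SourceRecord(
    title="Example report",
    source_type="policy report",
    origin="Example Organization",
    url="https://example.org/report",
    license="CC BY 4.0",
    retrieved_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
    curation_notes="Illustrative entry only.",
)
```

Failing at construction time means an incomplete record never reaches the corpus, which supports the traceability principle.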

Privacy & sensitive data

  • Avoid ingesting phone numbers, email addresses, home addresses, or similar identifiers.
  • If a source is valuable but contains incidental PII, prefer redaction at the transcript/chunk stage.
  • Avoid ingesting confidential student work unless there is written consent and access is restricted.
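
Redaction at the transcript/chunk stage might look like the sketch below. The patterns are deliberately simple heuristics and are assumptions, not a vetted PII detector: they cover email addresses and phone-like digit runs only, the nine-digit threshold is an arbitrary choice to avoid masking year ranges, and home addresses would need a different approach.

```python
import re

# Heuristic patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def _mask_phone(match: re.Match) -> str:
    # Only mask candidates with at least 9 digits, so short digit runs
    # such as "2019-2021" are left untouched.
    digits = sum(ch.isdigit() for ch in match.group())
    return "[REDACTED PHONE]" if digits >= 9 else match.group()


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    return PHONE_RE.sub(_mask_phone, text)


print(redact_pii("Contact jane@example.org or +1 (555) 010-9999."))
# prints: Contact [REDACTED EMAIL] or [REDACTED PHONE].
```

Running the redaction before chunking and embedding keeps the identifiers out of every derived artifact, not just the stored text.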

Takedown / removal

If a rights-holder requests removal:

  • remove the item and its derived artifacts (chunks/embeddings) promptly
  • record the request and action taken (date, reason, scope of deletion)
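
The two takedown steps can be tied together in a single transaction so that the deletion and its log entry succeed or fail together. The sketch below uses sqlite3 from the Python standard library as a stand-in for Postgres so that it runs anywhere; the table names, column names, and the take_down function are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: chunks and embeddings cascade from their source.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    body TEXT NOT NULL
);
CREATE TABLE embeddings (
    chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id) ON DELETE CASCADE,
    vector BLOB NOT NULL
);
CREATE TABLE takedown_log (
    id INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    requested_by TEXT NOT NULL,
    reason TEXT NOT NULL,
    chunks_deleted INTEGER NOT NULL,
    actioned_at TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
    conn.executescript(SCHEMA)
    return conn


def take_down(conn: sqlite3.Connection, source_id: int,
              requested_by: str, reason: str) -> int:
    """Delete a source with its derived artifacts and log the action.

    Returns the number of chunks removed.
    """
    with conn:  # commits on success, rolls back on any exception
        row = conn.execute(
            "SELECT url FROM sources WHERE id = ?", (source_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no source with id {source_id}")
        (n_chunks,) = conn.execute(
            "SELECT COUNT(*) FROM chunks WHERE source_id = ?", (source_id,)
        ).fetchone()
        # Cascades remove the chunks and their embeddings.
        conn.execute("DELETE FROM sources WHERE id = ?", (source_id,))
        conn.execute(
            "INSERT INTO takedown_log "
            "(source_url, requested_by, reason, chunks_deleted, actioned_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (row[0], requested_by, reason, n_chunks,
             datetime.now(timezone.utc).isoformat()),
        )
    return n_chunks
```

The log row keeps the source URL, the reason, and the scope of deletion, but none of the removed content itself.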

Review cadence

  • Audit newly added sources monthly during active ingestion.
  • Re-audit the full corpus at least once per term/semester.