Policy & curation process
A simple, defensible workflow: intake → automated checks → human review → publish → audit/removal.
This is the workflow we use to keep the corpus license-aware, reviewable, and easy to audit.
The policy (short version)
We’re building a retrieval corpus for AI ethics research.
- We only store and retrieve full text when we have permission evidence (public domain, open license, or explicit permission) that supports full‑text use.
- When rights are unclear, we keep metadata only (title, citation, link) and do not store full text.
- We maintain provenance, allow removal, and audit all inclusion/exclusion decisions.
Ingest policies (what the pipeline is allowed to store)
ALLOW_FULLTEXT
- Allowed when: public domain, open license, or explicit permission.
- Requires: rights evidence URL/note + human review status.
METADATA_ONLY
- Use when: relevance is high but rights/permission are unclear or restrictive.
- Store: citation + link + minimal metadata.
BLOCKED
- Use when: rights are clearly restricted and the content cannot be used, or privacy/safety concerns dominate.
- Action: remove artifacts and derived chunks/embeddings.
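The three policies above can be sketched as a default proposal rule. This is a minimal illustration, not our actual pipeline code; the enum and the `propose_policy` function are hypothetical names, and the real decision is always made by a human reviewer.

```python
from enum import Enum

class IngestPolicy(Enum):
    """Hypothetical enum mirroring the three ingest policies above."""
    ALLOW_FULLTEXT = "allow_fulltext"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"

def propose_policy(has_rights_evidence: bool,
                   rights_restricted: bool,
                   privacy_concern: bool) -> IngestPolicy:
    """Map the rules above onto a default proposal.

    A human reviewer still confirms or overrides this suggestion.
    """
    if rights_restricted or privacy_concern:
        return IngestPolicy.BLOCKED          # remove artifacts + derived data
    if has_rights_evidence:
        return IngestPolicy.ALLOW_FULLTEXT   # evidence supports full-text use
    return IngestPolicy.METADATA_ONLY        # unclear rights: citation + link only
```

Note the ordering: restriction and privacy concerns dominate, so they are checked first.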
Minimum metadata we require
We try to make these fields non-optional, because they’re what make audits possible:
- Provenance: source_url, retrieved_at, pdf_sha256
- Identity: title, kind, language (best-effort)
- Rights: license_spdx (or license name) + rights_evidence_url + rights_evidence_note
- Decision: ingest_policy + ethical_review_status + reviewer identity/time
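As a sketch of what "non-optional" means in practice, here is a hypothetical record type where every audit field is required at construction time (the class name `DocumentRecord` and its exact field set are illustrative, not our schema):

```python
from dataclasses import dataclass

@dataclass
class DocumentRecord:
    """Hypothetical record with the non-optional audit fields listed above."""
    # Provenance
    source_url: str
    retrieved_at: str            # ISO 8601 timestamp
    pdf_sha256: str
    # Identity
    title: str
    kind: str
    language: str                # best-effort
    # Rights
    license_spdx: str            # or a plain license name
    rights_evidence_url: str
    rights_evidence_note: str
    # Decision
    ingest_policy: str
    ethical_review_status: str
    reviewed_by: str
    reviewed_at: str
```

Because none of the fields have defaults, omitting any of them raises a `TypeError`, which is exactly the failure mode we want at intake time.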
The workflow
Intake (starts unreviewed)
- Record where the PDF came from and attach any license/permission link.
Automated checks (fast)
- Hash + dedupe
- Identify likely PII/sensitive content (flag, don’t block automatically)
- Validate that a rights evidence URL is present when proposing ALLOW_FULLTEXT
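The automated checks above can be sketched as a single pass that returns flags rather than verdicts (the function name `automated_checks` is hypothetical; PII detection is stubbed out here because it depends on tooling outside this sketch):

```python
import hashlib

def sha256_of(pdf_bytes: bytes) -> str:
    """Content hash used for provenance (pdf_sha256) and dedupe."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def automated_checks(pdf_bytes: bytes, seen_hashes: set,
                     proposed_policy: str, rights_evidence_url) -> list:
    """Return a list of flags. Flags inform human review; they never auto-approve
    or auto-block (except that missing evidence blocks an ALLOW_FULLTEXT proposal
    from proceeding until fixed)."""
    flags = []
    digest = sha256_of(pdf_bytes)
    if digest in seen_hashes:
        flags.append("duplicate")
    seen_hashes.add(digest)
    if proposed_policy == "ALLOW_FULLTEXT" and not rights_evidence_url:
        flags.append("missing_rights_evidence")
    # PII/sensitive-content detection would append e.g. "possible_pii" here;
    # per policy it only flags for review, it never blocks automatically.
    return flags
```

Keeping the output a flat list of flags makes the check results easy to store alongside the document for later audit.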
Human review (the actual decision)
- Approve full-text, approve metadata-only, needs legal review, or reject/remove.
Publish (enforcement)
- Once approved, proceed to chunk/embed for retrieval.
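Enforcement at publish time can be a one-line gate: nothing is chunked or embedded unless the stored review status says so. A minimal sketch, assuming the metadata layout shown in the review-status taxonomy (the function name `may_publish_fulltext` is hypothetical):

```python
def may_publish_fulltext(metadata: dict) -> bool:
    """Gate chunking/embedding on an explicit approved_fulltext status.

    Anything else, including a missing ethical_review block, fails closed.
    """
    status = metadata.get("ethical_review", {}).get("status")
    return status == "approved_fulltext"
```

Failing closed matters: a document with no review record behaves exactly like a rejected one.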
Review status taxonomy
This is what we store under a document’s metadata:
```json
{
  "ethical_review": {
    "status": "approved_fulltext",
    "reviewed_by": "name-or-role",
    "reviewed_at": "2026-02-23T00:00:00Z",
    "notes": "Why this is OK; cite license evidence"
  },
  "rights": {
    "evidence_url": "https://...",
    "evidence_note": "License statement location / screenshot reference"
  }
}
```
Allowed status values:
- unreviewed
- approved_fulltext
- approved_metadata_only
- needs_legal_review
- rejected_remove
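Since the taxonomy is closed, it is cheap to reject metadata that uses a status outside it. A small sketch (the constant and function names are illustrative):

```python
ALLOWED_STATUSES = {
    "unreviewed",
    "approved_fulltext",
    "approved_metadata_only",
    "needs_legal_review",
    "rejected_remove",
}

def has_valid_review_status(doc_meta: dict) -> bool:
    """True only if the document carries a status from the closed taxonomy."""
    review = doc_meta.get("ethical_review", {})
    return review.get("status") in ALLOWED_STATUSES
```

Running this check at write time keeps typos like "approved" or "ok" from silently bypassing the publish gate.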
Multiple curated sets (optional, but useful)
We keep separate collections (or at least tags/filters) so we can compare behavior and avoid mixing incompatible rights expectations:
- Government/policy set (often open government licenses)
- Standards set (often restrictive; may require metadata-only)
- Open-access academic set (verify OA terms; don’t assume)
- Course readings set (high risk; usually metadata-only unless permission exists)
- Regional sets (EU/US/Canada/Global South) to monitor representation
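If the sets are implemented as tags rather than separate collections, retrieval can combine the set filter with the rights gate in one pass. A minimal sketch, assuming documents carry a `tags` list and the review metadata shown above (the function name `retrievable_docs` and the tag values are hypothetical):

```python
def retrievable_docs(docs: list, collection: str) -> list:
    """Restrict retrieval to one curated set, and only serve full text
    where the stored review status allows it."""
    return [
        d for d in docs
        if collection in d.get("tags", [])
        and d.get("ethical_review", {}).get("status") == "approved_fulltext"
    ]
```

This is what keeps, say, a metadata-only standards document from leaking full text into a query scoped to the standards set.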