Policy & curation process

A simple, defensible workflow: intake → automated checks → human review → publish → audit/removal. It keeps the corpus license-aware, reviewable, and easy to audit.

The policy (short version)

We’re building a retrieval corpus for AI ethics research.

  • We only store and retrieve full text when we have permission evidence (public domain, open license, or explicit permission) that supports full‑text use.
  • When rights are unclear, we keep metadata only (title, citation, link) and do not store full text.
  • We maintain provenance, allow removal, and audit all inclusion/exclusion decisions.

Ingest policies (what the pipeline is allowed to store)

  • ALLOW_FULLTEXT

    • Allowed when: public-domain, open license, or explicit permission.
    • Requires: rights evidence URL/note + human review status.
  • METADATA_ONLY

    • Use when: relevance is high but rights/permission unclear or restrictive.
    • Store: citation + link + minimal metadata.
  • BLOCKED

    • Use when: the material is clearly rights-restricted and cannot be used, or privacy/safety concerns dominate.
    • Action: remove artifacts and derived chunks/embeddings.
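The three policies above can be expressed as an enum plus a single function that answers "what may the pipeline persist?" A minimal sketch; the names (`IngestPolicy`, `storable_fields`) are illustrative, not part of any existing codebase:

```python
from enum import Enum

class IngestPolicy(Enum):
    ALLOW_FULLTEXT = "allow_fulltext"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"

def storable_fields(policy: IngestPolicy) -> set[str]:
    """What the pipeline is allowed to persist under each policy."""
    if policy is IngestPolicy.ALLOW_FULLTEXT:
        # Full text plus everything metadata-only would keep.
        return {"full_text", "citation", "link", "metadata"}
    if policy is IngestPolicy.METADATA_ONLY:
        # Citation + link + minimal metadata; never the full text.
        return {"citation", "link", "metadata"}
    # BLOCKED: store nothing; artifacts and derived chunks/embeddings are removed.
    return set()
```

Keeping the decision in one function means every storage path can check it, rather than each writer re-implementing the policy.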

Minimum metadata we require

We treat these fields as required, because they are what make audits possible:

  • Provenance: source_url, retrieved_at, pdf_sha256
  • Identity: title, kind, language (best-effort)
  • Rights: license_spdx (or license name) + rights_evidence_url + rights_evidence_note
  • Decision: ingest_policy + ethical_review_status + reviewer identity/time
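One way to make the required fields enforceable is a flat record type whose field list doubles as the audit checklist. A sketch under the assumption that records arrive as raw dicts; the class and helper names are hypothetical:

```python
from dataclasses import dataclass, fields

@dataclass
class DocumentRecord:
    # Provenance
    source_url: str
    retrieved_at: str            # ISO 8601 timestamp
    pdf_sha256: str
    # Identity
    title: str
    kind: str
    language: str                # best-effort
    # Rights
    license_spdx: str            # SPDX id, or a license name
    rights_evidence_url: str
    rights_evidence_note: str
    # Decision
    ingest_policy: str
    ethical_review_status: str
    reviewed_by: str
    reviewed_at: str

def missing_fields(record: dict) -> list[str]:
    """Required fields that are absent or empty in a raw record."""
    return [f.name for f in fields(DocumentRecord) if not record.get(f.name)]
```

Intake can then refuse (or flag) any record where `missing_fields` is non-empty, instead of discovering gaps at audit time.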

The workflow

  1. Intake (starts unreviewed)

    • Record where the PDF came from and attach any license/permission link.
  2. Automated checks (fast)

    • Hash + dedupe
    • Identify likely PII/sensitive content (flag, don’t block automatically)
    • Validate that a rights evidence URL is present when proposing ALLOW_FULLTEXT
  3. Human review (the actual decision)

    • Approve full-text, approve metadata-only, escalate for legal review, or reject/remove.
  4. Publish (enforcement)

    • Once approved, proceed to chunk/embed for retrieval.
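Step 2 (automated checks) can be sketched as one function that hashes, dedupes, and flags, but never blocks on its own. Everything here is illustrative; in particular, PII detection is left as a stub:

```python
import hashlib

def automated_checks(record: dict, pdf_bytes: bytes, seen_hashes: set[str]) -> list[str]:
    """Fast pre-review checks. Returns flags for the human reviewer;
    flags never block ingestion automatically."""
    flags = []

    # Hash + dedupe
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    record["pdf_sha256"] = digest
    if digest in seen_hashes:
        flags.append("duplicate")
    seen_hashes.add(digest)

    # ALLOW_FULLTEXT proposals must carry rights evidence
    if record.get("ingest_policy") == "allow_fulltext" and not record.get("rights_evidence_url"):
        flags.append("missing_rights_evidence")

    # PII/sensitive-content detection would append flags here (flag, don't block)
    return flags
```

The human reviewer in step 3 sees the flags alongside the record and makes the actual decision.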

Review status taxonomy

This is what we store under a document’s metadata:

{
  "ethical_review": {
    "status": "approved_fulltext",
    "reviewed_by": "name-or-role",
    "reviewed_at": "2026-02-23T00:00:00Z",
    "notes": "Why this is OK; cite license evidence"
  },
  "rights": {
    "evidence_url": "https://...",
    "evidence_note": "License statement location / screenshot reference"
  }
}

Allowed status values:

  • unreviewed
  • approved_fulltext
  • approved_metadata_only
  • needs_legal_review
  • rejected_remove
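A small validator can enforce the taxonomy against the metadata shape shown above: the status must be one of the allowed values, approved statuses need a reviewer identity and time, and `approved_fulltext` needs rights evidence. A sketch; `validate_review` is a hypothetical name:

```python
ALLOWED_STATUSES = {
    "unreviewed",
    "approved_fulltext",
    "approved_metadata_only",
    "needs_legal_review",
    "rejected_remove",
}

def validate_review(meta: dict) -> list[str]:
    """Problems with a document's ethical_review/rights metadata, if any."""
    problems = []
    review = meta.get("ethical_review", {})
    status = review.get("status", "unreviewed")

    if status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status}")
    if status.startswith("approved") and not (review.get("reviewed_by") and review.get("reviewed_at")):
        problems.append("approved status requires reviewer identity and time")
    if status == "approved_fulltext" and not meta.get("rights", {}).get("evidence_url"):
        problems.append("approved_fulltext requires rights evidence_url")
    return problems
```

Running this at publish time (step 4) makes "approved but unevidenced" records impossible to ship.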

Multiple curated sets (optional, but useful)

We keep separate collections (or at least tags/filters) so we can compare behavior and avoid mixing incompatible rights expectations:

  • Government/policy set (often open government licenses)
  • Standards set (often restrictive; may require metadata-only)
  • Open-access academic set (verify OA terms; don’t assume)
  • Course readings set (high risk; usually metadata-only unless permission exists)
  • Regional sets (EU/US/Canada/Global South) to monitor representation
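If the sets are implemented as tags rather than separate stores, comparison and separation reduce to a filter over a `collections` field. A minimal sketch, assuming each record carries a list of collection tags:

```python
def filter_collection(records: list[dict], collection: str) -> list[dict]:
    """Documents belonging to one curated set (e.g. 'standards', 'gov')."""
    return [r for r in records if collection in r.get("collections", [])]
```

Querying each set separately keeps restrictive sets (e.g. standards, course readings) from silently mixing with open ones, and makes per-region representation easy to measure.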