Policy & curation process
A simple, defensible workflow: intake → automated checks → human review → publish → audit/removal.
This is the workflow we use to keep the corpus license-aware, reviewable, and easy to audit.
The policy (short version)
We’re building a retrieval corpus for AI ethics research.
- We only store and retrieve full text when we have permission evidence (public domain, open license, or explicit permission) that supports full‑text use.
- When rights are unclear, we keep metadata only (title, citation, link) and do not store full text.
- We maintain provenance, allow removal, and audit all inclusion/exclusion decisions.
Ingest policies (what the pipeline is allowed to store)
ALLOW_FULLTEXT
- Allowed when: public domain, open license, or explicit permission.
- Requires: rights evidence URL/note + human review status.
METADATA_ONLY
- Use when: relevance is high but rights/permission are unclear or restrictive.
- Store: citation + link + minimal metadata.
BLOCKED
- Use when: rights are clearly restricted and the content cannot be used, or privacy/safety concerns dominate.
- Action: remove artifacts and derived chunks/embeddings.
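The three policies above can be sketched as a default proposal rule. This is a minimal illustration, not our actual pipeline code; the enum and the `propose_policy` function are hypothetical names, and the real decision is always made by a human reviewer.

```python
from enum import Enum

class IngestPolicy(Enum):
    """Hypothetical enum mirroring the three ingest policies above."""
    ALLOW_FULLTEXT = "allow_fulltext"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"

def propose_policy(has_rights_evidence: bool,
                   rights_restricted: bool,
                   privacy_concern: bool) -> IngestPolicy:
    """Map the rules above onto a default proposal.

    A human reviewer still confirms or overrides this suggestion.
    """
    if rights_restricted or privacy_concern:
        return IngestPolicy.BLOCKED          # remove artifacts + derived data
    if has_rights_evidence:
        return IngestPolicy.ALLOW_FULLTEXT   # evidence supports full-text use
    return IngestPolicy.METADATA_ONLY        # unclear rights: citation + link only
```

Note the ordering: restriction and privacy concerns dominate, so they are checked first.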
Minimum metadata we require
We try to make these fields non-optional, because they’re what make audits possible:
- Provenance: source_url, retrieved_at, pdf_sha256
- Identity: title, kind, language (best-effort)
- Rights: license_spdx (or license name) + rights_evidence_url + rights_evidence_note
- Decision: ingest_policy + ethical_review_status + reviewer identity/time
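As a sketch of what "non-optional" means in practice, here is a hypothetical record type where every audit field is required at construction time (the class name `DocumentRecord` and its exact field set are illustrative, not our schema):

```python
from dataclasses import dataclass

@dataclass
class DocumentRecord:
    """Hypothetical record with the non-optional audit fields listed above."""
    # Provenance
    source_url: str
    retrieved_at: str            # ISO 8601 timestamp
    pdf_sha256: str
    # Identity
    title: str
    kind: str
    language: str                # best-effort
    # Rights
    license_spdx: str            # or a plain license name
    rights_evidence_url: str
    rights_evidence_note: str
    # Decision
    ingest_policy: str
    ethical_review_status: str
    reviewed_by: str
    reviewed_at: str
```

Because none of the fields have defaults, omitting any of them raises a `TypeError`, which is exactly the failure mode we want at intake time.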
The workflow
Intake (starts unreviewed)
- Record where the PDF came from and attach any license/permission link.
Automated checks (fast)
- Hash + dedupe
- Identify likely PII/sensitive content (flag, don’t block automatically)
- Validate that a rights evidence URL is present when proposing ALLOW_FULLTEXT
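The automated checks above can be sketched as a single pass that returns flags rather than verdicts (the function name `automated_checks` is hypothetical; PII detection is stubbed out here because it depends on tooling outside this sketch):

```python
import hashlib

def sha256_of(pdf_bytes: bytes) -> str:
    """Content hash used for provenance (pdf_sha256) and dedupe."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def automated_checks(pdf_bytes: bytes, seen_hashes: set,
                     proposed_policy: str, rights_evidence_url) -> list:
    """Return a list of flags. Flags inform human review; they never auto-approve
    or auto-block (except that missing evidence blocks an ALLOW_FULLTEXT proposal
    from proceeding until fixed)."""
    flags = []
    digest = sha256_of(pdf_bytes)
    if digest in seen_hashes:
        flags.append("duplicate")
    seen_hashes.add(digest)
    if proposed_policy == "ALLOW_FULLTEXT" and not rights_evidence_url:
        flags.append("missing_rights_evidence")
    # PII/sensitive-content detection would append e.g. "possible_pii" here;
    # per policy it only flags for review, it never blocks automatically.
    return flags
```

Keeping the output a flat list of flags makes the check results easy to store alongside the document for later audit.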
Human review (the actual decision)
- Approve full-text, approve metadata-only, needs legal review, or reject/remove.
Publish (enforcement)
- Once approved, proceed to chunk/embed for retrieval.
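Enforcement at publish time can be a one-line gate: nothing is chunked or embedded unless the stored review status says so. A minimal sketch, assuming the metadata layout shown in the review-status taxonomy (the function name `may_publish_fulltext` is hypothetical):

```python
def may_publish_fulltext(metadata: dict) -> bool:
    """Gate chunking/embedding on an explicit approved_fulltext status.

    Anything else, including a missing ethical_review block, fails closed.
    """
    status = metadata.get("ethical_review", {}).get("status")
    return status == "approved_fulltext"
```

Failing closed matters: a document with no review record behaves exactly like a rejected one.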
Review status taxonomy
This is what we store under a document’s metadata:
```json
{
  "ethical_review": {
    "status": "approved_fulltext",
    "reviewed_by": "name-or-role",
    "reviewed_at": "2026-02-23T00:00:00Z",
    "notes": "Why this is OK; cite license evidence"
  },
  "rights": {
    "evidence_url": "https://...",
    "evidence_note": "License statement location / screenshot reference"
  }
}
```
Allowed status values:
- unreviewed
- approved_fulltext
- approved_metadata_only
- needs_legal_review
- rejected_remove
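Since the taxonomy is closed, it is cheap to reject metadata that uses a status outside it. A small sketch (the constant and function names are illustrative):

```python
ALLOWED_STATUSES = {
    "unreviewed",
    "approved_fulltext",
    "approved_metadata_only",
    "needs_legal_review",
    "rejected_remove",
}

def has_valid_review_status(doc_meta: dict) -> bool:
    """True only if the document carries a status from the closed taxonomy."""
    review = doc_meta.get("ethical_review", {})
    return review.get("status") in ALLOWED_STATUSES
```

Running this check at write time keeps typos like "approved" or "ok" from silently bypassing the publish gate.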
Multiple curated sets (optional, but useful)
We keep separate collections (or at least tags/filters) so we can compare behavior and avoid mixing incompatible rights expectations:
- Government/policy set (often open government licenses)
- Standards set (often restrictive; may require metadata-only)
- Open-access academic set (verify OA terms; don’t assume)
- Course readings set (high risk; usually metadata-only unless permission exists)
- Regional sets (EU/US/Canada/Global South) to monitor representation
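If the sets are implemented as tags rather than separate collections, retrieval can combine the set filter with the rights gate in one pass. A minimal sketch, assuming documents carry a `tags` list and the review metadata shown above (the function name `retrievable_docs` and the tag values are hypothetical):

```python
def retrievable_docs(docs: list, collection: str) -> list:
    """Restrict retrieval to one curated set, and only serve full text
    where the stored review status allows it."""
    return [
        d for d in docs
        if collection in d.get("tags", [])
        and d.get("ethical_review", {}).get("status") == "approved_fulltext"
    ]
```

This is what keeps, say, a metadata-only standards document from leaking full text into a query scoped to the standards set.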