PDF inventory dashboard · Docs · AIethicsChat

A searchable, auditable inventory of PDFs with provenance + rights evidence (no “not copyrighted” claims).

We built a searchable, auditable inventory of all PDFs in the corpus so a reviewer (or the class) can answer:

How did we collect this PDF? (pipeline + ingest policy)
Where did it come from? (publisher/source URL + storage URI/path)
What is it? (title, kind, language, hash, size)
What rights/ethics status do we have evidence for? (license, evidence link, review status)

The inventory is designed for a retrieval-augmented generation (RAG) corpus builder. In our project, “training dataset” includes:

the retrieval corpus (PDFs + extracted text), and
optionally an evaluation dataset (queries + expected sources).

Key principle (important)

The system should not label items as “not copyrighted”.

Instead, it should display:

whether the item is public domain or open-licensed, based on evidence, and
whether the project has approved it for full-text storage (ALLOW_FULLTEXT) vs metadata-only (METADATA_ONLY).
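
As a minimal sketch of this principle, the dashboard could map each recorded rights basis to a display label that describes the evidence rather than asserting copyright status. The label wording and the `rights_label` helper are hypothetical; only the `rights_basis` values come from the taxonomy in this document.

```python
# Hypothetical display labels keyed by the rights_basis values in this document.
# None of them claims an item is "not copyrighted".
RIGHTS_LABELS = {
    "public_domain_or_dedication": "Public domain / dedication",
    "open_license": "Open license",
    "claimed_license_unverified": "License claimed, not yet verified",
    "metadata_only": "Restricted: metadata only",
    "blocked": "Blocked",
    "unknown": "Rights unknown",
}


def rights_label(rights_basis: str) -> str:
    """Return a display label; unrecognized values fall back to 'Rights unknown'."""
    return RIGHTS_LABELS.get(rights_basis, RIGHTS_LABELS["unknown"])


print(rights_label("open_license"))
print(rights_label("some_unexpected_value"))
```

Falling back to “Rights unknown” keeps the display conservative when a value is missing or misspelled.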

Inventory export format

The pipeline exports:

artifacts/manifests/pdf_inventory/pdf_inventory.csv
(optional) artifacts/manifests/pdf_inventory/pdf_inventory.jsonl

Both are generated by scripts/export_pdf_inventory.py.
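
A reviewer could load the CSV export with the standard library, roughly as sketched below. The sample rows are invented for illustration, and the sketch assumes booleans are serialized as strings such as `true`/`false`; the actual encoding used by the export script is not specified here.

```python
import csv
import io
from typing import Iterable, Iterator

# Assumption: boolean columns are written as one of these strings.
TRUE_STRINGS = {"true", "1", "yes"}


def read_inventory(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per inventory row, keyed by the CSV header."""
    yield from csv.DictReader(lines)


def fulltext_candidates(rows: Iterable[dict]) -> list:
    """Keep rows whose policy and rights flag both permit full-text storage."""
    return [
        r for r in rows
        if r.get("ingest_policy") == "ALLOW_FULLTEXT"
        and (r.get("rights_clear_for_fulltext") or "").strip().lower() in TRUE_STRINGS
    ]


# Invented sample data standing in for pdf_inventory.csv.
SAMPLE = """title,ingest_policy,rights_clear_for_fulltext
Report A,ALLOW_FULLTEXT,true
Report B,METADATA_ONLY,false
"""

rows = list(read_inventory(io.StringIO(SAMPLE)))
candidates = fulltext_candidates(rows)
print([r["title"] for r in candidates])
```

To read the real export, pass an open file handle instead of the `io.StringIO` sample, for example `open("artifacts/manifests/pdf_inventory/pdf_inventory.csv", newline="", encoding="utf-8")`.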

Recommended dashboard fields

Minimum “search + filter” fields:

title, kind, source_url, external_id
license_spdx, license_is_open, rights_basis, rights_clear_for_fulltext
ingest_policy, ethical_review_status

Audit/support fields:

pdf_sha256, pdf_storage_uri, retrieved_at, updated_at
rights_evidence_url, rights_evidence_note
ethical_reviewed_by, ethical_reviewed_at, ethical_review_notes

Quality/progress fields:

chunks_count, embeddings_384_count, embeddings_384_complete

Status taxonomy (display labels)

Ingest policy (what the pipeline is allowed to store)

ALLOW_FULLTEXT: full text may be stored and chunked (requires a recorded rights basis)
METADATA_ONLY: keep bibliographic metadata + links only; do not store full text
BLOCKED: exclude from corpus; remove artifacts and derived chunks/embeddings

Rights basis (what permission claim we currently have)

public_domain_or_dedication: CC0 or an equivalent public-domain dedication
open_license: an open license is recorded (e.g., CC-BY)
claimed_license_unverified: a license string exists, but no registry entry or evidence supports it
metadata_only: explicitly restricted to metadata-only storage
blocked: explicitly blocked
unknown: no rights info recorded yet

Ethical review status (human decision)

unreviewed
approved_fulltext
approved_metadata_only
needs_legal_review
rejected_remove
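
One way to keep these three taxonomies consistent between the pipeline and the dashboard is to define them as enumerations. The sketch below is illustrative: the class names and the `parse_rights_basis` helper are assumptions, and mapping unrecognized values to `unknown` is a conservative choice of this sketch rather than documented behavior.

```python
from enum import Enum
from typing import Optional


class IngestPolicy(str, Enum):
    ALLOW_FULLTEXT = "ALLOW_FULLTEXT"
    METADATA_ONLY = "METADATA_ONLY"
    BLOCKED = "BLOCKED"


class RightsBasis(str, Enum):
    PUBLIC_DOMAIN_OR_DEDICATION = "public_domain_or_dedication"
    OPEN_LICENSE = "open_license"
    CLAIMED_LICENSE_UNVERIFIED = "claimed_license_unverified"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


class EthicalReviewStatus(str, Enum):
    UNREVIEWED = "unreviewed"
    APPROVED_FULLTEXT = "approved_fulltext"
    APPROVED_METADATA_ONLY = "approved_metadata_only"
    NEEDS_LEGAL_REVIEW = "needs_legal_review"
    REJECTED_REMOVE = "rejected_remove"


def parse_rights_basis(raw: Optional[str]) -> RightsBasis:
    """Parse a stored value; anything unrecognized or empty becomes UNKNOWN."""
    try:
        return RightsBasis((raw or "").strip())
    except ValueError:
        return RightsBasis.UNKNOWN


print(parse_rights_basis("open_license").value)
print(parse_rights_basis("CC-BY?").value)
```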

What “affirmed” should mean

If the UI must show an “affirmed” badge, we treat it as:

Affirmed for full-text means all of the following hold:

ingest_policy == ALLOW_FULLTEXT, AND
rights_clear_for_fulltext == true, AND
ethical_review_status == approved_fulltext, AND
rights_evidence_url is present (or equivalent evidence recorded)

Everything else should be shown as unverified / pending review.
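
The four conditions above can be expressed as a single predicate. This is a sketch only: the function names are hypothetical, the row is assumed to be a dict with a real boolean for `rights_clear_for_fulltext` (as a JSONL export might provide), and it checks only `rights_evidence_url`, not the “equivalent evidence” alternative.

```python
from typing import Mapping


def is_affirmed_for_fulltext(row: Mapping[str, object]) -> bool:
    """Return True only when all four 'affirmed' conditions hold."""
    evidence_url = str(row.get("rights_evidence_url") or "").strip()
    return (
        row.get("ingest_policy") == "ALLOW_FULLTEXT"
        and row.get("rights_clear_for_fulltext") is True
        and row.get("ethical_review_status") == "approved_fulltext"
        and bool(evidence_url)
    )


def badge(row: Mapping[str, object]) -> str:
    """Choose the badge text; anything not affirmed is shown as pending."""
    if is_affirmed_for_fulltext(row):
        return "affirmed for full-text"
    return "unverified / pending review"


# Invented example rows.
approved = {
    "ingest_policy": "ALLOW_FULLTEXT",
    "rights_clear_for_fulltext": True,
    "ethical_review_status": "approved_fulltext",
    "rights_evidence_url": "https://example.org/license",
}
missing_evidence = {**approved, "rights_evidence_url": ""}

print(badge(approved))
print(badge(missing_evidence))
```

Requiring every condition, rather than any one of them, means a single missing field is enough to withhold the badge.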

How this fits into how we built the corpus

In our pipeline, PDFs come from a mix of sources (web downloads, official reports, and sometimes local course folders). The inventory lets us separate:

Provenance (we can prove what file we have and where it came from), from
Rights evidence (we can justify what we’re allowed to store and retrieve).

That’s why the default stance is conservative: if rights evidence is missing or unclear, we keep the item metadata-only until reviewed.
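
The conservative default could be sketched as a small decision function. The function name and the exact rules are assumptions for illustration. In particular, treating a `blocked` rights basis or a `rejected_remove` review as `BLOCKED`, and requiring an approved review before `ALLOW_FULLTEXT`, are choices made for this sketch rather than documented pipeline behavior.

```python
from typing import Optional

# Rights bases that can support full-text storage, per the taxonomy above.
FULLTEXT_BASES = {"public_domain_or_dedication", "open_license"}


def default_ingest_policy(
    rights_basis: str,
    rights_evidence_url: Optional[str],
    ethical_review_status: str,
) -> str:
    """Pick an ingest policy, defaulting to METADATA_ONLY when in doubt."""
    if rights_basis == "blocked" or ethical_review_status == "rejected_remove":
        return "BLOCKED"
    has_basis = rights_basis in FULLTEXT_BASES
    has_evidence = bool((rights_evidence_url or "").strip())
    if has_basis and has_evidence and ethical_review_status == "approved_fulltext":
        return "ALLOW_FULLTEXT"
    # Missing or unclear evidence, or no completed review: stay metadata-only.
    return "METADATA_ONLY"


print(default_ingest_policy("open_license", None, "unreviewed"))
print(default_ingest_policy("open_license", "https://example.org/license", "approved_fulltext"))
```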