PDF inventory dashboard
A searchable, auditable inventory of PDFs with provenance + rights evidence (no “not copyrighted” claims).
The inventory covers every PDF in the corpus so that a reviewer (or the class) can answer:
- How did we collect this PDF? (pipeline + ingest policy)
- Where did it come from? (publisher/source URL + storage URI/path)
- What is it? (title, kind, language, hash, size)
- What rights/ethics status do we have evidence for? (license, evidence link, review status)
This is designed for a RAG corpus builder. In our project, “training dataset” includes:
- the retrieval corpus (PDFs + extracted text), and
- optionally an evaluation dataset (queries + expected sources).
Key principle (important)
The system should not label items as “not copyrighted”.
Instead, it should display:
- whether the item is public domain or open-licensed, based on evidence, and
- whether the project has approved it for full-text storage (ALLOW_FULLTEXT) vs metadata-only (METADATA_ONLY).
Copyright-safe use is about permission/licensing evidence, not “copyright doesn’t exist”.
Inventory export format
The pipeline exports:
- artifacts/manifests/pdf_inventory/pdf_inventory.csv
- (optional) artifacts/manifests/pdf_inventory/pdf_inventory.jsonl
Both are generated by scripts/export_pdf_inventory.py.
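As a minimal sketch of how a dashboard might read these exports, the snippet below parses both formats with the standard library. It assumes the CSV has a header row named after the fields listed in this document and that the JSONL file holds one JSON object per line; neither detail is confirmed here. The function names and the sample values are hypothetical.

```python
import csv
import io
import json
from typing import TextIO


def read_inventory_csv(stream: TextIO) -> list[dict[str, str]]:
    """Parse the CSV export; every value comes back as a string."""
    return list(csv.DictReader(stream))


def read_inventory_jsonl(stream: TextIO) -> list[dict]:
    """Parse the JSONL export: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


# Tiny in-memory samples standing in for the real files (values are made up).
sample_csv = io.StringIO(
    "title,kind,ingest_policy\n"
    "Example report,report,METADATA_ONLY\n"
)
sample_jsonl = io.StringIO(
    '{"title": "Example report", "chunks_count": 0}\n'
)

csv_rows = read_inventory_csv(sample_csv)
jsonl_rows = read_inventory_jsonl(sample_jsonl)
print(csv_rows[0]["ingest_policy"])   # METADATA_ONLY
print(jsonl_rows[0]["chunks_count"])  # 0
```

With the real files, the same functions can be called on a handle from open(path, newline="", encoding="utf-8"). Note that CSV values are always strings, so boolean and numeric columns need explicit conversion, whereas JSONL keeps their types.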
Recommended dashboard fields
Minimum “search + filter” fields:
- title, kind, source_url, external_id
- license_spdx, license_is_open, rights_basis, rights_clear_for_fulltext
- ingest_policy, ethical_review_status
Audit/support fields:
- pdf_sha256, pdf_storage_uri, retrieved_at, updated_at
- rights_evidence_url, rights_evidence_note
- ethical_reviewed_by, ethical_reviewed_at, ethical_review_notes
Quality/progress fields:
- chunks_count, embeddings_384_count, embeddings_384_complete
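To illustrate what the "search + filter" fields enable, here is a small sketch of a filter over inventory rows: a case-insensitive title search combined with exact matches on other fields. The function filter_inventory and the sample rows (including the kind values) are hypothetical; a real dashboard would likely delegate this to its own query layer.

```python
from typing import Iterable


def filter_inventory(rows: Iterable[dict], title_contains: str = "", **equals: str) -> list[dict]:
    """Keep rows whose title contains the search text and whose fields equal the given values."""
    needle = title_contains.casefold()
    matches = []
    for row in rows:
        # Case-insensitive substring search on the title field.
        if needle and needle not in str(row.get("title", "")).casefold():
            continue
        # Exact-match filters on any other field, compared as strings.
        if all(str(row.get(field, "")) == wanted for field, wanted in equals.items()):
            matches.append(row)
    return matches


# Made-up rows for illustration only.
rows = [
    {"title": "Annual Water Report", "kind": "report", "ingest_policy": "ALLOW_FULLTEXT"},
    {"title": "Water Law Textbook", "kind": "book", "ingest_policy": "METADATA_ONLY"},
    {"title": "Course Syllabus", "kind": "syllabus", "ingest_policy": "METADATA_ONLY"},
]

hits = filter_inventory(rows, title_contains="water", ingest_policy="METADATA_ONLY")
print([row["title"] for row in hits])  # ['Water Law Textbook']
```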
Status taxonomy (display labels)
Ingest policy (what the pipeline is allowed to store)
- ALLOW_FULLTEXT: full text may be stored and chunked (requires rights basis)
- METADATA_ONLY: keep bibliographic metadata + links only; avoid full text
- BLOCKED: exclude from corpus; remove artifacts and derived chunks/embeddings
Rights basis (what permission claim we currently have)
- public_domain_or_dedication: CC0 / public domain style basis
- open_license: an open license recorded (e.g., CC-BY)
- claimed_license_unverified: license string exists but no registry/evidence
- metadata_only: explicitly restricted
- blocked: explicitly blocked
- unknown: no rights info recorded yet
Ethical review status (human decision)
- unreviewed
- approved_fulltext
- approved_metadata_only
- needs_legal_review
- rejected_remove
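One way to keep these labels consistent across the exporter and the dashboard is to encode the taxonomy as enums, sketched below. The class names and the helper parse_rights_basis are hypothetical; in particular, mapping blank or unrecognised values to unknown is an assumption, chosen so that a typo never reads as a rights claim.

```python
from enum import Enum


class IngestPolicy(str, Enum):
    ALLOW_FULLTEXT = "ALLOW_FULLTEXT"
    METADATA_ONLY = "METADATA_ONLY"
    BLOCKED = "BLOCKED"


class RightsBasis(str, Enum):
    PUBLIC_DOMAIN_OR_DEDICATION = "public_domain_or_dedication"
    OPEN_LICENSE = "open_license"
    CLAIMED_LICENSE_UNVERIFIED = "claimed_license_unverified"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


class EthicalReviewStatus(str, Enum):
    UNREVIEWED = "unreviewed"
    APPROVED_FULLTEXT = "approved_fulltext"
    APPROVED_METADATA_ONLY = "approved_metadata_only"
    NEEDS_LEGAL_REVIEW = "needs_legal_review"
    REJECTED_REMOVE = "rejected_remove"


def parse_rights_basis(raw: str) -> RightsBasis:
    """Map a raw cell to a RightsBasis; blanks and unrecognised values become UNKNOWN (assumption)."""
    try:
        return RightsBasis(raw.strip())
    except ValueError:
        return RightsBasis.UNKNOWN


print(parse_rights_basis("open_license").value)  # open_license
print(parse_rights_basis("").value)              # unknown
```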
What “affirmed” should mean
If the UI must show an “affirmed” badge, we treat it as:
Affirmed for full-text =
- ingest_policy == ALLOW_FULLTEXT, AND
- rights_clear_for_fulltext == true, AND
- ethical_review_status == approved_fulltext, AND
- rights_evidence_url is present (or equivalent evidence recorded)
Everything else should be shown as unverified / pending review.
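The rule above can be sketched as a single predicate over one inventory row. This is an illustration, not the project's implementation: the function names are hypothetical, the accepted truthy strings ("true", "1", "yes") are an assumption about how the CSV encodes booleans, and the "equivalent evidence" alternative is not modelled, so only rights_evidence_url counts as evidence here.

```python
def _is_true(value: object) -> bool:
    """Accept real booleans (JSONL) and common truthy strings (CSV); the string set is an assumption."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}


def is_affirmed_for_fulltext(row: dict) -> bool:
    """All four conditions must hold; any missing field fails the check."""
    return (
        row.get("ingest_policy") == "ALLOW_FULLTEXT"
        and _is_true(row.get("rights_clear_for_fulltext", False))
        and row.get("ethical_review_status") == "approved_fulltext"
        and bool(str(row.get("rights_evidence_url") or "").strip())
    )


def badge(row: dict) -> str:
    """Display label: anything short of fully affirmed is shown as unverified / pending review."""
    return "affirmed for full-text" if is_affirmed_for_fulltext(row) else "unverified / pending review"


# Made-up rows; example.org is a placeholder domain.
approved = {
    "ingest_policy": "ALLOW_FULLTEXT",
    "rights_clear_for_fulltext": "true",
    "ethical_review_status": "approved_fulltext",
    "rights_evidence_url": "https://example.org/license",
}
missing_evidence = {**approved, "rights_evidence_url": ""}

print(badge(approved))          # affirmed for full-text
print(badge(missing_evidence))  # unverified / pending review
```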
How this fits into how we built the corpus
In our pipeline, PDFs come from a mix of sources (web downloads, official reports, and sometimes local course folders). The inventory lets us separate:
- Provenance (we can prove what file we have and where it came from), from
- Rights evidence (we can justify what we’re allowed to store and retrieve).
That’s why the default stance is conservative: if rights evidence is missing or unclear, we keep the item metadata-only until reviewed.
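A minimal sketch of that conservative default, applied before human review: a requested full-text policy is only honoured when the rights basis is clear and an evidence link is recorded, and everything else falls back to metadata-only. The function name is hypothetical, and treating only public_domain_or_dedication and open_license as "clear" is an assumption drawn from the rights basis labels in this document.

```python
# Rights bases treated as clear enough for full text (assumption).
CLEAR_BASES = {"public_domain_or_dedication", "open_license"}


def effective_ingest_policy(requested: str, rights_basis: str, rights_evidence_url: str = "") -> str:
    """Downgrade to METADATA_ONLY unless full text is requested, the basis is clear, and evidence exists."""
    if requested == "BLOCKED":
        # A block is never upgraded.
        return "BLOCKED"
    has_evidence = bool(rights_evidence_url.strip())
    if requested == "ALLOW_FULLTEXT" and rights_basis in CLEAR_BASES and has_evidence:
        return "ALLOW_FULLTEXT"
    return "METADATA_ONLY"


# example.org is a placeholder domain.
print(effective_ingest_policy("ALLOW_FULLTEXT", "open_license", "https://example.org/license"))
# ALLOW_FULLTEXT
print(effective_ingest_policy("ALLOW_FULLTEXT", "claimed_license_unverified"))
# METADATA_ONLY
print(effective_ingest_policy("ALLOW_FULLTEXT", "unknown", "https://example.org/license"))
# METADATA_ONLY
```

Failing closed in this way means a gap in the evidence costs retrieval coverage rather than creating a rights problem, and a reviewer can later lift the item to full text once evidence is recorded.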