PDF inventory dashboard

A searchable, auditable inventory of PDFs with provenance and rights evidence. It never claims that an item is “not copyrighted”.

We built a searchable, auditable inventory of all PDFs in the corpus so a reviewer (or the class) can answer:

  • How did we collect this PDF? (pipeline + ingest policy)
  • Where did it come from? (publisher/source URL + storage URI/path)
  • What is it? (title, kind, language, hash, size)
  • What rights/ethics status do we have evidence for? (license, evidence link, review status)

The inventory is designed for a retrieval-augmented generation (RAG) corpus builder. In our project, “training dataset” includes:

  • the retrieval corpus (PDFs + extracted text), and
  • optionally an evaluation dataset (queries + expected sources).

Key principle

The system should not label items as “not copyrighted”.

Instead, it should display:

  • whether the item is public domain or open-licensed, based on evidence, and
  • whether the project has approved it for full-text storage (ALLOW_FULLTEXT) vs metadata-only (METADATA_ONLY).

Copyright-safe use rests on evidence of permission or licensing, not on a claim that copyright does not exist.
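As a rough sketch of this principle, a dashboard could derive its rights label from the recorded rights basis and evidence link only. The function name `rights_display_label` and the label wording are hypothetical, and we assume each inventory row is a dict keyed by the field names used in this document.

```python
from typing import Mapping

# Hypothetical display strings; the exact wording is an assumption.
RIGHTS_LABELS = {
    "public_domain_or_dedication": "Public domain / dedication (evidence recorded)",
    "open_license": "Open-licensed (evidence recorded)",
}
FALLBACK_LABEL = "Rights unverified"


def rights_display_label(row: Mapping[str, str]) -> str:
    """Return an evidence-based label; never claims 'not copyrighted'."""
    has_evidence = bool((row.get("rights_evidence_url") or "").strip())
    label = RIGHTS_LABELS.get(row.get("rights_basis", ""))
    # Without a recognized basis AND an evidence link, stay conservative.
    if label is None or not has_evidence:
        return FALLBACK_LABEL
    return label
```

Any row without both a recognized rights basis and an evidence link falls back to the neutral label, so the UI has no code path that asserts the absence of copyright.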

Inventory export format

The pipeline exports:

  • artifacts/manifests/pdf_inventory/pdf_inventory.csv
  • (optional) artifacts/manifests/pdf_inventory/pdf_inventory.jsonl

Both are generated by scripts/export_pdf_inventory.py.
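A minimal sketch of how a dashboard backend might read the CSV export. It assumes the file has a header row whose column names match the field names in this document and that booleans are written as text such as `true` or `false`. The helpers `load_inventory` and `parse_bool` are hypothetical and are not part of scripts/export_pdf_inventory.py.

```python
import csv
import io
from typing import Dict, Iterable, List

# Assumed text forms of a true boolean in the CSV export.
TRUE_STRINGS = {"true", "1", "yes"}


def parse_bool(value) -> bool:
    """Interpret a CSV cell as a boolean; empty or missing cells are False."""
    return (value or "").strip().lower() in TRUE_STRINGS


def load_inventory(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read inventory rows from any iterable of CSV lines (file or StringIO)."""
    return [dict(row) for row in csv.DictReader(lines)]


# Inline sample standing in for pdf_inventory.csv (values are made up).
sample = io.StringIO(
    "title,ingest_policy,rights_clear_for_fulltext\n"
    "Example report,ALLOW_FULLTEXT,true\n"
)
rows = load_inventory(sample)
print(rows[0]["title"], parse_bool(rows[0]["rights_clear_for_fulltext"]))
```

For the real export, the same function can take an open file handle, for example one opened with `newline=""` as the csv module recommends.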

Recommended dashboard fields

Minimum “search + filter” fields:

  • title, kind, source_url, external_id
  • license_spdx, license_is_open, rights_basis, rights_clear_for_fulltext
  • ingest_policy, ethical_review_status

Audit/support fields:

  • pdf_sha256, pdf_storage_uri, retrieved_at, updated_at
  • rights_evidence_url, rights_evidence_note
  • ethical_reviewed_by, ethical_reviewed_at, ethical_review_notes
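The `pdf_sha256` field is what lets an auditor confirm that a stored file is the one the inventory describes. A small sketch of that check follows, assuming the field holds a hex-encoded SHA-256 digest and that the caller has already resolved `pdf_storage_uri` to a local path. Both function names are hypothetical.

```python
import hashlib
from typing import Mapping


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_inventory_hash(path: str, row: Mapping[str, str]) -> bool:
    """True only if the row records a hash and the file on disk matches it."""
    recorded = (row.get("pdf_sha256") or "").strip().lower()
    return bool(recorded) and sha256_of_file(path) == recorded
```

A row with no recorded hash is treated as a mismatch rather than a pass, which keeps the audit conservative.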

Quality/progress fields:

  • chunks_count, embeddings_384_count, embeddings_384_complete
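To illustrate how the “search + filter” fields might be used together, here is a minimal in-memory sketch. It assumes rows are dicts of strings keyed by the field names above. The function `search_inventory` and the choice of which fields free-text search covers are assumptions, not the project's implementation.

```python
from typing import Dict, Iterable, List

# Fields covered by free-text search (an assumed choice).
SEARCH_FIELDS = ("title", "source_url", "external_id")


def search_inventory(
    rows: Iterable[Dict[str, str]], text: str = "", **filters: str
) -> List[Dict[str, str]]:
    """Case-insensitive substring search plus exact-match field filters."""
    needle = text.strip().lower()
    matches = []
    for row in rows:
        # Every filter must match exactly, e.g. ingest_policy="METADATA_ONLY".
        if any(row.get(field, "") != wanted for field, wanted in filters.items()):
            continue
        # An empty search string matches everything.
        if needle and not any(
            needle in (row.get(field) or "").lower() for field in SEARCH_FIELDS
        ):
            continue
        matches.append(row)
    return matches
```

For example, `search_inventory(rows, "report", ethical_review_status="unreviewed")` would list unreviewed items whose title, source URL, or external ID mentions “report”.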

Status taxonomy (display labels)

Ingest policy (what the pipeline is allowed to store)

  • ALLOW_FULLTEXT: full text may be stored and chunked (requires rights basis)
  • METADATA_ONLY: keep bibliographic metadata + links only; do not store full text
  • BLOCKED: exclude from corpus; remove artifacts and derived chunks/embeddings

Rights basis (what permission claim we currently have)

  • public_domain_or_dedication: a CC0 dedication or public-domain status
  • open_license: an open license is recorded (e.g., CC-BY)
  • claimed_license_unverified: a license string is recorded, but no registry entry or supporting evidence backs it
  • metadata_only: explicitly restricted
  • blocked: explicitly blocked
  • unknown: no rights info recorded yet

Ethical review status (human decision)

  • unreviewed
  • approved_fulltext
  • approved_metadata_only
  • needs_legal_review
  • rejected_remove
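One way to keep these labels consistent between the export and the UI is to encode the taxonomy as enums and flag any row that uses a value outside it. The sketch below assumes the export stores the labels exactly as written above. The class names and `validate_row` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Mapping, Type


class IngestPolicy(str, Enum):
    ALLOW_FULLTEXT = "ALLOW_FULLTEXT"
    METADATA_ONLY = "METADATA_ONLY"
    BLOCKED = "BLOCKED"


class RightsBasis(str, Enum):
    PUBLIC_DOMAIN_OR_DEDICATION = "public_domain_or_dedication"
    OPEN_LICENSE = "open_license"
    CLAIMED_LICENSE_UNVERIFIED = "claimed_license_unverified"
    METADATA_ONLY = "metadata_only"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


class EthicalReviewStatus(str, Enum):
    UNREVIEWED = "unreviewed"
    APPROVED_FULLTEXT = "approved_fulltext"
    APPROVED_METADATA_ONLY = "approved_metadata_only"
    NEEDS_LEGAL_REVIEW = "needs_legal_review"
    REJECTED_REMOVE = "rejected_remove"


# Inventory field name -> enum that lists its allowed values.
FIELD_ENUMS: Dict[str, Type[Enum]] = {
    "ingest_policy": IngestPolicy,
    "rights_basis": RightsBasis,
    "ethical_review_status": EthicalReviewStatus,
}


def validate_row(row: Mapping[str, str]) -> List[str]:
    """Return one message per status field whose value is outside the taxonomy."""
    problems = []
    for field, enum_cls in FIELD_ENUMS.items():
        value = row.get(field, "")
        if value not in {member.value for member in enum_cls}:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems
```

An empty list means all three status fields use known labels; anything else can be surfaced to a reviewer instead of being silently displayed.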

What “affirmed” should mean

If the UI must show an “affirmed” badge, we treat it as:

Affirmed for full-text =

  • ingest_policy == ALLOW_FULLTEXT, AND
  • rights_clear_for_fulltext == true, AND
  • ethical_review_status == approved_fulltext, AND
  • rights_evidence_url is present (or equivalent evidence recorded)

Everything else should be shown as unverified / pending review.
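The four conditions above can be sketched as a single predicate. This assumes rows are dicts keyed by the inventory field names and that `rights_clear_for_fulltext` may arrive as a boolean or as text such as `true`. The sketch checks only `rights_evidence_url`; how “equivalent evidence” is recorded is left open here. The function names are hypothetical.

```python
from typing import Mapping


def _is_true(value) -> bool:
    """Accept a real boolean or an assumed text form such as 'true'."""
    return str(value or "").strip().lower() in {"true", "1", "yes"}


def is_affirmed_for_fulltext(row: Mapping[str, object]) -> bool:
    """All four conditions must hold; any missing field counts as a failure."""
    return (
        row.get("ingest_policy") == "ALLOW_FULLTEXT"
        and _is_true(row.get("rights_clear_for_fulltext"))
        and row.get("ethical_review_status") == "approved_fulltext"
        and bool(str(row.get("rights_evidence_url") or "").strip())
    )


def badge_label(row: Mapping[str, object]) -> str:
    """Map the predicate onto the two display states described above."""
    if is_affirmed_for_fulltext(row):
        return "Affirmed for full-text"
    return "Unverified / pending review"
```

Because the predicate is a strict conjunction, removing the evidence link or resetting the review status immediately drops the badge back to the pending state.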

How this fits into how we built the corpus

In our pipeline, PDFs come from a mix of sources (web downloads, official reports, and sometimes local course folders). The inventory lets us separate:

  • Provenance (we can prove what file we have and where it came from), from
  • Rights evidence (we can justify what we’re allowed to store and retrieve).

That’s why the default stance is conservative: if rights evidence is missing or unclear, we keep the item metadata-only until reviewed.
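A sketch of that conservative default, written as a function that decides what the pipeline may actually store for a row. It assumes rows are dicts keyed by the inventory field names. The function `effective_ingest_policy`, the set of rights bases treated as clear, and the choice to map `rejected_remove` to BLOCKED are our assumptions for illustration.

```python
from typing import Mapping

# Rights bases treated as clear enough for full text (an assumed choice).
CLEAR_RIGHTS_BASES = {"public_domain_or_dedication", "open_license"}


def effective_ingest_policy(row: Mapping[str, str]) -> str:
    """Downgrade to METADATA_ONLY unless rights evidence and review are in place."""
    requested = row.get("ingest_policy", "")
    review = row.get("ethical_review_status", "")

    # Explicit blocks and rejections always win.
    if requested == "BLOCKED" or review == "rejected_remove":
        return "BLOCKED"

    has_evidence = bool((row.get("rights_evidence_url") or "").strip())
    if (
        requested == "ALLOW_FULLTEXT"
        and row.get("rights_basis") in CLEAR_RIGHTS_BASES
        and has_evidence
        and review == "approved_fulltext"
    ):
        return "ALLOW_FULLTEXT"

    # Missing or unclear evidence: keep metadata only until reviewed.
    return "METADATA_ONLY"
```

With this shape, a row marked ALLOW_FULLTEXT but lacking an evidence link or an approval is stored as metadata only, and nothing becomes full text by default.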