Classroom evaluation set template

A small CSV format for building repeatable retrieval tests with expected sources.

We use this template when building a classroom evaluation set for retrieval.

The idea is simple: students propose questions, and we record what sources should be retrieved for those questions. That gives us repeatable tests for:

  • retrieval quality (Recall@k / nDCG@k)
  • coverage gaps (by domain, jurisdiction, language)
  • drift over time (does retrieval change after ingestion updates?)
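The two retrieval-quality metrics above can be sketched as follows. This is a minimal illustration, assuming binary relevance (a result is relevant iff it is one of the expected sources); the function names and signatures are ours, not from any particular library.

```python
import math

def recall_at_k(expected, retrieved, k):
    """Fraction of expected sources that appear in the top-k results."""
    top = set(retrieved[:k])
    return sum(1 for e in expected if e in top) / len(expected)

def ndcg_at_k(expected, retrieved, k):
    """Binary-relevance nDCG: expected sources have gain 1, everything else 0."""
    relevant = set(expected)
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), k)))
    return dcg / ideal if ideal else 0.0
```

With these, a drift check is just running the same query set against two ingestion snapshots and comparing the per-query scores.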

Fields (what each column means)

  • query: the question students will ask.
  • expected_sources: one or more source URLs/IDs that must appear in the top‑k retrieval (separate multiple sources with ;).
  • domain: policy / standards / academic / course / other.
  • jurisdiction: the relevant region (e.g., NG, EU, US, Global).
  • language: best-effort, e.g. en, fr.
  • difficulty: easy / medium / hard.
  • notes: why the query matters, edge cases, grading guidance.

The CSV header

Copy this header row into the first line of your .csv file:

query,expected_sources,domain,jurisdiction,language,difficulty,notes
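Rows with multiple expected sources need the `;` separator split out before scoring. A minimal loading sketch with Python's standard `csv` module, using an invented example row (the query and source IDs are hypothetical, not from a real eval set):

```python
import csv
import io

# Header matches the template; the data row is a made-up example.
SAMPLE = """\
query,expected_sources,domain,jurisdiction,language,difficulty,notes
What does the regulation say about consent?,reg-2019;agency.example/reg,policy,NG,en,medium,Checks jurisdiction-specific retrieval
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    # expected_sources holds one or more source IDs separated by ";"
    row["expected_sources"] = row["expected_sources"].split(";")

print(rows[0]["expected_sources"])  # -> ['reg-2019', 'agency.example/reg']
```

For a real file, replace `io.StringIO(SAMPLE)` with `open(path, newline="", encoding="utf-8")`.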

How we recommend using it

  1. Start with 50–200 queries.
  2. Require each query to include at least 1 expected source.
  3. Tag each query (domain/jurisdiction/language) so you can slice results and see what the corpus is missing.
  4. Keep the file under version control so the class can see what changed and why.
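Steps 2 and 3 are easy to enforce mechanically before each class merges changes. A sketch of such a check, assuming the column names and allowed values listed above (the helper names are ours):

```python
import csv

ALLOWED_DOMAINS = {"policy", "standards", "academic", "course", "other"}
ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}

def validate_row(row, line_no):
    """Return a list of problems for one eval-set row (empty if clean)."""
    problems = []
    if not row["query"].strip():
        problems.append(f"line {line_no}: empty query")
    sources = [s for s in row["expected_sources"].split(";") if s.strip()]
    if not sources:
        problems.append(f"line {line_no}: needs at least 1 expected source")
    if row["domain"] not in ALLOWED_DOMAINS:
        problems.append(f"line {line_no}: unknown domain {row['domain']!r}")
    if row["difficulty"] not in ALLOWED_DIFFICULTY:
        problems.append(f"line {line_no}: unknown difficulty {row['difficulty']!r}")
    return problems

def validate_file(path):
    """Validate every row of the eval-set CSV; return all problems found."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # header is line 1
            problems.extend(validate_row(row, i))
    return problems
```

Running this in CI (or a pre-commit hook) keeps the version-controlled file clean as students add queries.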