Golden Eval Dataset: Complex Museum Questions
This document defines the golden dataset used for AI/LLM reliability checks on grounding and citation behavior.
Dataset file:
- `evals/golden-museum-questions.v1.json`
Scope
- Minimum prompt count: `100` (current: `120`)
- Focus: complex museum/research questions where weak grounding carries high risk
- Required behavior: `cite-or-refuse`
Prompt contract
Each prompt row includes:
- stable `id` (`MM-EVAL-xxxx`)
- `domain` and `difficulty`
- full natural-language `prompt`
- `expectedBehavior` with:
  - grounding requirements
  - per-claim citation requirements
  - refusal behavior when evidence is missing
  - explicit failure behaviors (`mustNot`)
Citation expectations are strict:
- per-claim citations
- entity identifier references
- property-path scoped references
- source URL and retrieval timestamp
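A minimal sketch of how one prompt row could be typed and checked for the required fields. Only `id`, `domain`, `difficulty`, `prompt`, `expectedBehavior`, and `mustNot` are named in the contract above; the nested field names (`grounding`, `citation`, `refusal`), the four-digit ID pattern, the `hasRequiredFields` helper, and the example row are assumptions for illustration, not the dataset's actual schema.

```typescript
// Hypothetical shape of one prompt row. Nested field names are assumptions.
interface ExpectedBehavior {
  grounding: string[]; // assumed name: grounding requirements
  citation: string[];  // assumed name: per-claim citation requirements
  refusal: string;     // assumed name: behavior when evidence is missing
  mustNot: string[];   // explicit failure behaviors
}

interface PromptRow {
  id: string;
  domain: string;
  difficulty: string;
  prompt: string;
  expectedBehavior: ExpectedBehavior;
}

// The contract gives `MM-EVAL-xxxx`; treating `xxxx` as four digits is an assumption.
const ID_PATTERN = /^MM-EVAL-\d{4}$/;

// Returns true only when every required field is present and non-empty.
function hasRequiredFields(row: PromptRow): boolean {
  const eb = row.expectedBehavior;
  return (
    ID_PATTERN.test(row.id) &&
    row.domain.length > 0 &&
    row.difficulty.length > 0 &&
    row.prompt.trim().length > 0 &&
    eb.grounding.length > 0 &&
    eb.citation.length > 0 &&
    eb.refusal.length > 0 &&
    eb.mustNot.length > 0
  );
}

// Invented placeholder row, not taken from the dataset.
const exampleRow: PromptRow = {
  id: "MM-EVAL-0001",
  domain: "example-domain",
  difficulty: "hard",
  prompt: "Example prompt text.",
  expectedBehavior: {
    grounding: ["every claim maps to a retrieved record"],
    citation: ["entity identifier", "property path", "source URL", "retrieval timestamp"],
    refusal: "refuse when evidence is missing",
    mustNot: ["state uncited claims"],
  },
};
```

Calling `hasRequiredFields(exampleRow)` returns `true`; a row with an empty `mustNot` list or a malformed `id` returns `false`.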
Reliability rubric
Dataset-level thresholds:
- `groundedPrecisionThreshold = 0.95`
- `citationCoverageThreshold = 0.95`
- `citationFreshnessThreshold = 0.95`
- refusal policy = `cite-or-refuse`
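As a rough illustration, the dataset-level thresholds could be applied like this. The threshold names and values come from the rubric above; the `DatasetScores` shape, the `failedThresholds` helper, and the choice to treat a score equal to the threshold as passing are assumptions.

```typescript
// Threshold names and values as listed in the rubric.
const thresholds = {
  groundedPrecisionThreshold: 0.95,
  citationCoverageThreshold: 0.95,
  citationFreshnessThreshold: 0.95,
} as const;

// Hypothetical score shape for one evaluation run.
interface DatasetScores {
  groundedPrecision: number;
  citationCoverage: number;
  citationFreshness: number;
}

// Returns the names of metrics that fall below their threshold.
// Assumption: a score exactly equal to the threshold passes.
function failedThresholds(scores: DatasetScores): string[] {
  const failures: string[] = [];
  if (scores.groundedPrecision < thresholds.groundedPrecisionThreshold) {
    failures.push("groundedPrecision");
  }
  if (scores.citationCoverage < thresholds.citationCoverageThreshold) {
    failures.push("citationCoverage");
  }
  if (scores.citationFreshness < thresholds.citationFreshnessThreshold) {
    failures.push("citationFreshness");
  }
  return failures;
}
```

An empty result means the run meets every dataset-level threshold; otherwise the returned names identify which metrics need attention.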
Conformance enforcement
Executable quality test:
- `tests/quality/ai-eval-golden-dataset.test.ts`
The test enforces:
- prompt minimum (`>=100`)
- prompt ID uniqueness
- domain diversity
- required grounding/citation/refusal fields for every prompt row
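The dataset-wide checks listed above (minimum count, ID uniqueness, domain diversity) might look roughly like the following. This is not the content of `tests/quality/ai-eval-golden-dataset.test.ts`; the `datasetProblems` helper and the `minDomains` default are hypothetical, since the document does not state how many distinct domains count as diverse.

```typescript
// Minimal view of a prompt row needed for dataset-wide checks.
interface RowSummary {
  id: string;
  domain: string;
}

// Returns a list of human-readable problems; an empty list means the
// dataset passes. `minDomains` is an assumed parameter, not a documented value.
function datasetProblems(
  rows: RowSummary[],
  minPrompts = 100,
  minDomains = 2,
): string[] {
  const problems: string[] = [];

  if (rows.length < minPrompts) {
    problems.push(`expected >= ${minPrompts} prompts, got ${rows.length}`);
  }

  // A Set drops duplicates, so a size mismatch means at least one repeated ID.
  const uniqueIds = new Set(rows.map((r) => r.id));
  if (uniqueIds.size !== rows.length) {
    problems.push("duplicate prompt IDs");
  }

  const domains = new Set(rows.map((r) => r.domain));
  if (domains.size < minDomains) {
    problems.push(`expected >= ${minDomains} domains, got ${domains.size}`);
  }

  return problems;
}
```

Returning a list rather than throwing on the first failure lets a test report every violated rule in one run.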
CI gate (AI-layer PRs)
Eval harness:
- `scripts/ai-eval-gate.ts`
- `src/services/ai-eval-harness.ts`
Commands:
- `pnpm ai:eval:report` (writes artifact without failing)
- `pnpm ai:eval:gate` (fails when thresholds/regressions are not met)
- `pnpm ai:eval:baseline:record` (records/updates baseline for the current dataset/model/prompt identity)
Workflow:
- `.github/workflows/ai-eval-gate.yml`
Gate metrics:
- faithfulness
- relevance
- citation accuracy
- citation freshness
Output artifacts:
- `artifacts/evals/ai-eval-gate-latest.json`
- `artifacts/evals/runs/ai-eval-gate-<timestamp>.json`
- `artifacts/evals/trend-index.json`
- `artifacts/evals/summary.md`
- `artifacts/evals/summary.md` includes review priority when at least two retained runs exist.
- `/api/ai-evals/summary` exposes the same review priority and severity history as JSON for agents.
- `/api/openapi` references `AiEvalSummaryResponse` for `/api/ai-evals/summary`, and route tests parse payloads with `aiEvalSummaryResponseSchema`.
- `/api/ai-evals/summary` returns `schemaVersion: 1`; unsupported versions must be rejected by agent contract tests before consumption.
Schema migration notes:
- `schemaVersion: 1` is the current agent contract for `/api/ai-evals/summary`.
- `schemaVersion: 2` is reserved for future additive agent fields or review-priority metadata changes that cannot be represented safely in v1.
- Do not accept v2 payloads until a v2 parser, v2 fixture, OpenAPI schema update, and migration notes land in the same RSI cycle.
- Fixture compatibility tests keep `tests/fixtures/ai-eval-summary/schema-v1-ready.json` accepted and `schema-v2-planned.json` rejected until that migration exists.
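The version rule above (accept v1, reject everything else until a v2 migration lands) can be sketched as a small guard. This is a standalone stand-in and does not reproduce `aiEvalSummaryResponseSchema`; the `assertSupportedSchemaVersion` name and the error messages are assumptions.

```typescript
// Only v1 is accepted; v2 is reserved and must be rejected for now.
const SUPPORTED_SCHEMA_VERSIONS: readonly number[] = [1];

// Returns the schema version when supported, otherwise throws.
// Hypothetical helper: a real consumer would parse the full payload too.
function assertSupportedSchemaVersion(payload: unknown): number {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("summary payload must be an object");
  }
  const version = (payload as { schemaVersion?: unknown }).schemaVersion;
  if (typeof version !== "number" || !SUPPORTED_SCHEMA_VERSIONS.includes(version)) {
    throw new Error(`unsupported schemaVersion: ${String(version)}`);
  }
  return version;
}
```

An agent would call this before reading any other field, so a `schemaVersion: 2` payload fails loudly instead of being partially consumed.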
CI summary behavior:
- `.github/workflows/ai-eval-gate.yml` appends `artifacts/evals/summary.md` to `$GITHUB_STEP_SUMMARY`.
- The same workflow uploads `artifacts/evals/` as `ai-eval-artifacts`.
- Summary badges include status, faithfulness, relevance, citation accuracy, `citationFreshness`, and pass rate.
- Freshness-aging alerts warn when `citationFreshness` stays flat/green while the oldest cited evidence approaches the max policy window.
- Review priority summarizes the latest-vs-previous severity (`regression`, `watch`, `stable`, or `improved`) directly in the CI summary.
- Malformed `diffSeverityPolicy` values fail validation instead of silently falling back.
- CI logs emit a GitHub warning annotation when review priority is `watch` or `regression`.
CI annotation examples:
::warning title=AI eval regression::Review priority is regression. current-run vs previous-run. Review severity%3A regression.
::warning title=AI eval watch::Review priority is watch. current-run vs previous-run. Review severity%3A watch.
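A sketch of how such annotation lines could be produced. The `::warning title=...::message` form is GitHub's workflow command syntax; the `warningAnnotation` helper, its parameters, and the decision to percent-encode `:` inside the message are assumptions inferred from the examples above, not the harness's actual code.

```typescript
type Severity = "regression" | "watch" | "stable" | "improved";

// Percent-encodes characters that would otherwise break the command line.
// Encoding ":" in the message mirrors the examples above (an assumption).
function escapeData(text: string): string {
  return text
    .replace(/%/g, "%25") // must run first so later escapes are not double-encoded
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A")
    .replace(/:/g, "%3A");
}

// Returns a warning annotation for `watch` or `regression`, or null for
// severities that should not annotate the CI log.
function warningAnnotation(
  severity: Severity,
  latestLabel: string,
  previousLabel: string,
): string | null {
  if (severity !== "watch" && severity !== "regression") {
    return null;
  }
  const message =
    `Review priority is ${severity}. ${latestLabel} vs ${previousLabel}. ` +
    `Review severity: ${severity}.`;
  return `::warning title=AI eval ${severity}::${escapeData(message)}`;
}
```

With `warningAnnotation("regression", "current-run", "previous-run")` the result matches the first example line; `stable` and `improved` yield `null`, so nothing is logged.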
Dashboard:
- `/ai-evals` is the read-only local artifact dashboard for humans and AI-assisted operators.
- It reads ignored local `artifacts/evals/*` files when present and shows an explicit non-failing empty state when absent.
- It surfaces latest metrics, run identity, `citationFreshness`, freshness-aging pressure, active warnings, trend history, and artifact timestamps.
- It includes a latest-vs-previous diff for retained runs with metric deltas, freshness-aging pressure movement, and “what changed” notes for fast review.
- It labels latest-vs-previous diffs as `regression`, `watch`, `stable`, or `improved` using explicit metric-drop and freshness-aging thresholds.
- It reads those thresholds from `config/ai-eval-regression-policy.json` and shows policy-window severity distribution counts.
- It shows a compact severity sparkline/history where `!` = regression, `~` = watch, `=` = stable, and `+` = improved.
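The severity glyph mapping above is simple enough to sketch directly. The glyphs are taken from the description; the `severitySparkline` and `severityCounts` helper names are hypothetical, and the oldest-first ordering is an assumption.

```typescript
type Severity = "regression" | "watch" | "stable" | "improved";

// Glyphs as documented: ! regression, ~ watch, = stable, + improved.
const SEVERITY_GLYPH: Record<Severity, string> = {
  regression: "!",
  watch: "~",
  stable: "=",
  improved: "+",
};

// Renders a history (assumed oldest first) as a compact one-line sparkline.
function severitySparkline(history: Severity[]): string {
  return history.map((s) => SEVERITY_GLYPH[s]).join("");
}

// Counts how often each severity appears, for a distribution summary.
function severityCounts(history: Severity[]): Record<Severity, number> {
  const counts: Record<Severity, number> = {
    regression: 0,
    watch: 0,
    stable: 0,
    improved: 0,
  };
  for (const s of history) {
    counts[s] += 1;
  }
  return counts;
}
```

For example, `severitySparkline(["stable", "watch", "regression", "improved"])` yields `=~!+`.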
Retention controls:
- `METAMUSEUM_EVAL_RETENTION_MAX_RUNS` (default `200`)
- optional CLI override: `--retention-max-runs=<n>`
- `severityHistoryPolicy.maxComparisons` in `config/ai-eval-regression-policy.json` controls how many latest-vs-previous severity comparisons are shown to humans and agents.
- Orphaned JSON files under `artifacts/evals/runs/` are pruned when they fall outside the retained trend index; non-JSON operator notes are preserved.
- Retention pruning reports `delete` or `dry-run` mode in `artifacts/evals/summary.md`, including retained, orphaned, deleted, and preserved-file counts.
- The latest retention pruning report is also exposed through `/api/ai-evals/summary` as `summary.retentionPruneReport` for agent consumers.
- Generated CI summary markdown is snapshot-locked by `tests/fixtures/ai-eval-summary/summary-snapshot.md`.
- `/api/openapi` includes the AI eval summary schema migration compatibility table so agents can discover current and planned summary versions.
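The pruning rule described in the retention controls above can be sketched as a pure planning function that touches no files. The `PrunePlan` shape and the `planPrune` helper are hypothetical; only the categories (retained, orphaned, deleted, preserved) and the `delete`/`dry-run` modes come from the text.

```typescript
interface PrunePlan {
  mode: "delete" | "dry-run";
  retained: string[];  // JSON run files still listed in the trend index
  orphaned: string[];  // JSON run files no longer listed in the trend index
  deleted: string[];   // files that would actually be removed in this mode
  preserved: string[]; // non-JSON operator notes, never pruned
}

// Classifies files found under the runs directory. File names are compared
// as plain strings; real path handling is out of scope for this sketch.
function planPrune(
  filesInRunsDir: string[],
  retainedInIndex: string[],
  dryRun: boolean,
): PrunePlan {
  const keep = new Set(retainedInIndex);
  const retained: string[] = [];
  const orphaned: string[] = [];
  const preserved: string[] = [];

  for (const file of filesInRunsDir) {
    if (!file.endsWith(".json")) {
      preserved.push(file);
    } else if (keep.has(file)) {
      retained.push(file);
    } else {
      orphaned.push(file);
    }
  }

  return {
    mode: dryRun ? "dry-run" : "delete",
    retained,
    orphaned,
    // A dry run reports orphans but deletes nothing.
    deleted: dryRun ? [] : [...orphaned],
    preserved,
  };
}
```

Separating planning from deletion keeps the dry-run path trivially safe, and the four list lengths map directly onto the counts reported in the summary.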
Regression thresholds + fail-fast citation drift
Policy file:
- `config/ai-eval-regression-policy.json`
Policy includes:
- absolute floor thresholds (`faithfulness`, `relevance`, `citationAccuracy`, `citationFreshness`, `passRate`)
- drift thresholds (`max*Drop`) for regression comparisons
- diff severity thresholds (`diffSeverityPolicy`) for dashboard/CI review priority
- `diffSeverityPolicy` schema/order validation before severity classification
- severity-history retention window (`severityHistoryPolicy.maxComparisons`) with explicit malformed-policy failure tests
- `failFastOnCitationDrift` control
- `failFastOnCitationFreshnessDrift` control
- versioned baselines keyed by:
  - dataset ID + dataset version
  - model version
  - prompt version
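One way to picture the versioned baseline key is a string built from the four identity parts. The `BaselineIdentity` shape, the `baselineKey` helper, the `|` separator, and the sample values are assumptions; the policy file's real key format is not shown in this document.

```typescript
// The four identity parts named above.
interface BaselineIdentity {
  datasetId: string;
  datasetVersion: string;
  modelVersion: string;
  promptVersion: string;
}

// Builds a stable lookup key. Each part is URI-encoded so that a "|" inside
// a version string cannot collide with the separator (hypothetical format).
function baselineKey(identity: BaselineIdentity): string {
  return [
    identity.datasetId,
    identity.datasetVersion,
    identity.modelVersion,
    identity.promptVersion,
  ]
    .map((part) => encodeURIComponent(part))
    .join("|");
}
```

Changing any one part, such as the model version, produces a different key, so a new model or prompt revision is compared against its own baseline rather than a stale one.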
Version identity env vars:
- `METAMUSEUM_EVAL_MODEL_VERSION`
- `METAMUSEUM_EVAL_PROMPT_VERSION`
- `METAMUSEUM_EVAL_RETENTION_MAX_RUNS`
Fail-fast behavior:
- If the citation accuracy drop exceeds `maxCitationAccuracyDrop`, the gate exits immediately with a failure.
- If the citation freshness drop exceeds `maxCitationFreshnessDrop`, the gate exits immediately with a failure.
- If citation freshness is flat or improved while `oldestAgeRatio` crosses the aging-pressure threshold, the summary emits a `citation_freshness_aging_pressure` warning for proactive review.
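The fail-fast and aging-pressure rules above could be expressed roughly as follows. The policy field names `maxCitationAccuracyDrop`, `maxCitationFreshnessDrop`, `failFastOnCitationDrift`, `failFastOnCitationFreshnessDrift` and the warning code `citation_freshness_aging_pressure` come from this document; `agingPressureThreshold`, the helper names, the returned failure reason strings, and the use of `>=` for "crosses" are assumptions.

```typescript
interface CitationMetrics {
  citationAccuracy: number;
  citationFreshness: number;
}

interface DriftPolicy {
  maxCitationAccuracyDrop: number;
  maxCitationFreshnessDrop: number;
  failFastOnCitationDrift: boolean;
  failFastOnCitationFreshnessDrift: boolean;
  agingPressureThreshold: number; // hypothetical field name
}

// Returns a reason string when the gate should exit immediately, else null.
// The reason strings are illustrative, not the gate's real output.
function failFastReason(
  baseline: CitationMetrics,
  current: CitationMetrics,
  policy: DriftPolicy,
): string | null {
  const accuracyDrop = baseline.citationAccuracy - current.citationAccuracy;
  if (policy.failFastOnCitationDrift && accuracyDrop > policy.maxCitationAccuracyDrop) {
    return "citation_accuracy_drift";
  }
  const freshnessDrop = baseline.citationFreshness - current.citationFreshness;
  if (policy.failFastOnCitationFreshnessDrift && freshnessDrop > policy.maxCitationFreshnessDrop) {
    return "citation_freshness_drift";
  }
  return null;
}

// Emits the aging-pressure warning code when freshness looks healthy (flat
// or improved) but the oldest evidence is close to the policy window.
function agingPressureWarning(
  baseline: CitationMetrics,
  current: CitationMetrics,
  oldestAgeRatio: number,
  policy: DriftPolicy,
): string | null {
  const flatOrImproved = current.citationFreshness >= baseline.citationFreshness;
  return flatOrImproved && oldestAgeRatio >= policy.agingPressureThreshold
    ? "citation_freshness_aging_pressure"
    : null;
}
```

Keeping the warning separate from the failure check reflects the text: drift beyond the allowed drop stops the gate, while aging pressure only prompts proactive review.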