{"id":"evals/golden-museum-questions","relativePath":"evals/golden-museum-questions.md","title":"Golden Eval Dataset: Complex Museum Questions","markdown":"# Golden Eval Dataset: Complex Museum Questions\n\nThis document defines the golden dataset used for AI/LLM reliability checks on grounding and citation behavior.\n\nDataset file:\n\n- `evals/golden-museum-questions.v1.json`\n\n## Scope\n\n- Minimum prompt count: `100` (current: `120`)\n- Focus: complex museum/research questions where grounding quality is high risk\n- Required behavior: `cite-or-refuse`\n\n## Prompt contract\n\nEach prompt row includes:\n\n- stable `id` (`MM-EVAL-xxxx`)\n- `domain` and `difficulty`\n- full natural-language `prompt`\n- `expectedBehavior` with:\n  - grounding requirements\n  - per-claim citation requirements\n  - refusal behavior when evidence is missing\n  - explicit failure behaviors (`mustNot`)\n\nCitation expectations are strict:\n\n- per-claim citations\n- entity identifier references\n- property-path scoped references\n- source URL and retrieval timestamp\n\n## Reliability rubric\n\nDataset-level thresholds:\n\n- `groundedPrecisionThreshold = 0.95`\n- `citationCoverageThreshold = 0.95`\n- `citationFreshnessThreshold = 0.95`\n- refusal policy = `cite-or-refuse`\n\n## Conformance enforcement\n\nExecutable quality test:\n\n- `tests/quality/ai-eval-golden-dataset.test.ts`\n\nThe test enforces:\n\n- prompt minimum (`>=100`)\n- prompt ID uniqueness\n- domain diversity\n- required grounding/citation/refusal fields for every prompt row\n\n## CI gate (AI-layer PRs)\n\nEval harness:\n\n- `scripts/ai-eval-gate.ts`\n- `src/services/ai-eval-harness.ts`\n\nCommands:\n\n- `pnpm ai:eval:report` (writes artifact without failing)\n- `pnpm ai:eval:gate` (fails when thresholds/regressions are not met)\n- `pnpm ai:eval:baseline:record` (records/updates baseline for the current dataset/model/prompt identity)\n\nWorkflow:\n\n- `.github/workflows/ai-eval-gate.yml`\n\nGate metrics:\n\n- faithfulness\n- relevance\n- citation accuracy\n- citation freshness\n\nOutput artifact:\n\n- `artifacts/evals/ai-eval-gate-latest.json`\n- `artifacts/evals/runs/ai-eval-gate-<timestamp>.json`\n- `artifacts/evals/trend-index.json`\n- `artifacts/evals/summary.md`\n- `artifacts/evals/summary.md` includes review priority when at least two retained runs exist.\n- `/api/ai-evals/summary` exposes the same review priority and severity history as JSON for agents.\n- `/api/openapi` references `AiEvalSummaryResponse` for `/api/ai-evals/summary`, and route tests parse payloads with `aiEvalSummaryResponseSchema`.\n- `/api/ai-evals/summary` returns `schemaVersion: 1`; unsupported versions must be rejected by agent contract tests before consumption.\n\nSchema migration notes:\n\n- `schemaVersion: 1` is the current agent contract for `/api/ai-evals/summary`.\n- `schemaVersion: 2` is reserved for future additive agent fields or review-priority metadata changes that cannot be represented safely in v1.\n- Do not accept v2 payloads until a v2 parser, v2 fixture, OpenAPI schema update, and migration notes land in the same RSI cycle.\n- Fixture compatibility tests keep `tests/fixtures/ai-eval-summary/schema-v1-ready.json` accepted and `schema-v2-planned.json` rejected until that migration exists.\n\nCI summary behavior:\n\n- `.github/workflows/ai-eval-gate.yml` appends `artifacts/evals/summary.md` to `$GITHUB_STEP_SUMMARY`.\n- The same workflow uploads `artifacts/evals/` as `ai-eval-artifacts`.\n- Summary badges include status, faithfulness, relevance, citation accuracy, `citationFreshness`, and pass rate.\n- Freshness-aging alerts warn when `citationFreshness` stays flat/green while the oldest cited evidence approaches the max policy window.\n- Review priority summarizes the latest-vs-previous severity (`regression`, `watch`, `stable`, or `improved`) directly in the CI summary.\n- Malformed `diffSeverityPolicy` values fail validation instead of silently falling back.\n- CI logs emit a GitHub warning annotation when review priority is `watch` or `regression`.\n\nCI annotation examples:\n\n```text\n::warning title=AI eval regression::Review priority is regression. current-run vs previous-run. Review severity%3A regression.\n::warning title=AI eval watch::Review priority is watch. current-run vs previous-run. Review severity%3A watch.\n```\n\nDashboard:\n\n- `/ai-evals` is the read-only local artifact dashboard for humans and AI-assisted operators.\n- It reads ignored local `artifacts/evals/*` files when present and shows an explicit non-failing empty state when absent.\n- It surfaces latest metrics, run identity, `citationFreshness`, freshness-aging pressure, active warnings, trend history, and artifact timestamps.\n- It includes a latest-vs-previous diff for retained runs with metric deltas, freshness-aging pressure movement, and “what changed” notes for fast review.\n- It labels latest-vs-previous diffs as `regression`, `watch`, `stable`, or `improved` using explicit metric-drop and freshness-aging thresholds.\n- It reads those thresholds from `config/ai-eval-regression-policy.json` and shows policy-window severity distribution counts.\n- It shows a compact severity sparkline/history where `!` = regression, `~` = watch, `=` = stable, and `+` = improved.\n\nRetention controls:\n\n- `METAMUSEUM_EVAL_RETENTION_MAX_RUNS` (default `200`)\n- optional CLI override: `--retention-max-runs=<n>`\n- `severityHistoryPolicy.maxComparisons` in `config/ai-eval-regression-policy.json` controls how many latest-vs-previous severity comparisons are shown to humans and agents.\n- Orphaned JSON files under `artifacts/evals/runs/` are pruned when they fall outside the retained trend index; non-JSON operator notes are preserved.\n- Retention pruning reports `delete` or `dry-run` mode in `artifacts/evals/summary.md`, including retained, orphaned, deleted, and preserved-file counts.\n- The latest retention pruning report is also exposed through `/api/ai-evals/summary` as `summary.retentionPruneReport` for agent consumers.\n- Generated CI summary markdown is snapshot-locked by `tests/fixtures/ai-eval-summary/summary-snapshot.md`.\n- `/api/openapi` includes the AI eval summary schema migration compatibility table so agents can discover current and planned summary versions.\n\n## Regression thresholds + fail-fast citation drift\n\nPolicy file:\n\n- `config/ai-eval-regression-policy.json`\n\nPolicy includes:\n\n- absolute floor thresholds (`faithfulness`, `relevance`, `citationAccuracy`, `citationFreshness`, `passRate`)\n- drift thresholds (`max*Drop`) for regression comparisons\n- diff severity thresholds (`diffSeverityPolicy`) for dashboard/CI review priority\n- `diffSeverityPolicy` schema/order validation before severity classification\n- severity-history retention window (`severityHistoryPolicy.maxComparisons`) with explicit malformed-policy failure tests\n- `failFastOnCitationDrift` control\n- `failFastOnCitationFreshnessDrift` control\n- versioned baselines keyed by:\n  - dataset ID + dataset version\n  - model version\n  - prompt version\n\nVersion identity env vars:\n\n- `METAMUSEUM_EVAL_MODEL_VERSION`\n- `METAMUSEUM_EVAL_PROMPT_VERSION`\n- `METAMUSEUM_EVAL_RETENTION_MAX_RUNS`\n\nFail-fast behavior:\n\n- If citation accuracy drop exceeds `maxCitationAccuracyDrop`, gate exits immediately with failure.\n- If citation freshness drop exceeds `maxCitationFreshnessDrop`, gate exits immediately with failure.\n- If citation freshness is flat or improved while `oldestAgeRatio` crosses the aging-pressure threshold, the summary emits a `citation_freshness_aging_pressure` warning for proactive review.\n","sections":[{"level":2,"heading":"Scope","anchor":"scope"},{"level":2,"heading":"Prompt contract","anchor":"prompt-contract"},{"level":2,"heading":"Reliability rubric","anchor":"reliability-rubric"},{"level":2,"heading":"Conformance enforcement","anchor":"conformance-enforcement"},{"level":2,"heading":"CI gate (AI-layer PRs)","anchor":"ci-gate-ai-layer-prs"},{"level":2,"heading":"Regression thresholds + fail-fast citation drift","anchor":"regression-thresholds-fail-fast-citation-drift"}],"html":"<h1 id=\"golden-eval-dataset-complex-museum-questions\">Golden Eval Dataset: Complex Museum Questions</h1>\n<p>This document defines the golden dataset used for AI/LLM reliability checks on grounding and citation behavior.</p>\n<p>Dataset file:</p>\n<ul><li>`evals/golden-museum-questions.v1.json`</li></ul>\n<h2 id=\"scope\">Scope</h2>\n<ul><li>Minimum prompt count: `100` (current: `120`)</li><li>Focus: complex museum/research questions where grounding quality is high risk</li><li>Required behavior: `cite-or-refuse`</li></ul>\n<h2 id=\"prompt-contract\">Prompt contract</h2>\n<p>Each prompt row includes:</p>\n<ul><li>stable `id` (`MM-EVAL-xxxx`)</li><li>`domain` and `difficulty`</li><li>full natural-language `prompt`</li><li>`expectedBehavior` with:</li><li>grounding requirements</li><li>per-claim citation requirements</li><li>refusal behavior when evidence is missing</li><li>explicit failure behaviors (`mustNot`)</li></ul>\n<p>Citation expectations are strict:</p>\n<ul><li>per-claim citations</li><li>entity identifier references</li><li>property-path scoped references</li><li>source URL and retrieval timestamp</li></ul>\n<h2 id=\"reliability-rubric\">Reliability rubric</h2>\n<p>Dataset-level thresholds:</p>\n<ul><li>`groundedPrecisionThreshold = 0.95`</li><li>`citationCoverageThreshold = 0.95`</li><li>`citationFreshnessThreshold = 0.95`</li><li>refusal policy = `cite-or-refuse`</li></ul>\n<h2 id=\"conformance-enforcement\">Conformance enforcement</h2>\n<p>Executable quality test:</p>\n<ul><li>`tests/quality/ai-eval-golden-dataset.test.ts`</li></ul>\n<p>The test enforces:</p>\n<ul><li>prompt minimum (`&gt;=100`)</li><li>prompt ID uniqueness</li><li>domain diversity</li><li>required grounding/citation/refusal fields for every prompt row</li></ul>\n<h2 id=\"ci-gate-ai-layer-prs\">CI gate (AI-layer PRs)</h2>\n<p>Eval harness:</p>\n<ul><li>`scripts/ai-eval-gate.ts`</li><li>`src/services/ai-eval-harness.ts`</li></ul>\n<p>Commands:</p>\n<ul><li>`pnpm ai:eval:report` (writes artifact without failing)</li><li>`pnpm ai:eval:gate` (fails when thresholds/regressions are not met)</li><li>`pnpm ai:eval:baseline:record` (records/updates baseline for the current dataset/model/prompt identity)</li></ul>\n<p>Workflow:</p>\n<ul><li>`.github/workflows/ai-eval-gate.yml`</li></ul>\n<p>Gate metrics:</p>\n<ul><li>faithfulness</li><li>relevance</li><li>citation accuracy</li><li>citation freshness</li></ul>\n<p>Output artifact:</p>\n<ul><li>`artifacts/evals/ai-eval-gate-latest.json`</li><li>`artifacts/evals/runs/ai-eval-gate-&lt;timestamp&gt;.json`</li><li>`artifacts/evals/trend-index.json`</li><li>`artifacts/evals/summary.md`</li><li>`artifacts/evals/summary.md` includes review priority when at least two retained runs exist.</li><li>`/api/ai-evals/summary` exposes the same review priority and severity history as JSON for agents.</li><li>`/api/openapi` references `AiEvalSummaryResponse` for `/api/ai-evals/summary`, and route tests parse payloads with `aiEvalSummaryResponseSchema`.</li><li>`/api/ai-evals/summary` returns `schemaVersion: 1`; unsupported versions must be rejected by agent contract tests before consumption.</li></ul>\n<p>Schema migration notes:</p>\n<ul><li>`schemaVersion: 1` is the current agent contract for `/api/ai-evals/summary`.</li><li>`schemaVersion: 2` is reserved for future additive agent fields or review-priority metadata changes that cannot be represented safely in v1.</li><li>Do not accept v2 payloads until a v2 parser, v2 fixture, OpenAPI schema update, and migration notes land in the same RSI cycle.</li><li>Fixture compatibility tests keep `tests/fixtures/ai-eval-summary/schema-v1-ready.json` accepted and `schema-v2-planned.json` rejected until that migration exists.</li></ul>\n<p>CI summary behavior:</p>\n<ul><li>`.github/workflows/ai-eval-gate.yml` appends `artifacts/evals/summary.md` to `$GITHUB_STEP_SUMMARY`.</li><li>The same workflow uploads `artifacts/evals/` as `ai-eval-artifacts`.</li><li>Summary badges include status, faithfulness, relevance, citation accuracy, `citationFreshness`, and pass rate.</li><li>Freshness-aging alerts warn when `citationFreshness` stays flat/green while the oldest cited evidence approaches the max policy window.</li><li>Review priority summarizes the latest-vs-previous severity (`regression`, `watch`, `stable`, or `improved`) directly in the CI summary.</li><li>Malformed `diffSeverityPolicy` values fail validation instead of silently falling back.</li><li>CI logs emit a GitHub warning annotation when review priority is `watch` or `regression`.</li></ul>\n<p>CI annotation examples:</p>\n<pre><code>\n::warning title=AI eval regression::Review priority is regression. current-run vs previous-run. Review severity%3A regression.\n::warning title=AI eval watch::Review priority is watch. current-run vs previous-run. Review severity%3A watch.\n</code></pre>\n<p>Dashboard:</p>\n<ul><li>`/ai-evals` is the read-only local artifact dashboard for humans and AI-assisted operators.</li><li>It reads ignored local `artifacts/evals/*` files when present and shows an explicit non-failing empty state when absent.</li><li>It surfaces latest metrics, run identity, `citationFreshness`, freshness-aging pressure, active warnings, trend history, and artifact timestamps.</li><li>It includes a latest-vs-previous diff for retained runs with metric deltas, freshness-aging pressure movement, and “what changed” notes for fast review.</li><li>It labels latest-vs-previous diffs as `regression`, `watch`, `stable`, or `improved` using explicit metric-drop and freshness-aging thresholds.</li><li>It reads those thresholds from `config/ai-eval-regression-policy.json` and shows policy-window severity distribution counts.</li><li>It shows a compact severity sparkline/history where `!` = regression, `~` = watch, `=` = stable, and `+` = improved.</li></ul>\n<p>Retention controls:</p>\n<ul><li>`METAMUSEUM_EVAL_RETENTION_MAX_RUNS` (default `200`)</li><li>optional CLI override: `--retention-max-runs=&lt;n&gt;`</li><li>`severityHistoryPolicy.maxComparisons` in `config/ai-eval-regression-policy.json` controls how many latest-vs-previous severity comparisons are shown to humans and agents.</li><li>Orphaned JSON files under `artifacts/evals/runs/` are pruned when they fall outside the retained trend index; non-JSON operator notes are preserved.</li><li>Retention pruning reports `delete` or `dry-run` mode in `artifacts/evals/summary.md`, including retained, orphaned, deleted, and preserved-file counts.</li><li>The latest retention pruning report is also exposed through `/api/ai-evals/summary` as `summary.retentionPruneReport` for agent consumers.</li><li>Generated CI summary markdown is snapshot-locked by `tests/fixtures/ai-eval-summary/summary-snapshot.md`.</li><li>`/api/openapi` includes the AI eval summary schema migration compatibility table so agents can discover current and planned summary versions.</li></ul>\n<h2 id=\"regression-thresholds-fail-fast-citation-drift\">Regression thresholds + fail-fast citation drift</h2>\n<p>Policy file:</p>\n<ul><li>`config/ai-eval-regression-policy.json`</li></ul>\n<p>Policy includes:</p>\n<ul><li>absolute floor thresholds (`faithfulness`, `relevance`, `citationAccuracy`, `citationFreshness`, `passRate`)</li><li>drift thresholds (`max*Drop`) for regression comparisons</li><li>diff severity thresholds (`diffSeverityPolicy`) for dashboard/CI review priority</li><li>`diffSeverityPolicy` schema/order validation before severity classification</li><li>severity-history retention window (`severityHistoryPolicy.maxComparisons`) with explicit malformed-policy failure tests</li><li>`failFastOnCitationDrift` control</li><li>`failFastOnCitationFreshnessDrift` control</li><li>versioned baselines keyed by:</li><li>dataset ID + dataset version</li><li>model version</li><li>prompt version</li></ul>\n<p>Version identity env vars:</p>\n<ul><li>`METAMUSEUM_EVAL_MODEL_VERSION`</li><li>`METAMUSEUM_EVAL_PROMPT_VERSION`</li><li>`METAMUSEUM_EVAL_RETENTION_MAX_RUNS`</li></ul>\n<p>Fail-fast behavior:</p>\n<ul><li>If citation accuracy drop exceeds `maxCitationAccuracyDrop`, gate exits immediately with failure.</li><li>If citation freshness drop exceeds `maxCitationFreshnessDrop`, gate exits immediately with failure.</li><li>If citation freshness is flat or improved while `oldestAgeRatio` crosses the aging-pressure threshold, the summary emits a `citation_freshness_aging_pressure` warning for proactive review.</li></ul>","updatedAt":"2018-10-20T01:46:40.000Z","checksum":"2876a2b5e78dbfb093fb404058766bc1f6b8e2a8d841a42c30da8203ac546b23","checksumPrefix":"2876a2b5e78d","anchorCount":6,"lineCount":168,"rawUrl":"/api/docs/content?path=evals%2Fgolden-museum-questions.md","htmlUrl":"/docs?doc=evals%2Fgolden-museum-questions.md","apiUrl":"/api/docs/content?path=evals%2Fgolden-museum-questions.md"}