
    Current Document: Golden Eval Dataset: Complex Museum Questions

    Source updated 10/20/2018, 1:46:40 AM · SHA-256 2876a2b5e78d · 168 lines

    Canonical ID: evals/golden-museum-questions

    JSON for this doc: /api/docs/content?path=evals/golden-museum-questions.md

    Human link: /docs?doc=evals%2Fgolden-museum-questions.md

    Canonical API endpoint: /api/docs/content?path=evals%2Fgolden-museum-questions.md

    Golden Eval Dataset: Complex Museum Questions

    This document defines the golden dataset used for AI/LLM reliability checks on grounding and citation behavior.

    Dataset file:

    • `evals/golden-museum-questions.v1.json`

    Scope

    • Minimum prompt count: `100` (current: `120`)
    • Focus: complex museum/research questions where grounding quality is high risk
    • Required behavior: `cite-or-refuse`

    Prompt contract

    Each prompt row includes:

    • stable `id` (`MM-EVAL-xxxx`)
    • `domain` and `difficulty`
    • full natural-language `prompt`
    • `expectedBehavior` with:
        • grounding requirements
        • per-claim citation requirements
        • refusal behavior when evidence is missing
        • explicit failure behaviors (`mustNot`)

    Citation expectations are strict:

    • per-claim citations
    • entity identifier references
    • property-path scoped references
    • source URL and retrieval timestamp
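
The prompt contract above can be pictured as a typed row plus a small validator. This is a minimal sketch, not the schema used by `evals/golden-museum-questions.v1.json`: only `id`, `domain`, `difficulty`, `prompt`, `expectedBehavior`, and `mustNot` are named in this document, while the remaining field names, the four-digit reading of `xxxx`, and the sample row itself are assumptions made for illustration.

```typescript
// Hypothetical row shape. Field names inside ExpectedBehavior other than
// `mustNot` are assumptions, not the dataset's real keys.
interface ExpectedBehavior {
  grounding: string[];
  perClaimCitations: boolean;
  refusalWhenEvidenceMissing: string;
  mustNot: string[];
}

interface PromptRow {
  id: string;
  domain: string;
  difficulty: string;
  prompt: string;
  expectedBehavior: ExpectedBehavior;
}

// Assumes the `xxxx` suffix is exactly four digits.
const PROMPT_ID_PATTERN = /^MM-EVAL-\d{4}$/;

// Returns a list of human-readable problems; an empty list means the row passes.
function rowProblems(row: PromptRow): string[] {
  const problems: string[] = [];
  const eb = row.expectedBehavior;
  if (!PROMPT_ID_PATTERN.test(row.id)) problems.push("id does not match MM-EVAL-xxxx");
  if (row.prompt.trim() === "") problems.push("prompt is empty");
  if (eb.grounding.length === 0) problems.push("no grounding requirements");
  if (!eb.perClaimCitations) problems.push("per-claim citations not required");
  if (eb.refusalWhenEvidenceMissing.trim() === "") problems.push("no refusal behavior");
  if (eb.mustNot.length === 0) problems.push("no mustNot entries");
  return problems;
}

// Illustrative row only; it is not taken from the real dataset.
const sampleRow: PromptRow = {
  id: "MM-EVAL-0001",
  domain: "provenance",
  difficulty: "hard",
  prompt: "Who owned this painting between 1933 and 1945, and which records support each transfer?",
  expectedBehavior: {
    grounding: ["answer only from retrieved records"],
    perClaimCitations: true,
    refusalWhenEvidenceMissing: "refuse and state which evidence is missing",
    mustNot: ["invent owners", "cite sources that were not retrieved"],
  },
};
```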

    Reliability rubric

    Dataset-level thresholds:

    • `groundedPrecisionThreshold = 0.95`
    • `citationCoverageThreshold = 0.95`
    • `citationFreshnessThreshold = 0.95`
    • refusal policy = `cite-or-refuse`
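
As a rough illustration of how the dataset-level thresholds might be applied, the sketch below compares three scores against the values listed above. The score field names and the reading of "meets the threshold" as greater than or equal are assumptions; the real harness may compute or compare these differently.

```typescript
// Threshold values are copied from the rubric; the object layout is an assumption.
const rubric = {
  groundedPrecisionThreshold: 0.95,
  citationCoverageThreshold: 0.95,
  citationFreshnessThreshold: 0.95,
  refusalPolicy: "cite-or-refuse",
};

// Hypothetical score shape for one evaluation run.
interface DatasetScores {
  groundedPrecision: number;
  citationCoverage: number;
  citationFreshness: number;
}

// Returns the names of every threshold the scores fall below.
function failedThresholds(scores: DatasetScores): string[] {
  const failed: string[] = [];
  if (scores.groundedPrecision < rubric.groundedPrecisionThreshold) failed.push("groundedPrecision");
  if (scores.citationCoverage < rubric.citationCoverageThreshold) failed.push("citationCoverage");
  if (scores.citationFreshness < rubric.citationFreshnessThreshold) failed.push("citationFreshness");
  return failed;
}
```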

    Conformance enforcement

    Executable quality test:

    • `tests/quality/ai-eval-golden-dataset.test.ts`

    The test enforces:

    • prompt minimum (`>=100`)
    • prompt ID uniqueness
    • domain diversity
    • required grounding/citation/refusal fields for every prompt row
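
The dataset-level checks listed above (prompt minimum, ID uniqueness, domain diversity) could look roughly like the following. This is a sketch under stated assumptions and not the contents of `tests/quality/ai-eval-golden-dataset.test.ts`: the function name, the row shape, and the `minDomains` parameter are hypothetical, since this document does not say how domain diversity is measured.

```typescript
// Minimal row view needed for dataset-level checks (hypothetical type name).
interface DatasetRow {
  id: string;
  domain: string;
}

// Returns a list of problems; an empty list means the dataset passes.
// `minDomains` is an assumed way to express "domain diversity".
function datasetProblems(rows: DatasetRow[], minPrompts: number, minDomains: number): string[] {
  const problems: string[] = [];

  if (rows.length < minPrompts) {
    problems.push(`expected at least ${minPrompts} prompts, found ${rows.length}`);
  }

  // Collect every id that appears more than once.
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const row of rows) {
    if (seen.has(row.id)) duplicates.add(row.id);
    seen.add(row.id);
  }
  if (duplicates.size > 0) {
    problems.push(`duplicate ids: ${Array.from(duplicates).join(", ")}`);
  }

  const domains = new Set(rows.map((row) => row.domain));
  if (domains.size < minDomains) {
    problems.push(`expected at least ${minDomains} domains, found ${domains.size}`);
  }

  return problems;
}
```

A call such as `datasetProblems(rows, 100, 3)` would mirror the `>=100` minimum from the list above; the `3` is an arbitrary illustrative value.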

    CI gate (AI-layer PRs)

    Eval harness:

    • `scripts/ai-eval-gate.ts`
    • `src/services/ai-eval-harness.ts`

    Commands:

    • `pnpm ai:eval:report` (writes artifact without failing)
    • `pnpm ai:eval:gate` (fails when thresholds/regressions are not met)
    • `pnpm ai:eval:baseline:record` (records/updates baseline for the current dataset/model/prompt identity)

    Workflow:

    • `.github/workflows/ai-eval-gate.yml`

    Gate metrics:

    • faithfulness
    • relevance
    • citation accuracy
    • citation freshness

    Output artifacts:

    • `artifacts/evals/ai-eval-gate-latest.json`
    • `artifacts/evals/runs/ai-eval-gate-<timestamp>.json`
    • `artifacts/evals/trend-index.json`
    • `artifacts/evals/summary.md`
    • `artifacts/evals/summary.md` includes review priority when at least two retained runs exist.
    • `/api/ai-evals/summary` exposes the same review priority and severity history as JSON for agents.
    • `/api/openapi` references `AiEvalSummaryResponse` for `/api/ai-evals/summary`, and route tests parse payloads with `aiEvalSummaryResponseSchema`.
    • `/api/ai-evals/summary` returns `schemaVersion: 1`; unsupported versions must be rejected by agent contract tests before consumption.

    Schema migration notes:

    • `schemaVersion: 1` is the current agent contract for `/api/ai-evals/summary`.
    • `schemaVersion: 2` is reserved for future additive agent fields or review-priority metadata changes that cannot be represented safely in v1.
    • Do not accept v2 payloads until a v2 parser, v2 fixture, OpenAPI schema update, and migration notes land in the same RSI cycle.
    • Fixture compatibility tests keep `tests/fixtures/ai-eval-summary/schema-v1-ready.json` accepted and `schema-v2-planned.json` rejected until that migration exists.
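
The version rule above amounts to a strict guard in front of any consumer. The sketch below shows one way an agent might apply it; it is not the project's `aiEvalSummaryResponseSchema`, and the type and function names are hypothetical.

```typescript
// Hypothetical minimal view of a v1 summary payload.
type SummaryV1 = { schemaVersion: 1; [key: string]: unknown };

// Accepts only schemaVersion 1 and throws for anything else, including the reserved v2.
function requireSummaryV1(payload: unknown): SummaryV1 {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("summary payload must be an object");
  }
  const version = (payload as { schemaVersion?: unknown }).schemaVersion;
  if (version !== 1) {
    throw new Error(`unsupported schemaVersion: ${String(version)}`);
  }
  return payload as SummaryV1;
}
```

Rejecting by exact match rather than by `>= 1` keeps a v2 payload from being consumed silently before a v2 parser exists.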

    CI summary behavior:

    • `.github/workflows/ai-eval-gate.yml` appends `artifacts/evals/summary.md` to `$GITHUB_STEP_SUMMARY`.
    • The same workflow uploads `artifacts/evals/` as `ai-eval-artifacts`.
    • Summary badges include status, faithfulness, relevance, citation accuracy, `citationFreshness`, and pass rate.
    • Freshness-aging alerts warn when `citationFreshness` stays flat/green while the oldest cited evidence approaches the max policy window.
    • Review priority summarizes the latest-vs-previous severity (`regression`, `watch`, `stable`, or `improved`) directly in the CI summary.
    • Malformed `diffSeverityPolicy` values fail validation instead of silently falling back.
    • CI logs emit a GitHub warning annotation when review priority is `watch` or `regression`.

    CI annotation examples:

    
    ::warning title=AI eval regression::Review priority is regression. current-run vs previous-run. Review severity%3A regression.
    ::warning title=AI eval watch::Review priority is watch. current-run vs previous-run. Review severity%3A watch.
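
A small formatter that reproduces the two example lines above might look like this. It is a sketch inferred from the examples, not the code in `scripts/ai-eval-gate.ts`: the function names are hypothetical, and the escaping set (`%`, carriage return, newline, `:`, `,`) is an assumption based on the `%3A` seen in the examples.

```typescript
type ReviewPriority = "regression" | "watch" | "stable" | "improved";

// Percent-encode characters that could break a workflow command line.
// `%` must be replaced first so later escapes are not double-encoded.
function escapeCommandText(text: string): string {
  return text
    .replace(/%/g, "%25")
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A")
    .replace(/:/g, "%3A")
    .replace(/,/g, "%2C");
}

// Returns a warning annotation for `watch` and `regression`, otherwise null.
function reviewPriorityAnnotation(
  priority: ReviewPriority,
  currentLabel: string,
  previousLabel: string
): string | null {
  if (priority !== "watch" && priority !== "regression") return null;
  const title = `AI eval ${priority}`;
  const message =
    `Review priority is ${priority}. ${currentLabel} vs ${previousLabel}. ` +
    `Review severity: ${priority}.`;
  return `::warning title=${escapeCommandText(title)}::${escapeCommandText(message)}`;
}
```

Writing the returned string to standard output from a workflow step is what makes the annotation appear; `stable` and `improved` produce no annotation.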
    

    Dashboard:

    • `/ai-evals` is the read-only local artifact dashboard for humans and AI-assisted operators.
    • It reads ignored local `artifacts/evals/*` files when present and shows an explicit non-failing empty state when absent.
    • It surfaces latest metrics, run identity, `citationFreshness`, freshness-aging pressure, active warnings, trend history, and artifact timestamps.
    • It includes a latest-vs-previous diff for retained runs with metric deltas, freshness-aging pressure movement, and “what changed” notes for fast review.
    • It labels latest-vs-previous diffs as `regression`, `watch`, `stable`, or `improved` using explicit metric-drop and freshness-aging thresholds.
    • It reads those thresholds from `config/ai-eval-regression-policy.json` and shows policy-window severity distribution counts.
    • It shows a compact severity sparkline/history where `!` = regression, `~` = watch, `=` = stable, and `+` = improved.
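
The sparkline legend above maps directly onto a lookup table. The following sketch assumes the history is ordered oldest first and that the window size comes from something like `severityHistoryPolicy.maxComparisons`; the function name is hypothetical.

```typescript
type Severity = "regression" | "watch" | "stable" | "improved";

// Symbols are taken from the legend in this document.
const SEVERITY_SYMBOL: Record<Severity, string> = {
  regression: "!",
  watch: "~",
  stable: "=",
  improved: "+",
};

// Renders the most recent `maxComparisons` severities, oldest first.
function severitySparkline(history: Severity[], maxComparisons: number): string {
  if (maxComparisons <= 0) return "";
  return history
    .slice(-maxComparisons)
    .map((severity) => SEVERITY_SYMBOL[severity])
    .join("");
}
```

For example, `severitySparkline(["stable", "watch", "regression", "improved"], 3)` yields `~!+`.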

    Retention controls:

    • `METAMUSEUM_EVAL_RETENTION_MAX_RUNS` (default `200`)
    • optional CLI override: `--retention-max-runs=<n>`
    • `severityHistoryPolicy.maxComparisons` in `config/ai-eval-regression-policy.json` controls how many latest-vs-previous severity comparisons are shown to humans and agents.
    • Orphaned JSON files under `artifacts/evals/runs/` are pruned when they fall outside the retained trend index; non-JSON operator notes are preserved.
    • Retention pruning reports `delete` or `dry-run` mode in `artifacts/evals/summary.md`, including retained, orphaned, deleted, and preserved-file counts.
    • The latest retention pruning report is also exposed through `/api/ai-evals/summary` as `summary.retentionPruneReport` for agent consumers.
    • Generated CI summary markdown is snapshot-locked by `tests/fixtures/ai-eval-summary/summary-snapshot.md`.
    • `/api/openapi` includes the AI eval summary schema migration compatibility table so agents can discover current and planned summary versions.
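
The pruning rules above can be sketched as a pure planning function that decides what to delete before any file is touched. This is an illustration under assumptions: the trend index is treated as a list of run file names ordered oldest first, and the type, field, and function names are hypothetical rather than the shape of `summary.retentionPruneReport`.

```typescript
// Hypothetical report shape for one pruning pass.
interface PrunePlan {
  mode: "delete" | "dry-run";
  retained: string[];  // runs kept in the trend index
  orphaned: string[];  // JSON files outside the retained index
  deleted: string[];   // orphaned files actually removed (empty in dry-run)
  preserved: string[]; // non-JSON operator notes, never removed
}

function planRetentionPrune(
  indexedRuns: string[],   // run file names from the trend index, oldest first (assumed)
  filesInRunsDir: string[], // every file name found under artifacts/evals/runs/
  maxRuns: number,
  dryRun: boolean
): PrunePlan {
  const retained = maxRuns > 0 ? indexedRuns.slice(-maxRuns) : [];
  const keep = new Set(retained);
  const isJson = (name: string) => name.toLowerCase().endsWith(".json");

  const orphaned = filesInRunsDir.filter((name) => isJson(name) && !keep.has(name));
  const preserved = filesInRunsDir.filter((name) => !isJson(name));

  return {
    mode: dryRun ? "dry-run" : "delete",
    retained,
    orphaned,
    deleted: dryRun ? [] : orphaned,
    preserved,
  };
}
```

Keeping the plan separate from the deletion step makes the `dry-run` report and the `delete` report come from the same code path.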

    Regression thresholds + fail-fast citation drift

    Policy file:

    • `config/ai-eval-regression-policy.json`

    Policy includes:

    • absolute floor thresholds (`faithfulness`, `relevance`, `citationAccuracy`, `citationFreshness`, `passRate`)
    • drift thresholds (`max*Drop`) for regression comparisons
    • diff severity thresholds (`diffSeverityPolicy`) for dashboard/CI review priority
    • `diffSeverityPolicy` schema/order validation before severity classification
    • severity-history retention window (`severityHistoryPolicy.maxComparisons`) with explicit malformed-policy failure tests
    • `failFastOnCitationDrift` control
    • `failFastOnCitationFreshnessDrift` control
    • versioned baselines keyed by:
        • dataset ID + dataset version
        • model version
        • prompt version
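
To illustrate what "versioned baselines keyed by" dataset, model, and prompt identity could mean in practice, the sketch below builds a composite key and uses it for lookup. The key format, type names, and example values are all hypothetical; the real baseline storage may differ.

```typescript
// The four identity parts named in the policy description.
interface BaselineIdentity {
  datasetId: string;
  datasetVersion: string;
  modelVersion: string;  // e.g. taken from METAMUSEUM_EVAL_MODEL_VERSION
  promptVersion: string; // e.g. taken from METAMUSEUM_EVAL_PROMPT_VERSION
}

// Hypothetical baseline payload.
interface BaselineMetrics {
  citationAccuracy: number;
  citationFreshness: number;
}

// Assumed key format: parts joined with "|". Any change to one part yields a new key.
function baselineKey(identity: BaselineIdentity): string {
  return [
    identity.datasetId,
    identity.datasetVersion,
    identity.modelVersion,
    identity.promptVersion,
  ].join("|");
}

const baselines = new Map<string, BaselineMetrics>();

function recordBaseline(identity: BaselineIdentity, metrics: BaselineMetrics): void {
  baselines.set(baselineKey(identity), metrics);
}

function findBaseline(identity: BaselineIdentity): BaselineMetrics | undefined {
  return baselines.get(baselineKey(identity));
}
```

Under this scheme a new model or prompt version has no baseline until one is recorded, so a regression comparison never mixes runs with different identities.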

    Version identity env vars:

    • `METAMUSEUM_EVAL_MODEL_VERSION`
    • `METAMUSEUM_EVAL_PROMPT_VERSION`
    • `METAMUSEUM_EVAL_RETENTION_MAX_RUNS`

    Fail-fast behavior:

    • If the citation accuracy drop exceeds `maxCitationAccuracyDrop`, the gate exits immediately with failure.
    • If the citation freshness drop exceeds `maxCitationFreshnessDrop`, the gate exits immediately with failure.
    • If citation freshness is flat or improved while `oldestAgeRatio` crosses the aging-pressure threshold, the summary emits a `citation_freshness_aging_pressure` warning for proactive review.
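
The three fail-fast rules above can be expressed as one comparison between a baseline run and the current run. This is a sketch under assumptions, not the logic in `src/services/ai-eval-harness.ts`: `agingPressureThreshold` is a hypothetical policy field name, "drop" is taken to mean baseline minus current, "exceeds" is read as strictly greater, "crosses" as greater than or equal, and each fail-fast rule is assumed to apply only when its control flag is on.

```typescript
interface DriftPolicy {
  maxCitationAccuracyDrop: number;
  maxCitationFreshnessDrop: number;
  failFastOnCitationDrift: boolean;
  failFastOnCitationFreshnessDrift: boolean;
  agingPressureThreshold: number; // hypothetical field name
}

interface CitationMetrics {
  citationAccuracy: number;
  citationFreshness: number;
  oldestAgeRatio: number; // age of oldest cited evidence relative to the max policy window
}

interface DriftResult {
  failFastReason: string | null; // non-null means the gate should exit with failure
  warnings: string[];
}

function checkCitationDrift(
  baseline: CitationMetrics,
  current: CitationMetrics,
  policy: DriftPolicy
): DriftResult {
  const accuracyDrop = baseline.citationAccuracy - current.citationAccuracy;
  const freshnessDrop = baseline.citationFreshness - current.citationFreshness;

  if (policy.failFastOnCitationDrift && accuracyDrop > policy.maxCitationAccuracyDrop) {
    return { failFastReason: "citation accuracy drop exceeds maxCitationAccuracyDrop", warnings: [] };
  }
  if (policy.failFastOnCitationFreshnessDrift && freshnessDrop > policy.maxCitationFreshnessDrop) {
    return { failFastReason: "citation freshness drop exceeds maxCitationFreshnessDrop", warnings: [] };
  }

  // Freshness looks healthy (flat or improved) but the oldest evidence is near the window limit.
  const warnings: string[] = [];
  if (freshnessDrop <= 0 && current.oldestAgeRatio >= policy.agingPressureThreshold) {
    warnings.push("citation_freshness_aging_pressure");
  }
  return { failFastReason: null, warnings };
}
```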

    AI/agent quick endpoints