Dynamic Evaluations

Eval regimes that update as capabilities shift — moving from snapshot tests to continuous, capability-tracking evaluation.

Technical Governance · Exploring · Last reviewed May 1, 2026

This page is a stub. I’ve marked the territory but haven’t written my views here yet. The headings below are placeholders — the actual beliefs, uncertainties, and evidence are still in my notes. If you want my current take on this topic before it lands here, get in touch.

Where I currently stand

<Headline view: static evals decay quickly because the environment they probe shifts faster than the eval suites can be updated; "dynamic evaluations" is the right framing for what actually useful regulatory evals need to look like, but the operational design is wide open. This is a bridge category between Evals (technical) and Technical Standards (governance).>
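
To make the contrast with snapshot testing concrete, here is a minimal sketch of what a capability-tracking loop might look like. Everything in it is hypothetical: the names (`DynamicEvalRegime`, `generate_tasks`, the 0.2 threshold) are illustrative and do not refer to any real eval harness or regulatory proposal.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalResult:
    eval_version: str
    model_id: str
    score: float  # fraction of tasks on which the capability was elicited

@dataclass
class DynamicEvalRegime:
    """Regenerates and re-runs evals as models change, rather than freezing a suite."""
    generate_tasks: Callable[[], list]       # fresh task set each run (e.g. mutated prompts)
    run_suite: Callable[[str, list], float]  # (model_id, tasks) -> elicitation score
    threshold: float = 0.2                   # hypothetical regulatory trigger level
    history: list = field(default_factory=list)

    def evaluate(self, model_id: str, eval_version: str) -> EvalResult:
        # Regenerating tasks on every run is what makes the eval "dynamic":
        # a fixed snapshot suite can be trained against; a regenerated one decays slower.
        tasks = self.generate_tasks()
        result = EvalResult(eval_version, model_id, self.run_suite(model_id, tasks))
        self.history.append(result)
        return result

    def trigger_fired(self, result: EvalResult) -> bool:
        # Compare against the absolute threshold and the trend across runs,
        # so a capability creeping upward is caught before any single release
        # crosses the line.
        prior = [r.score for r in self.history if r is not result]
        rising = bool(prior) and result.score > max(prior)
        return result.score >= self.threshold or rising
```

The two design choices doing the work here are task regeneration (resisting memorization and gaming) and trend tracking across releases (catching capability creep); both map directly onto the open operational questions below.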

Current beliefs

  • Static dangerous-capability evals decay faster than the standards bodies that adopt them can revise them. ~XX%<why>.
  • Dynamic evaluations require a standing third-party evaluator operating continuously, not just a regulator with checklist authority. ~XX%<why>.
  • <Claim about adversarial / co-evolving eval design.> ~XX%<why>.

Uncertainties

  • What is the right cadence for re-evaluation: per release, per capability cluster, or continuously? Why it matters: determines the operational shape of the regime. (The three options are sketched after this list.)
  • Can dynamic-eval results be used as binding regulatory triggers without becoming gameable? Why it matters: determines whether this is a genuine governance tool or a perpetual research project.
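
The cadence question in the first bullet can be stated as a trigger policy. A minimal sketch, assuming a measurable notion of capability-cluster drift, which is itself an open problem; `reeval_due` and the 0.1 drift threshold are hypothetical:

```python
from enum import Enum

class Cadence(Enum):
    PER_RELEASE = "per_release"              # re-run the suite on every new model version
    PER_CAPABILITY_CLUSTER = "per_cluster"   # re-run when a tracked capability cluster shifts
    CONTINUOUS = "continuous"                # standing evaluation of deployed endpoints

def reeval_due(cadence: Cadence, *, new_release: bool,
               cluster_drift: float, drift_threshold: float = 0.1) -> bool:
    """Decide whether a re-evaluation is due under each cadence option.

    `cluster_drift` is a stand-in for some measured shift in a capability
    cluster (benchmark deltas, usage telemetry); the threshold is arbitrary
    and illustrative.
    """
    if cadence is Cadence.PER_RELEASE:
        return new_release
    if cadence is Cadence.PER_CAPABILITY_CLUSTER:
        return cluster_drift >= drift_threshold
    return True  # CONTINUOUS: always due; the real question becomes sampling rate
```

Note how the second uncertainty bites here: whichever branch becomes a binding regulatory trigger is the one labs have the strongest incentive to game.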

What would update me

  • A standards body adopting a meaningfully dynamic eval regime (rather than a one-shot eval list) would prove the design space is real.
  • Public evidence that a frontier model passed the relevant eval and then exhibited the underlying capability anyway in deployment.

Recent reading

  • <date><title><takeaway>.

Related writing

No essays tagged with this topic yet.
