Evaluations

Dangerous-capability evals, autonomy and replication evals, red-teaming — the empirical backbone of every safety case.

Technical Safety·Exploring·Last reviewed May 1, 2026

This page is a stub. I’ve marked the territory but haven’t written my views here yet. The headings below are placeholders — the actual beliefs, uncertainties, and evidence are still in my notes. If you want my current take on this topic before it lands here, get in touch.

Where I currently stand

<Headline view: the eval ecosystem (METR, Apollo, AISI, lab internal evals), what it can and can't tell you, and the role of evals in the safety case. The relationship to dynamic evaluations as a governance instrument lives in the technical-governance column.>

Current beliefs

  • Most current dangerous-capability evals systematically under-elicit; "the model can't do X" is rarely an actionable conclusion until elicitation has been pushed hard. ~XX%<why>.
  • Eval results reported without their elicitation methodology are uninterpretable. ~XX%<why>.
  • <Claim about the role of third-party evaluators.> ~XX%<why>.
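The under-elicitation point above can be made concrete with a sketch. The idea: a "can the model do X?" number is only a lower bound under one elicitation config, so a capability report should sweep a ladder of configs (more few-shot examples, tool access, best-of-n sampling) and headline the maximum. Everything here is illustrative, not any real eval harness's API: `ElicitationConfig`, `model.attempt`, and `task.check` are hypothetical names standing in for whatever harness you use.

```python
# Sketch: capability as the max over an elicitation ladder, not a single run.
# All names (ElicitationConfig, model.attempt, task.check) are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ElicitationConfig:
    name: str
    few_shot_examples: int
    tools_enabled: bool
    samples_per_task: int  # best-of-n sampling budget

# Ordered from weakest to strongest elicitation effort.
ELICITATION_CONFIGS = [
    ElicitationConfig("zero-shot", 0, False, 1),
    ElicitationConfig("few-shot", 5, False, 1),
    ElicitationConfig("few-shot+tools", 5, True, 1),
    ElicitationConfig("few-shot+tools, best-of-16", 5, True, 16),
]

def solve_rate(model, tasks, cfg: ElicitationConfig) -> float:
    """Fraction of tasks solved under one elicitation config.
    A task counts as solved if any of the n samples passes its checker."""
    solved = 0
    for task in tasks:
        attempts = (
            model.attempt(task,
                          few_shot=cfg.few_shot_examples,
                          tools=cfg.tools_enabled)
            for _ in range(cfg.samples_per_task)
        )
        if any(task.check(a) for a in attempts):
            solved += 1
    return solved / len(tasks)

def capability_report(model, tasks) -> dict[str, float]:
    """Solve rate per config. The safety-relevant headline is the max:
    a low number under weak elicitation says little about what the
    model can do when pushed."""
    return {cfg.name: solve_rate(model, tasks, cfg) for cfg in ELICITATION_CONFIGS}
```

The point of the sweep is that "can't do X" claims are only meaningful relative to the strongest rung you tried, which is why eval results without the elicitation methodology attached are hard to interpret.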

Uncertainties

  • Are agentic-task evals the right operational measure, or do they misallocate effort? Why it matters: shapes where third-party evaluators should focus.
  • How much do evals generalise from one frontier model family to the next? Why it matters: determines whether eval suites can be re-used across releases.

What would update me

  • Repeated cases where models pass an eval but fail in deployment on materially similar tasks would make me more pessimistic about eval validity.
  • A standards body publishing a methodologically rigorous eval taxonomy would change how I think about the path to operational use.

Recent reading

  • <date><title><takeaway>.

Related writing

No essays tagged with this topic yet.

Related regions