Technical Safety · Exploring · Last reviewed May 1, 2026
This page is a stub. I’ve marked the territory but haven’t written my views here yet. The headings below are placeholders — the actual beliefs, uncertainties, and evidence are still in my notes. If you want my current take on this topic before it lands here, get in touch.
Where I currently stand
<Headline view: debate, weak-to-strong, recursive reward modelling, deliberative alignment — which of these you take seriously, and why. Likely a hedged view since this is a research-direction question rather than a concrete-claims question.>
Current beliefs
- Weak-to-strong generalisation is currently more useful as a measurement tool than as an alignment method. ~XX% — <why>.
- <Claim about debate / RRM / DA.> ~XX% — <why>.
- <Claim about how oversight methods compose with control.> ~XX% — <why>.
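For context on the first bullet: weak-to-strong experiments are usually summarised by a "performance gap recovered" (PGR) metric — the fraction of the gap between the weak supervisor and the strong model's ceiling that is recovered when the strong model is trained on weak labels. A minimal sketch of that measurement, with made-up illustrative accuracies (not results from any paper):

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling accuracy gap recovered by weak supervision.

    PGR = 1.0: weak supervision fully elicited the strong model's capability.
    PGR = 0.0: the strong student merely matched its weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Illustrative numbers: weak supervisor 60%, strong student trained on
# weak labels 75%, strong ceiling (ground-truth fine-tune) 90%.
print(performance_gap_recovered(0.60, 0.75, 0.90))  # → 0.5
```

The point of the "measurement tool" framing is that PGR characterises how much capability weak supervision elicits; it does not by itself tell you the elicited behaviour is aligned.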
Uncertainties
- Which oversight methods survive when the gap between supervisor and supervised becomes large? Why it matters: the whole agenda rests on this set of surviving methods being non-empty.
- Is "scalable oversight" the right framing, or should the field move to "scalable specification"? Why it matters: changes which research bets pay off.
What would update me
- A clean empirical demonstration that one oversight method survives a meaningful capability gap on a non-toy task.
- Replication failures of high-profile scalable-oversight results.
Recent reading
- <date> — <title> — <takeaway>.
Related writing
No essays tagged with this topic yet.