Scalable Oversight

Training and evaluation methods that remain reliable as model capabilities outrun direct human evaluation.

Technical Safety · Exploring · Last reviewed May 1, 2026

This page is a stub. I’ve marked the territory but haven’t written my views here yet. The headings below are placeholders — the actual beliefs, uncertainties, and evidence are still in my notes. If you want my current take on this topic before it lands here, get in touch.

Where I currently stand

<Headline view: debate, weak-to-strong, recursive reward modelling, deliberative alignment — which of these you take seriously, and why. Likely a hedged view since this is a research-direction question rather than a concrete-claims question.>

Current beliefs

  • Weak-to-strong generalisation is currently more useful as a measurement tool than as an alignment method. ~XX% <why>.
  • <Claim about debate / RRM / DA.> ~XX% <why>.
  • <Claim about how oversight methods compose with control.> ~XX% <why>.
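On the "measurement tool" point above: the usual way weak-to-strong experiments are scored is performance gap recovered (PGR), the fraction of the gap between a weak supervisor and a strong model's ceiling that is closed when the strong model is trained on the weak supervisor's labels. A minimal sketch, with hypothetical accuracy numbers chosen purely for illustration:

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, strong_acc: float) -> float:
    """Fraction of the weak-to-strong gap recovered.

    weak_acc:   weak supervisor's accuracy on ground truth
    w2s_acc:    strong student's accuracy when trained on weak labels
    strong_acc: strong model's ceiling (trained on ground truth)

    PGR = 1.0 means weak supervision elicited the strong model's
    full capability; PGR = 0.0 means it only matched the supervisor.
    """
    gap = strong_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (w2s_acc - weak_acc) / gap


# Hypothetical numbers: weak 60%, weak-to-strong student 75%, ceiling 90%.
print(performance_gap_recovered(0.60, 0.75, 0.90))  # → 0.5
```

A PGR well below 1.0 on a task family is evidence about the supervisor-student gap, which is why it works as a measurement even where it fails as a method.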

Uncertainties

  • Which oversight methods survive when the gap between supervisor and supervised becomes large? Why it matters: the whole agenda is conditioned on this set being non-empty.
  • Is "scalable oversight" the right framing, or should the field move to "scalable specification"? Why it matters: changes which research bets pay off.

What would update me

  • A clean empirical demonstration that one oversight method survives a meaningful capability gap on a non-toy task.
  • Replication failures of high-profile scalable-oversight results.

Recent reading

  • <date> · <title> · <takeaway>.

Related writing

No essays tagged with this topic yet.

Related regions