Mechanistic Interpretability

How well we can read what models are doing internally, and how much that buys us in deployment.

Technical Safety·Exploring·Last reviewed May 1, 2026

This page is a stub. I’ve marked the territory but haven’t written my views here yet. The headings below are placeholders — the actual beliefs, uncertainties, and evidence are still in my notes. If you want my current take on this topic before it lands here, get in touch.

Where I currently stand

<Likely something on: mech interp is a high-variance bet; the safety case it underwrites is "we can detect bad cognition before it becomes bad action"; current results are exciting but not yet robust enough to lean on operationally. Edit freely.>

Current beliefs

  • Mech interp is necessary but not sufficient for a deployment-time safety case. ~XX% — interpretability findings are too brittle and too narrow to be the load-bearing element of a safety case yet.
  • The most useful near-term mech interp is the kind that supports monitoring, not the kind that aims for full mechanistic understanding. ~XX%<why>.
  • <Your view on SAEs / circuits / probes.> ~XX%<why>.
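
To make the "monitoring, not full understanding" distinction concrete, here is a toy sketch of the cheapest monitoring-style tool: a difference-of-means linear probe. Everything in it is synthetic and illustrative — the activation dimension, the planted feature direction, and the separation strength are made-up stand-ins, not numbers from any real model.

```python
import numpy as np

# Toy sketch: a difference-of-means probe on synthetic "activations".
# All quantities here (dimension, planted direction, separation) are
# illustrative assumptions, not measurements from a real model.
rng = np.random.default_rng(0)
d = 64                         # hypothetical activation dimension
n = 500                        # samples per class
feature = rng.normal(size=d)   # a planted "concept" direction
feature /= np.linalg.norm(feature)

# Class 1 activations carry the planted direction; class 0 do not.
acts0 = rng.normal(size=(n, d))
acts1 = rng.normal(size=(n, d)) + 3.0 * feature

# Difference-of-means probe: no gradient training, just two averages.
probe = acts1.mean(axis=0) - acts0.mean(axis=0)
probe /= np.linalg.norm(probe)

# Monitor held-out activations by projecting onto the probe direction.
test0 = rng.normal(size=(n, d))
test1 = rng.normal(size=(n, d)) + 3.0 * feature
threshold = 0.5 * ((acts0 @ probe).mean() + (acts1 @ probe).mean())
acc = ((test0 @ probe < threshold).mean()
       + (test1 @ probe > threshold).mean()) / 2
print(f"held-out probe accuracy: {acc:.2f}")
```

The point of the sketch is the asymmetry it illustrates: a probe like this can flag a direction of interest at deployment time without explaining any of the circuitry that produces it — which is exactly why it is the kind of interp a monitoring story can lean on first.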

Uncertainties

  • Does mech interp scale to frontier-size models in a deployment-relevant way, or stay a research toy? Why it matters: determines whether interp is part of the operational safety story or only the science one.
  • Are the abstractions we extract real features of the model's computation or artefacts of our methods? Why it matters: if the latter, interp findings don't generalise out of the lab.
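
The second worry — method artefacts masquerading as features — has a simple statistical face. The sketch below (entirely synthetic; no real model is involved) fits the same difference-of-means probe on pure noise with random labels: in high dimension with few samples it separates its training data almost perfectly, then falls to chance on fresh data.

```python
import numpy as np

# Illustrative only: with many dimensions and few samples, a probe
# "finds" a separating direction in pure noise - an artefact of the
# method, not a feature of any computation.
rng = np.random.default_rng(0)
d, n = 512, 20                          # dims >> samples, on purpose
acts = rng.normal(size=(n, d))          # pure noise "activations"
labels = rng.integers(0, 2, size=n)     # random labels

# Difference-of-means probe fit on the noise.
probe = (acts[labels == 1].mean(axis=0)
         - acts[labels == 0].mean(axis=0))

# The probe separates its own training data almost perfectly...
train_acc = ((acts @ probe > 0) == labels).mean()

# ...but is at chance on fresh noise with fresh random labels.
new_acts = rng.normal(size=(1000, d))
new_labels = rng.integers(0, 2, size=1000)
test_acc = ((new_acts @ probe > 0) == new_labels).mean()
print(f"train {train_acc:.2f}  test {test_acc:.2f}")
```

Held-out validation catches the overfitting in this toy case; the harder version of the question is whether the sanity checks we run on interp findings are strong enough to catch the analogous artefacts at frontier scale.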

What would update me

  • A demonstrated interpretability-based detector that catches real misalignment before any behavioural symptoms appear would meaningfully strengthen the case.
  • Repeated failures to robustly localise even simple behaviours in frontier-scale models would push me to treat mech interp as primarily a science project rather than an operational one.

Recent reading

  • <date><title><takeaway>.

Related writing

No essays tagged with this topic yet.

Related regions