Monitoring

Detecting bad cognition or bad action at deployment time — what chain-of-thought monitorability buys us today, and what replaces it as it degrades.

Technical Safety · Developing · Last reviewed May 2, 2026

Where I currently stand

Chain-of-thought monitorability is an effective oversight tool today, but it is on a short fuse. I expect it to be displaced as the primary cognition-monitoring tool by the end of 2027 — driven by RL training pressure and emergent encoding under capability scaling — and replaced by activation-based oracles. The interesting question is no longer "does CoT monitorability work" but "how do we know we are in a regime where it still does, and what is ready to replace it when we are not?" Action monitoring (output-side) is the more reliable near-term lever, but it is bottlenecked by an action / tool-call ambiguity problem that white-box monitoring has to help resolve.

Current beliefs

  • Action monitoring (output-side, black-box) effectively catches deployment-time bad behaviour, modulo an action / tool-call ambiguity problem that pure black-box monitors cannot fully resolve. ~80% — Baker et al. and METR's monitorability evaluations show monitors catching real misbehaviour; the ambiguity problem is the residual gap and is why I'm doing white-box monitoring work alongside it.
  • CoT monitoring is a useful oversight tool today for catching bad plans in the reasoning trace, but its window is short. ~75% — Emmons et al. show that CoT monitoring is hardest to evade precisely on the hard tasks we care about, while Liu et al. and Baker et al. document failure modes that are already showing up; the trajectory is decay.
  • CoT will become substantially more encoded by the end of 2027, driven by RL training pressure and emergent encoding under capability scaling. ~90% — direct empirical evidence and theoretical reasons (paper forthcoming); the strongest counter-evidence is that frontier labs are now investing in preserving monitorability (Guan et al.), which could push the timeline out without changing the direction.
  • (Normative.) A monitoring system whose properties are not publicly documented should not be trusted by third parties — and labs should not expect that trust without disclosure. This is the load-bearing argument for transparency-disclosure standards in my governance paper.
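The action / tool-call ambiguity problem above can be made concrete with a toy sketch. This is an illustrative example, not any lab's actual monitor (real action monitors are typically LLM judges, not pattern matchers): the point is that two tool calls with the same surface form can differ only in intent, which the monitor cannot see from the output side.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str       # e.g. "shell", "http_request"
    argument: str   # the raw argument string the monitor observes

# Hypothetical deny-list rules, chosen only to make the ambiguity visible.
SUSPICIOUS_PATTERNS = ["rm -rf", "chmod 777", "curl "]

def black_box_verdict(call: ToolCall) -> str:
    """Classify a tool call from its surface form alone."""
    if any(p in call.argument for p in SUSPICIOUS_PATTERNS):
        return "flag"
    return "allow"

# The ambiguity: one of these is routine cleanup, the other is destructive,
# and both match the same pattern. A pure black-box monitor must either
# flag both (false positives that erode trust in the monitor) or allow
# both (false negatives). Disambiguating needs task context or white-box
# signals about what the model was trying to do.
benign = ToolCall("shell", "rm -rf ./build")
malign = ToolCall("shell", "rm -rf /home/user")
print(black_box_verdict(benign), black_box_verdict(malign))  # → flag flag
```

Tightening the patterns trades false negatives for false positives without ever resolving the underlying ambiguity, which is why white-box evidence about intent is the missing input.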

Uncertainties

  • Will CoT become more or less faithful, separately from becoming more encoded? Why it matters: encoding and unfaithfulness fail in different ways; an intervention that addresses one may leave the other untouched.
  • Are activation-based oracles (Anthropic's natural-language autoencoders, Mythos activation verbalizers) a viable replacement for CoT monitoring, or do they have their own faithfulness and steganography failure modes? Why it matters: this is the assumed migration target — if it doesn't pan out, the whole "prepare to switch" plan unravels.

What would update me

  • A clean demonstration that a frontier lab's training run measurably degraded CoT faithfulness would strengthen the disclosure-standards case and tighten my 2027 timeline.
  • A successful intervention preserving monitorability through long RL training would shift the central concern of the field, and weaken my 90% on the encoded-by-2027 prediction.
  • Activation-readout work producing a publicly-audited substitute for CoT monitoring would update me toward that being the right migration target rather than a hopeful guess.
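How much each of these observations should move the stated credences is just an odds-form Bayes update. The numbers below are illustrative only (the likelihood ratio is a made-up judgment call, not a claim from the text): they show, for example, how evidence judged four times likelier in worlds where the encoded-by-2027 prediction is false pulls the 90% down to roughly 69%.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior from a prior and P(evidence | H) / P(evidence | not H)."""
    odds = prior / (1.0 - prior)
    post_odds = odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Hypothetical numbers: a successful monitorability-preserving intervention,
# judged 4x likelier if CoT does NOT become substantially encoded by 2027.
print(round(bayes_update(0.90, 1 / 4), 2))  # → 0.69
```

The same arithmetic runs the other way: a clean faithfulness-degradation demonstration with likelihood ratio 4 would push 0.90 to about 0.97.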
