Monitoring
DevelopingDetecting bad cognition or bad action at deployment time — what chain-of-thought monitorability buys us today, and what replaces it as it degrades.
Last update · May 2, 2026
The engineering
Detecting bad cognition or bad action at deployment time — what chain-of-thought monitorability buys us today, and what replaces it as it degrades.
Last update · May 2, 2026
The persona frame as a lens on LLM generalisation — what model specs and character training shape, and what falls out for probing and steering.
Last update · May 2, 2026
Territory marked, views not yet written
How fast frontier systems are becoming generally capable, and what kind of capabilities matter for safety.
Territory marked · views not yet written
How well we can read what models are doing internally, and how much that buys us in deployment.
Territory marked · views not yet written
Containing models we don't trust to align, using protocols that work even if alignment fails.
Territory marked · views not yet written
Training and evaluation methods that remain reliable as model capabilities outrun direct human evaluation.
Territory marked · views not yet written
Dangerous-capability evals, autonomy and replication evals, red-teaming — the empirical backbone of every safety case.
Territory marked · views not yet written
Safety properties of systems where multiple AI agents interact — emergent dynamics, collusion, and coordination failure modes.
Territory marked · views not yet written