Technical Safety

The engineering

Monitoring

Detecting bad cognition or bad action at deployment time — what chain-of-thought monitorability buys us today, and what replaces it as it degrades.

Last update · May 2, 2026

Personas

The persona frame as a lens on LLM generalisation — what model specs and character training shape, and what falls out for probing and steering.

Last update · May 2, 2026

Stubs

Territory marked, views not yet written

Capabilities

How fast frontier systems are becoming generally capable, and what kind of capabilities matter for safety.

Territory marked · views not yet written

Mechanistic Interpretability

How well we can read what models are doing internally, and how much that buys us in deployment.

Territory marked · views not yet written

AI Control

Containing models we don't trust to align, using protocols that work even if alignment fails.

Territory marked · views not yet written

Scalable Oversight

Training and evaluation methods that remain reliable as model capabilities outrun direct human evaluation.

Territory marked · views not yet written

Evaluations

Dangerous-capability evals, autonomy and replication evals, red-teaming — the empirical backbone of every safety case.

Territory marked · views not yet written

Multi-Agent Systems

Safety properties of systems where multiple AI agents interact — emergent dynamics, collusion, and coordination failure modes.

Territory marked · views not yet written