Where I currently stand
The persona frame is a useful lens on how generalisation works in language models: the right unit of analysis is "what trait did the model learn as which persona", not "what trait did the model learn". I'm about 65% confident this is the dominant generalisation mechanism in practice; my own forthcoming work tests it directly, and I can see it turning out otherwise. The pragmatic safety leverage is not in mapping the persona landscape mechanistically (that isn't pragmatic interpretability and won't shift training decisions) but in deliberately shaping the landscape upstream through model specs and character training, and in understanding how that shape changes downstream probing and steering.
Current beliefs
- Personas are an inherent feature of frontier LLMs, emerging from training on human-preference data; models are simulators in a substantive sense, not just metaphorically. ~85%: the simulator frame (Janus's Simulators, the Persona Selection Model) explains a wide range of generalisation and interpretability findings; the Assistant Axis work makes it concretely measurable.
- Trait generalisation in LLMs is mediated by personas: models don't learn traits abstractly; they learn traits-in-the-context-of-a-persona, and this has downstream consequences for probing, steering, and safety. ~65%: the Persona Vectors and Assistant Axis lines of work are suggestive, but I don't yet have evidence that this is the dominant mechanism rather than one mechanism among several. My own forthcoming paper is built around testing exactly this claim, and I can see it turning out otherwise; a rough sketch of the kind of test I mean follows this list.
- Mechanistic mapping of the persona manifold for its own sake is not pragmatic interpretability and is not a primary safety lever; the leverage is in deliberately shaping the persona landscape through model specs and character training, and in studying how that shape changes the behaviour of probing and steering interventions. ~75%: the model-spec / Constitution-style approach is upstream of where mechanistic work could intervene, and better persona-manifold maps are scientifically interesting but unlikely to shift training decisions or policy outcomes.
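As a concrete version of the test belief #2 calls for, here is a minimal sketch of checking whether a trait direction extracted under one persona context transfers to another. It assumes a hypothetical collect_activations(prompts, persona) helper that returns per-prompt residual-stream activations under a given persona system prompt; the difference-of-means construction is a generic one, not the procedure of any particular paper.

```python
# Sketch only: does the same trait point in the same direction under different personas?
import numpy as np

def trait_direction(pos_prompts, neg_prompts, persona, collect_activations):
    """Difference-of-means trait direction, conditioned on one persona context."""
    pos = collect_activations(pos_prompts, persona).mean(axis=0)  # trait-expressing prompts
    neg = collect_activations(neg_prompts, persona).mean(axis=0)  # trait-suppressing prompts
    v = pos - neg
    return v / np.linalg.norm(v)

def cross_persona_similarity(pos_prompts, neg_prompts, personas, collect_activations):
    """Pairwise cosine similarity of one trait's direction across persona contexts."""
    dirs = [trait_direction(pos_prompts, neg_prompts, p, collect_activations) for p in personas]
    return np.array([[float(a @ b) for b in dirs] for a in dirs])
```

If the similarity matrix is uniformly high, the trait is represented largely persona-independently and belief #2 weakens; if it falls off sharply away from the diagonal, the representation is persona-bound.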
Uncertainties
- Does deeply defined character training (the Assistant as a top-level persona from which sub-personas inherit traits) make trait-level representations effectively transferable across the personas the Assistant adopts, making the persona-mediated story more theoretical than practical? Why it matters: if yes, trait-level abstractions are usable in practice and belief #2 weakens significantly; if no, persona context is irreplaceable in any safety intervention, and probes and steering vectors must be persona-conditional to be reliable (the probe version of this question is sketched after this list).
- Will model specs become precise enough that persona-landscape shaping is genuinely a design lever, or will they remain too high-level to predictably constrain post-training behaviour? Why it matters: the case for belief #3 rests on this; if specs don't translate into measurable persona stability, the mechanistic side might be the only remaining safety angle on personas.
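The probe half of the first uncertainty reduces to a transfer experiment. A minimal sketch, assuming activations and binary trait labels have already been collected per persona (that collection step is the hard part and is elided here):

```python
# Sketch only: train a trait probe under one persona, test it under the others.
from sklearn.linear_model import LogisticRegression

def probe_transfer_matrix(data_by_persona):
    """data_by_persona: dict persona -> (activations [n, d], binary trait labels [n])."""
    personas = list(data_by_persona)
    scores = {}
    for train_p in personas:
        X_tr, y_tr = data_by_persona[train_p]
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        for test_p in personas:
            X_te, y_te = data_by_persona[test_p]
            scores[(train_p, test_p)] = probe.score(X_te, y_te)  # accuracy of train_p's probe on test_p
    return scores
```

A flat transfer matrix says one trait probe suffices; a strong diagonal with weak off-diagonal entries says probes have to be persona-conditional to be trusted.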
What would update me
- A direct empirical demonstration that fine-tuning a trait at the Assistant-persona level produces stable trait expression across sub-personas would push me toward "trait-level abstraction is fine in practice" and significantly weaken my 65% on belief #2.
- Evidence that probes or steering vectors generalise robustly across personas without retraining would similarly weaken belief #2 and probably end my interest in this as a load-bearing safety direction; the sketch after this list shows the kind of transfer test I have in mind.
- Documented cases of model-spec divergence (under-specification, ambiguity, or drift) producing measurable persona instability in deployment would strengthen belief #3 and the model-specs-as-lever framing.
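For the steering-vector version of that test, the shape I have in mind is roughly the following. It is a sketch under stated assumptions, not a recipe: layer is whichever transformer block's output gets steered, v is a trait direction obtained elsewhere (e.g. by the difference-of-means construction above), and trait_score is a hypothetical judge of how strongly generations express the trait.

```python
# Sketch only: does one steering vector move behaviour under every persona system prompt?
import torch

def steer_and_score(model, tokenizer, layer, v, alpha, persona_prompts, trait_score):
    """Add alpha * v to `layer`'s output while scoring trait expression per persona."""
    v = v / v.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer only the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.dtype).to(hidden.device)
        return ((steered,) + output[1:]) if isinstance(output, tuple) else steered

    handle = layer.register_forward_hook(hook)
    try:
        # trait_score is assumed to generate under each persona's system prompt
        # and return a scalar measure of trait expression in the samples.
        return {p: trait_score(model, tokenizer, prompts) for p, prompts in persona_prompts.items()}
    finally:
        handle.remove()  # always detach the hook, even if scoring raises
```

Similar score shifts under every persona would count as the cross-persona generalisation described above; persona-specific shifts would mean steering has to be persona-conditional.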
Recent reading
- 2022 — Janus, Simulators (LessWrong) — the foundational framing: LLMs as simulators, personas as simulacra. Theoretical scaffolding for everything below.
- 2025 — nostalgebraist, the void (LessWrong) — extends the simulator frame to Assistant-persona underspecification; argues the Assistant starts ill-defined and concretises through deployment, with implications for character training.
- 2026 — Anthropic, The Persona Selection Model — surveys behavioural, generalisation, and interpretability evidence for the simulator / persona-selection frame; the most direct theoretical grounding for belief #1.
- 2026 — Anthropic, The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models — empirical evidence that the leading component of persona space is "Assistant-likeness"; steering along the axis controls helpful / harmless behaviour. Direct evidence for the "deeply defined Assistant persona" picture that drives the central uncertainty above; my own loose reading of the "leading component" idea is sketched after this list.
- 2025 — Chen et al. (Anthropic), Persona Vectors: Monitoring and Controlling Character Traits in Language Models — automated extraction of trait-direction vectors and their use in monitoring and steering. Important methodological piece; also makes belief #2 testable.
- Forthcoming — Will MacAskill (Forethought), essay on the character of AI (link TBD) — argues character training is the right unit for thinking about long-run AI behaviour; aligns with belief #3.
- Forthcoming — my own paper on trait–persona interaction (ICML workshop submission; arXiv link TBD) — directly tests belief #2.
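For intuition only, my loose reconstruction of the "leading component of persona space" idea is a principal-component computation over persona-conditioned mean activations; this is my own sketch, not the Assistant Axis paper's actual procedure, and persona_means is an assumed [n_personas, d] array built elsewhere.

```python
# Sketch only: a candidate "Assistant axis" as the top principal direction of persona means.
import numpy as np

def leading_persona_component(persona_means):
    """persona_means: [n_personas, d], one row of mean activations per persona context."""
    centered = persona_means - persona_means.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # right singular vectors are unit-norm, so no renormalisation needed
```

Projecting a model state onto this direction gives the kind of scalar "Assistant-likeness" score that the stability questions above are about.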
Related writing
No essays tagged with this topic yet.