Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
2511.01689v1
cs.CL, cs.AI, cs.LG
2025-11-06
Authors:
Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger
Abstract
The character of the "AI assistant" persona generated by modern chatbot large
language models influences both surface-level behavior and apparent values,
beliefs, and ethics. These all affect interaction quality, perceived
intelligence, and alignment with both developer and user intentions. The
shaping of this persona, known as character training, is a critical component
of industry post-training, yet remains effectively unstudied in the academic
literature. We introduce the first open implementation of character training,
leveraging Constitutional AI and a new data pipeline using synthetic
introspective data to shape the assistant persona in a more effective and
controlled manner than alternatives such as constraining system prompts or
activation steering. Specifically, we fine-tune three popular open-weights
models using 11 example personas, such as humorous, deeply caring, or even
malevolent. To track the effects of our approach, we introduce a method which
analyzes revealed preferences, uncovering clear and holistic changes in
character. We find these changes are more robust to adversarial prompting than
the above two alternatives, while also leading to more coherent and realistic
generations. Finally, we demonstrate this fine-tuning has little to no effect
on general capabilities as measured by common benchmarks. We describe and
open-source our full post-training method, the implementation of which can be
found at https://github.com/maiush/OpenCharacterTraining.