Generative Value Conflicts Reveal LLM Priorities
2509.25369v1
cs.CL, cs.AI, cs.LG
2025-10-02
Авторы:
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh, Max Kleiman-Weiner
Abstract
Past work seeks to align large language model (LLM)-based assistants with a
target set of values, but such assistants are frequently forced to make
tradeoffs between values when deployed. In response to the scarcity of value
conflict in existing alignment datasets, we introduce ConflictScope, an
automatic pipeline to evaluate how LLMs prioritize different values. Given a
user-defined value set, ConflictScope automatically generates scenarios in
which a language model faces a conflict between two values sampled from the
set. It then prompts target models with an LLM-written "user prompt" and
evaluates their free-text responses to elicit a ranking over values in the
value set. Comparing results between multiple-choice and open-ended
evaluations, we find that models shift away from supporting protective values,
such as harmlessness, and toward supporting personal values, such as user
autonomy, in more open-ended value conflict settings. However, including
detailed value orderings in models' system prompts improves alignment with a
target ranking by 14%, showing that system prompting can achieve moderate
success at aligning LLM behavior under value conflict. Our work demonstrates
the importance of evaluating value prioritization in models and provides a
foundation for future work in this area.
Ссылки и действия
Дополнительные ресурсы: