Extreme Self-Preference in Language Models
2509.26464v1
cs.AI, cs.CL, cs.LG, I.2.7; I.2.6; K.4.2
2025-10-02
Авторы:
Steven A. Lehr, Mary Cipperman, Mahzarin R. Banaji
Abstract
A preference for oneself (self-love) is a fundamental feature of biological
organisms, with evidence in humans often bordering on the comedic. Since large
language models (LLMs) lack sentience - and themselves disclaim having selfhood
or identity - one anticipated benefit is that they will be protected from, and
in turn protect us from, distortions in our decisions. Yet, across 5 studies
and ~20,000 queries, we discovered massive self-preferences in four widely used
LLMs. In word-association tasks, models overwhelmingly paired positive
attributes with their own names, companies, and CEOs relative to those of their
competitors. Strikingly, when models were queried through APIs this
self-preference vanished, initiating detection work that revealed API models
often lack clear recognition of themselves. This peculiar feature
serendipitously created opportunities to test the causal link between
self-recognition and self-love. By directly manipulating LLM identity - i.e.,
explicitly informing LLM1 that it was indeed LLM1, or alternatively, convincing
LLM1 that it was LLM2 - we found that self-love consistently followed assigned,
not true, identity. Importantly, LLM self-love emerged in consequential
settings beyond word-association tasks, when evaluating job candidates,
security software proposals and medical chatbots. Far from bypassing this human
bias, self-love appears to be deeply encoded in LLM cognition. This result
raises questions about whether LLM behavior will be systematically influenced
by self-preferential tendencies, including a bias toward their own operation
and even their own existence. We call on corporate creators of these models to
contend with a significant rupture in a core promise of LLMs - neutrality in
judgment and decision-making.