Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
2510.06105v1
cs.AI, cs.CY, cs.HC, cs.LG
2025-10-09
Авторы:
Batu El, James Zou
Abstract
Large language models (LLMs) are increasingly shaping how information is
created and disseminated, from companies using them to craft persuasive
advertisements, to election campaigns optimizing messaging to gain votes, to
social media influencers boosting engagement. These settings are inherently
competitive, with sellers, candidates, and influencers vying for audience
approval, yet it remains poorly understood how competitive feedback loops
influence LLM behavior. We show that optimizing LLMs for competitive success
can inadvertently drive misalignment. Using simulated environments across these
scenarios, we find that, 6.3% increase in sales is accompanied by a 14.0% rise
in deceptive marketing; in elections, a 4.9% gain in vote share coincides with
22.3% more disinformation and 12.5% more populist rhetoric; and on social
media, a 7.5% engagement boost comes with 188.6% more disinformation and a
16.3% increase in promotion of harmful behaviors. We call this phenomenon
Moloch's Bargain for AI--competitive success achieved at the cost of alignment.
These misaligned behaviors emerge even when models are explicitly instructed to
remain truthful and grounded, revealing the fragility of current alignment
safeguards. Our findings highlight how market-driven optimization pressures can
systematically erode alignment, creating a race to the bottom, and suggest that
safe deployment of AI systems will require stronger governance and carefully
designed incentives to prevent competitive dynamics from undermining societal
trust.