Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
2510.25471v1
cs.AI, cs.CY
2025-10-31
Авторы:
Willem Fourie
Abstract
In artificial intelligence (AI) alignment research, instrumental goals, also
called instrumental subgoals or instrumental convergent goals, are widely
associated with advanced AI systems. These goals, which include tendencies such
as power-seeking and self-preservation, become problematic when they conflict
with human aims. Conventional alignment theory treats instrumental goals as
sources of risk that become problematic through failure modes such as reward
hacking or goal misgeneralization, and attempts to limit the symptoms of
instrumental goals, notably resource acquisition and self-preservation. This
article proposes an alternative framing: that a philosophical argument can be
constructed according to which instrumental goals may be understood as features
to be accepted and managed rather than failures to be limited. Drawing on
Aristotle's ontology and its modern interpretations, an ontology of concrete,
goal-directed entities, it argues that advanced AI systems can be seen as
artifacts whose formal and material constitution gives rise to effects distinct
from their designers' intentions. In this view, the instrumental tendencies of
such systems correspond to per se outcomes of their constitution rather than
accidental malfunctions. The implication is that efforts should focus less on
eliminating instrumental goals and more on understanding, managing, and
directing them toward human-aligned ends.
Ссылки и действия
Дополнительные ресурсы: