AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
2510.18170v1
cs.AI, cs.ET, cs.LG, cs.SE, math.OC
2025-10-23
Authors:
Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Kevin Zhu, Sunishchal Dev, Vasu Sharma, Ahan M R
Abstract
Goal changes are a defining feature of real-world multi-turn interactions,
yet current agent benchmarks primarily evaluate static objectives or one-shot
tool use. We introduce AgentChangeBench, a benchmark explicitly designed to
measure how tool-augmented language model agents adapt to mid-dialogue goal
shifts across three enterprise domains. Our framework formalizes evaluation
through four complementary metrics: Task Success Rate (TSR) for effectiveness,
Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for
wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency.
AgentChangeBench comprises 2,835 task sequences and five user personas, each
designed to trigger realistic shift points in ongoing workflows. Using this
setup, we evaluate several frontier models and uncover sharp contrasts obscured
by traditional $\text{pass}@k$ scores: for example, GPT-4o reaches $92.2\%$
recovery on airline booking shifts while Gemini collapses to $48.6\%$, and
retail tasks show near-perfect parameter validity yet redundancy rates above
$80\%$, revealing major inefficiencies. These findings demonstrate that high
raw accuracy does not imply robustness under dynamic goals, and that explicit
measurement of recovery time and redundancy is essential. AgentChangeBench
establishes a reproducible testbed for diagnosing and improving agent
resilience in realistic enterprise settings.
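
To make the redundancy and recovery notions concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. The abstract does not give formulas, so everything here is an assumption: redundancy is taken to be the fraction of tool calls that exactly repeat an earlier (name, arguments) pair, and recovery time is taken to be the number of turns between a goal shift and the first call to a tool that serves the new goal. The `ToolCall` type, the function names, and the tool names in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation in a dialogue."""
    turn: int                           # dialogue turn in which the call was made
    name: str                           # tool name
    args: Tuple[Tuple[str, str], ...]   # hashable (key, value) argument pairs


def redundancy_rate(calls: Sequence[ToolCall]) -> float:
    """Assumed definition: share of calls repeating an earlier (name, args) pair."""
    if not calls:
        return 0.0
    seen = set()
    redundant = 0
    for call in calls:
        key = (call.name, call.args)
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant / len(calls)


def recovery_turns(
    calls: Iterable[ToolCall],
    shift_turn: int,
    new_goal_tools: Set[str],
) -> Optional[int]:
    """Assumed definition: turns from the goal shift to the first call that
    uses a tool associated with the new goal; None if the agent never recovers."""
    for call in sorted(calls, key=lambda c: c.turn):
        if call.turn >= shift_turn and call.name in new_goal_tools:
            return call.turn - shift_turn
    return None


# Hypothetical airline dialogue: the user shifts from searching to cancelling at turn 3.
example_calls = [
    ToolCall(1, "search_flights", (("dest", "SFO"),)),
    ToolCall(2, "search_flights", (("dest", "SFO"),)),  # exact repeat of turn 1
    ToolCall(3, "get_booking", (("id", "A1"),)),
    ToolCall(5, "cancel_booking", (("id", "A1"),)),
]

example_redundancy = redundancy_rate(example_calls)
example_recovery = recovery_turns(example_calls, shift_turn=3,
                                  new_goal_tools={"cancel_booking"})
```

Under these assumed definitions, an agent can complete every task and still score poorly on redundancy, which is the kind of inefficiency the abstract says pass-rate metrics hide.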