HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks
2510.15144v1
cs.AI, cs.CL, cs.CY
2025-10-21
Authors:
Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson
Abstract
Simulating human reasoning in open-ended tasks has been a long-standing
aspiration in AI and cognitive science. While large language models now
approximate human responses at scale, they remain tuned to population-level
consensus, often erasing the individuality of reasoning styles and belief
trajectories. To advance the vision of more human-like reasoning in machines,
we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for
average-to-individual reasoning adaptation. The task is to predict how a
specific person would reason and update their beliefs in novel scenarios, given
partial evidence of their past views. HugAgent adopts a dual-track design: a
synthetic track for scale and systematic stress tests, and a human track for
ecologically valid, "out-loud" reasoning data. This design enables scalable,
reproducible evaluation of intra-agent fidelity: whether models can capture not
just what people believe, but how their reasoning evolves. Experiments with
state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent
as the first extensible benchmark for aligning machine reasoning with the
individuality of human thought. Our benchmark and chatbot are open-sourced as
HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking
(https://anonymous.4open.science/r/trace-your-thinking).