HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks

2510.15144v1 cs.AI, cs.CL, cs.CY 2025-10-21

Authors:

Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson

Abstract

Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).



