CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions
2510.26852v1
cs.AI, cs.CL
2025-11-04
Авторы:
Lingyue Fu, Xin Ding, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, Yong Yu
Abstract
Large Language Model (LLM) agents have evolved from basic text generation to
autonomously completing complex tasks through interaction with external tools.
However, current benchmarks mainly assess end-to-end performance in fixed
scenarios, restricting evaluation to specific skills and suffering from score
saturation and growing dependence on expert annotation as agent capabilities
improve. In this work, we emphasize the importance of learning ability,
including both self-improvement and peer-learning, as a core driver for agent
evolution toward human-level intelligence. We propose an iterative, competitive
peer-learning framework, which allows agents to refine and optimize their
strategies through repeated interactions and feedback, thereby systematically
evaluating their learning capabilities. To address the score saturation issue
in current benchmarks, we introduce CATArena, a tournament-style evaluation
platform featuring four diverse board and card games with open-ended scoring.
By providing tasks without explicit upper score limits, CATArena enables
continuous and dynamic evaluation of rapidly advancing agent capabilities.
Experimental results and analyses involving both minimal and commercial code
agents demonstrate that CATArena provides reliable, stable, and scalable
benchmarking for core agent abilities, particularly learning ability and
strategy coding.
Ссылки и действия
Дополнительные ресурсы: