AcademicEval: Live Long-Context LLM Benchmark
2510.17725v1
cs.CL, cs.AI, cs.LG
2025-10-22
Авторы:
Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You
Abstract
Large Language Models (LLMs) have recently achieved remarkable performance in
long-context understanding. However, current long-context LLM benchmarks are
limited by rigid context length, labor-intensive annotation, and the pressing
challenge of label leakage issues during LLM training. Therefore, we propose
\textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context
generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce
several academic writing tasks with long-context inputs, \textit{i.e.},
\textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related
Work}, which cover a wide range of abstraction levels and require no manual
labeling. Moreover, \textsc{AcademicEval} integrates high-quality and
expert-curated few-shot demonstrations from a collected co-author graph to
enable flexible context length. Especially, \textsc{AcademicEval} features an
efficient live evaluation, ensuring no label leakage. We conduct a holistic
evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs
perform poorly on tasks with hierarchical abstraction levels and tend to
struggle with long few-shot demonstrations, highlighting the challenge of our
benchmark. Through experimental analysis, we also reveal some insights for
enhancing LLMs' long-context modeling capabilities. Code is available at
https://github.com/ulab-uiuc/AcademicEval
Ссылки и действия
Дополнительные ресурсы: