LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
2510.03827v1
cs.CV, cs.RO
2025-10-08
Авторы:
Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun
Abstract
LIBERO has emerged as a widely adopted benchmark for evaluating
Vision-Language-Action (VLA) models; however, its current training and
evaluation settings are problematic, often leading to inflated performance
estimates and preventing fair model comparison. To address these issues, we
introduce LIBERO-PRO, an extended LIBERO benchmark that systematically
evaluates model performance under reasonable perturbations across four
dimensions: manipulated objects, initial states, task instructions, and
environments. Experimental results reveal that, although existing models
achieve over 90% accuracy under the standard LIBERO evaluation, their
performance collapses to 0.0% under our generalized setting. Crucially, this
discrepancy exposes the models' reliance on rote memorization of action
sequences and environment layouts from the training set, rather than genuine
task understanding or environmental perception. For instance, models persist in
executing grasping actions when the target object is replaced with irrelevant
items, and their outputs remain unchanged even when given corrupted
instructions or even messy tokens. These findings expose the severe flaws in
current evaluation practices, and we call on the community to abandon
misleading methodologies in favor of robust assessments of model generalization
and comprehension. Our code is available at:
https://github.com/Zxy-MLlab/LIBERO-PRO.
Ссылки и действия
Дополнительные ресурсы: