Mitigating Coordinate Prediction Bias from Positional Encoding Failures
2510.22102v1
cs.CV, cs.AI, cs.CL
2025-10-29
Authors:
Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, Jing Tang
Abstract
Multimodal large language models (MLLMs) excel at vision-language tasks such
as VQA and document understanding, yet precise coordinate prediction remains
challenging. High-resolution inputs exacerbate this difficulty by producing
long token sequences that weaken positional encodings and introduce directional
biases in coordinate outputs. We investigate this phenomenon by analyzing how
MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed
through shuffling. Our analysis reveals that such perturbations induce
predictable, directional coordinate biases rather than random errors, suggesting
that models rely on internal positional priors when spatial grounding signals
are degraded. Crucially, we observe similar directional error patterns in
natural high-resolution datasets, indicating that positional encoding failures
are a key bottleneck for accurate coordinate prediction at scale. To address
this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free
test-time method that leverages the directional nature of these biases for
correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate
position-unconditioned tendencies, then uses this as negative evidence to guide
digit prediction while preserving coordinate format through a lightweight
finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable
improvements, highlighting positional encoding robustness as a critical factor
for spatial reasoning in MLLMs.
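
To make the idea concrete, here is a minimal sketch of how shuffle-based guidance could look at a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact combination rule, so the contrastive formula `(1 + alpha) * logp_main - alpha * logp_shuffled` is borrowed from contrastive/classifier-free-style decoding, and the names `shuffle_visual_positions`, `allowed_tokens`, `guided_step`, the `alpha` weight, and the `(x,y)` output format are all hypothetical. The two log-probability dictionaries stand in for the outputs of a real MLLM run with normal and shuffled VPEs.

```python
import random

DIGITS = "0123456789"


def shuffle_visual_positions(position_ids, visual_mask, seed=0):
    """Permute position ids among visual tokens only; text tokens keep theirs.

    Assumption: the model exposes one position id per token and a boolean
    mask marking which tokens are visual.
    """
    rng = random.Random(seed)
    visual_slots = [i for i, is_visual in enumerate(visual_mask) if is_visual]
    shuffled = [position_ids[i] for i in visual_slots]
    rng.shuffle(shuffled)
    out = list(position_ids)
    for slot, pos in zip(visual_slots, shuffled):
        out[slot] = pos
    return out


def allowed_tokens(prefix, max_digits=4):
    """Tiny finite-state machine for the assumed output format "(x,y)".

    Returns the set of single-character tokens that keep the prefix valid.
    """
    if prefix == "":
        return {"("}
    if prefix.endswith(")"):
        return set()  # coordinate is complete
    body = prefix[1:]
    if "," in body:
        current, closer = body.split(",")[1], ")"
    else:
        current, closer = body, ","
    allowed = set()
    if len(current) < max_digits:
        allowed |= set(DIGITS)
    if len(current) > 0:
        allowed.add(closer)  # a number needs at least one digit
    return allowed


def guided_step(logp_main, logp_shuffled, prefix, alpha=1.0):
    """Pick the next token, using the shuffled-VPE pass as negative evidence.

    logp_main / logp_shuffled: dicts mapping token -> log-probability from
    the normal pass and the shuffled-VPE pass (placeholders for model output).
    """
    candidates = sorted(allowed_tokens(prefix))
    if not candidates:
        return None
    scores = {
        tok: (1.0 + alpha) * logp_main[tok] - alpha * logp_shuffled[tok]
        for tok in candidates
    }
    return max(candidates, key=scores.get)
```

With `alpha=0` the step reduces to format-constrained greedy decoding on the normal pass; larger values penalize digits that the model prefers even when the visual positions carry no usable spatial signal.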