A Multimodal Depth-Aware Method For Embodied Reference Understanding
2510.08278v2
cs.CV, cs.HC, cs.RO
2025-10-14
Authors:
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Abstract
Embodied Reference Understanding (ERU) requires identifying a target object in a
visual scene based on both language instructions and pointing cues. While prior
works have shown progress in open-vocabulary object detection, they often fail
in ambiguous scenarios where multiple candidate objects exist in the scene. To
address this challenge, we propose a novel ERU framework that jointly
leverages LLM-based data augmentation, a depth-map modality, and a depth-aware
decision module. This design enables robust integration of linguistic and
embodied cues, improving disambiguation in complex or cluttered environments.
Experimental results on two datasets demonstrate that our approach
significantly outperforms existing baselines, achieving more accurate and
reliable referent detection.