Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
2510.03441v1
cs.CV, cs.AI, cs.LG, 68T45, 68T10, 68T40
2025-10-08
Authors:
Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu
Abstract
Vision-language models (VLMs) have advanced multimodal reasoning but still
face challenges in spatial reasoning for 3D scenes and complex object
configurations. To address this, we introduce SpatialViLT, an enhanced VLM that
integrates spatial features such as depth maps, 3D coordinates, and edge maps
through a multi-task learning framework. This approach enriches multimodal
embeddings with spatial understanding. We propose two variants: SpatialViLT and
MaskedSpatialViLT, focusing on full and masked object regions, respectively.
Additionally, SpatialEnsemble combines both approaches, achieving
state-of-the-art accuracy. Our models excel in spatial reasoning categories
such as directional, topological, and proximity relations, as demonstrated on
the challenging Visual Spatial Reasoning (VSR) dataset. This work strengthens
the spatial reasoning of VLMs, a capability needed for advanced multimodal
understanding and real-world applications.
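
The multi-task idea in the abstract can be illustrated with a minimal sketch: a main loss for the spatial-reasoning task plus weighted auxiliary losses for reconstructing spatial features. This is not the authors' implementation. The function names, the choice of binary cross-entropy and mean squared error, the assumption that the main task is a true/false judgment, and the weight values are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Sequence


def binary_cross_entropy(prob_true: float, label: int) -> float:
    """Loss for a true/false prediction (assumed form of the main task)."""
    eps = 1e-12
    p = min(max(prob_true, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Reconstruction error for one flattened auxiliary map."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(
    prob_true: float,
    label: int,
    aux_pred: Dict[str, Sequence[float]],
    aux_target: Dict[str, Sequence[float]],
    aux_weights: Dict[str, float],
) -> float:
    """Main-task loss plus a weighted sum of auxiliary losses.

    The keys (e.g. "depth", "coords_3d", "edges") are hypothetical labels
    for the spatial features named in the abstract.
    """
    loss = binary_cross_entropy(prob_true, label)
    for name, weight in aux_weights.items():
        loss += weight * mean_squared_error(aux_pred[name], aux_target[name])
    return loss


# Toy example with made-up numbers: one prediction, two auxiliary maps.
example = multitask_loss(
    prob_true=0.8,
    label=1,
    aux_pred={"depth": [0.2, 0.4], "edges": [1.0, 0.0]},
    aux_target={"depth": [0.0, 0.5], "edges": [1.0, 0.0]},
    aux_weights={"depth": 0.5, "edges": 0.5},
)
```

In a sketch like this, the auxiliary weights control how strongly the spatial reconstruction terms shape the shared embedding relative to the main task. How the paper actually balances these terms is not stated in the abstract.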