A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
2510.27033v1
cs.RO, cs.AI, cs.CV
2025-11-04
Authors:
Simindokht Jahangard, Mehrzad Mohammadi, Abhinav Dhall, Hamid Rezatofighi
Abstract
Visual reasoning, particularly spatial reasoning, is a challenging cognitive
task that requires understanding object relationships and their interactions
within complex environments, especially in the robotics domain. Existing
vision-language models (VLMs) excel at perception tasks but struggle with
fine-grained spatial reasoning due to their implicit, correlation-driven
reasoning and reliance solely on images. We propose a novel neuro-symbolic
framework that integrates both panoramic-image and 3D point cloud information,
combining neural perception with symbolic reasoning to explicitly model spatial
and logical relationships. Our framework consists of a perception module for
detecting entities and extracting attributes, and a reasoning module that
constructs a structured scene graph to support precise, interpretable queries.
Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior
performance and reliability in crowded, human-built environments while
maintaining a lightweight design suitable for robotics and embodied AI
applications.
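
As a rough illustration of the two-module idea described in the abstract, the sketch below builds a tiny scene graph from already-detected entities and answers one spatial query over it. It is not the authors' implementation: the names Entity, build_scene_graph, and query, the "near" and "left_of" relations, the 1.5 m distance threshold, and the convention that +y points to the robot's left are all assumptions made for this example, and the perception step is replaced by hand-written detections.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Entity:
    """A detected entity: 3D centroid (metres, robot frame) plus attributes."""
    id: str
    label: str
    position: tuple  # (x, y, z); hypothetical output of a perception module
    attributes: frozenset = frozenset()


def build_scene_graph(entities, near_threshold=1.5):
    """Return a set of (subject_id, relation, object_id) triples."""
    edges = set()
    for a in entities:
        for b in entities:
            if a.id == b.id:
                continue
            # "near": Euclidean distance between centroids under a threshold.
            if math.dist(a.position, b.position) <= near_threshold:
                edges.add((a.id, "near", b.id))
            # "left_of": assumes +y points to the robot's left.
            if a.position[1] > b.position[1]:
                edges.add((a.id, "left_of", b.id))
    return edges


def query(entities, edges, label, relation, target_label, attribute=None):
    """Ids of `label` entities that stand in `relation` to some `target_label`."""
    by_id = {e.id: e for e in entities}
    matches = {
        s for (s, r, o) in edges
        if r == relation
        and by_id[s].label == label
        and by_id[o].label == target_label
        and (attribute is None or attribute in by_id[s].attributes)
    }
    return sorted(matches)


# Hand-written detections standing in for the perception module's output.
entities = [
    Entity("p1", "person", (2.0, 1.0, 0.0), frozenset({"standing"})),
    Entity("p2", "person", (5.0, -2.0, 0.0), frozenset({"sitting"})),
    Entity("t1", "table", (2.5, 0.2, 0.0)),
]
edges = build_scene_graph(entities)
result = query(entities, edges, "person", "near", "table")
print(result)
```

Because the relations are stored as explicit triples, each answer can be traced back to the edge that produced it, which is the kind of interpretability the abstract attributes to symbolic reasoning.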