From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
2511.02427v1
cs.CV, cs.RO
2025-11-06
Авторы:
Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf
Abstract
Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in its
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
Ссылки и действия
Дополнительные ресурсы: