SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

2511.17649v1 cs.CV, cs.AI, cs.RO 2025-11-25

Authors:

Jieru Lin, Zhiwei Yu, Börje F. Karlsson

Abstract

Autonomous intelligence requires not only perception and reasoning, but critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand commonsense and physics reasoning, but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.
