HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection
2509.23690v1
cs.CV, cs.CL
2025-10-01
Authors:
Siyuan Gao, Jiashu Yao, Haoyu Wen, Yuhang Guo, Zeming Liu, Heyan Huang
Abstract
Embodied agents can identify and report safety hazards in home
environments. Accurately evaluating their capabilities in home safety
inspection tasks is crucial, but existing benchmarks suffer from two key
limitations. First, they oversimplify safety inspection tasks by using textual
descriptions of the environment instead of direct visual information, which
hinders the accurate evaluation of embodied agents based on Vision-Language
Models (VLMs). Second, they use a single, static viewpoint for environmental
observation, which restricts the agents' free exploration and causes the
omission of certain safety hazards, especially those that are occluded from a
fixed viewpoint. To alleviate these issues, we propose HomeSafeBench, a
benchmark with 12,900 data points covering five common home safety hazards:
fire, electric shock, falling objects, trips, and child safety. HomeSafeBench
provides dynamic first-person perspective images from simulated home
environments, enabling the evaluation of VLM capabilities for home safety
inspection. By allowing the embodied agents to freely explore the room,
HomeSafeBench provides multiple dynamic perspectives in complex environments
for a more thorough inspection. Our comprehensive evaluation of mainstream VLMs
on HomeSafeBench reveals that even the best-performing model achieves an
F1-score of only 10.23%, demonstrating significant limitations in current VLMs.
The models particularly struggle with identifying safety hazards and selecting
effective exploration strategies. We hope HomeSafeBench will provide a valuable
reference and support for future research on home safety inspection.
Our dataset and code will be publicly available soon.