Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
2510.22443v1
cs.CV, cs.LG
2025-10-29
Authors:
Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, Kevin Carlberg, Joseph Tighe, Karl Ridgeway
Abstract
There has been a surge of interest in assistive wearable agents: agents
embodied in wearable form factors (e.g., smart glasses) that take assistive
actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this
work, we consider the important complementary problem of inferring that goal
from multimodal contextual observations. Solving this "goal inference" problem
holds the promise of eliminating the effort needed to interact with such an
agent. This work focuses on creating WAGIBench, a strong benchmark to measure
progress in solving this problem using vision-language models (VLMs). Given the
limited prior work in this area, we collected a novel dataset comprising 29
hours of multimodal data from 348 participants across 3,477 recordings,
featuring ground-truth goals alongside accompanying visual, audio, digital, and
longitudinal contextual observations. We validate that human performance
exceeds model performance, achieving 93% multiple-choice accuracy compared with
84% for the best-performing VLM. Generative benchmark results that evaluate
several families of modern vision-language models show that larger models
perform significantly better on the task, yet remain far from practical
usefulness, as they produce relevant goals only 55% of the time. Through a
modality ablation, we show that models benefit from extra information in
relevant modalities with minimal performance degradation from irrelevant
modalities.