Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
2510.22672v2
cs.CV, cs.CL, cs.RO, I.2.10; I.2.9; I.2.7; H.5.2
2025-10-29
Authors:
Anna Deichler, Jonas Beskow
Abstract
We introduce Look and Tell, a multimodal dataset for studying referential
communication across egocentric and exocentric perspectives. Using Meta Project
Aria smart glasses and stationary cameras, we recorded synchronized gaze,
speech, and video as 25 participants instructed a partner to identify
ingredients in a kitchen. Combined with 3D scene reconstructions, this setup
provides a benchmark for evaluating how different spatial representations (2D
vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67
hours of recordings, including 2,707 richly annotated referential expressions,
and is designed to advance the development of embodied agents that can
understand and engage in situated dialogue.