Kinaema: a recurrent sequence model for memory and pose in motion
2510.20261v1
cs.RO, cs.CV, I.2.10
2025-10-25
Авторы:
Mert Bulent Sariyildiz, Philippe Weinzaepfel, Guillaume Bono, Gianluca Monaci, Christian Wolf
Abstract
One key aspect of spatially aware robots is the ability to "find their
bearings", ie. to correctly situate themselves in previously seen spaces. In
this work, we focus on this particular scenario of continuous robotics
operations, where information observed before an actual episode start is
exploited to optimize efficiency. We introduce a new model, Kinaema, and agent,
capable of integrating a stream of visual observations while moving in a
potentially large scene, and upon request, processing a query image and
predicting the relative position of the shown space with respect to its current
position. Our model does not explicitly store an observation history, therefore
does not have hard constraints on context length. It maintains an implicit
latent memory, which is updated by a transformer in a recurrent way,
compressing the history of sensor readings into a compact representation. We
evaluate the impact of this model in a new downstream task we call "Mem-Nav".
We show that our large-capacity recurrent model maintains a useful
representation of the scene, navigates to goals observed before the actual
episode start, and is computationally efficient, in particular compared to
classical transformers with attention over an observation history.
Ссылки и действия
Дополнительные ресурсы: