Understanding Multi-View Transformers
2510.24907v1
cs.CV, cs.AI, cs.LG
2025-10-31
Авторы:
Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann
Abstract
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by
solving 3D tasks in a feed-forward manner. However, contrary to previous
optimization-based pipelines, the inner mechanisms of multi-view transformers
are unclear. Their black-box nature makes further improvements beyond data
scaling challenging and complicates usage in safety- and reliability-critical
applications. Here, we present an approach for probing and visualizing 3D
representations from the residual connections of the multi-view transformers'
layers. In this manner, we investigate a variant of the DUSt3R model, shedding
light on the development of its latent state across blocks, the role of the
individual layers, and suggest how it differs from methods with stronger
inductive biases of explicit global pose. Finally, we show that the
investigated variant of DUSt3R estimates correspondences that are refined with
reconstructed geometry. The code used for the analysis is available at
https://github.com/JulienGaubil/und3rstand .
Ссылки и действия
Дополнительные ресурсы: