Understanding Multi-View Transformers

2510.24907v1 cs.CV, cs.AI, cs.LG 2025-10-31

Авторы:

Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann

Abstract

Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Understanding Multi-View Transformers

Авторы:

Abstract

Ссылки и действия

Связанные статьи

PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispec...

ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fra...

GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy...

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video...

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Навигация