True Self-Supervised Novel View Synthesis is Transferable
2510.13063v1
cs.CV, cs.AI, cs.LG
2025-10-17
Авторы:
Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann
Abstract
In this paper, we identify that the key criterion for determining whether a
model is truly capable of novel view synthesis (NVS) is transferability:
Whether any pose representation extracted from one video sequence can be used
to re-render the same camera trajectory in another. We analyze prior work on
self-supervised NVS and find that their predicted poses do not transfer: The
same set of poses lead to different camera trajectories in different 3D scenes.
Here, we present XFactor, the first geometry-free self-supervised model capable
of true NVS. XFactor combines pair-wise pose estimation with a simple
augmentation scheme of the inputs and outputs that jointly enables
disentangling camera pose from scene content and facilitates geometric
reasoning. Remarkably, we show that XFactor achieves transferability with
unconstrained latent pose variables, without any 3D inductive biases or
concepts from multi-view geometry -- such as an explicit parameterization of
poses as elements of SE(3). We introduce a new metric to quantify
transferability, and through large-scale experiments, we demonstrate that
XFactor significantly outperforms prior pose-free NVS transformers, and show
that latent poses are highly correlated with real-world poses through probing
experiments.
Ссылки и действия
Дополнительные ресурсы: