Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
2510.18457v1
cs.CV, cs.LG
2025-10-23
Authors:
Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng
Abstract
The performance of Latent Diffusion Models (LDMs) is critically dependent on
the quality of their visual tokenizer. While recent works have explored
incorporating Vision Foundation Models (VFMs) via distillation, we identify a
fundamental flaw in this approach: it inevitably weakens the robustness of
alignment with the original VFM, causing the aligned latents to deviate
semantically under distribution shifts. In this paper, we bypass distillation
by proposing a more direct approach: Vision Foundation Model Variational
Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's
semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE
decoder with Multi-Scale Latent Fusion and Progressive Resolution
Reconstruction blocks, enabling high-quality reconstruction from spatially
coarse VFM features. Furthermore, we provide a comprehensive analysis of
representation dynamics during diffusion training and introduce the
SE-CKNNA metric as a more precise diagnostic tool. This analysis allows
us to develop a joint tokenizer-diffusion alignment strategy that dramatically
accelerates convergence. Our innovations in tokenizer design and training
strategy lead to superior performance and efficiency: our system reaches a gFID
(w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers).
With continued training to 640 epochs, it further attains a gFID (w/o CFG) of
1.62, establishing direct VFM integration as a superior paradigm for LDMs.