DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

2511.14530v1 cs.CV, cs.LG, cs.MM 2025-11-20

Authors:

Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen, Xiaozhen Qiao, Hao Li, Xiaoyan Sun

Abstract

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose a decoupled VAE (DeCo-VAE) to achieve a compact latent representation. Instead of encoding RGB pixels directly, we explicitly decouple video content into three components (keyframe, motion, and residual) and learn a dedicated latent representation for each. To avoid cross-component interference, we design a dedicated encoder for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further use a decoupled adaptation strategy that freezes some encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
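
The abstract does not say how the keyframe, motion, and residual components are computed, so the following is only a toy illustration of the general idea of a keyframe/motion/residual split, not the authors' method. It assumes a grayscale clip of shape (T, H, W), takes the first frame as the keyframe, models motion as a single global integer translation per frame (found by brute-force search), and stores whatever the shifted keyframe fails to explain as the residual. The function names `estimate_shift`, `decompose`, and `reconstruct` and the `max_shift` parameter are hypothetical.

```python
import numpy as np


def estimate_shift(key, frame, max_shift=2):
    """Brute-force the integer (dy, dx) roll of `key` that best matches `frame`.

    Assumption for illustration: motion is one global translation per frame.
    """
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(key, (dy, dx), axis=(0, 1))
            err = np.abs(shifted - frame).sum()
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


def warp(key, motion):
    """Apply each per-frame (dy, dx) shift to the keyframe."""
    return np.stack(
        [np.roll(key, (int(dy), int(dx)), axis=(0, 1)) for dy, dx in motion]
    )


def decompose(video, max_shift=2):
    """Split a (T, H, W) clip into keyframe, per-frame motion, and residual."""
    key = video[0]  # assumption: the first frame serves as the keyframe
    motion = np.array([estimate_shift(key, f, max_shift) for f in video])
    residual = video - warp(key, motion)  # what the shifted keyframe misses
    return key, motion, residual


def reconstruct(key, motion, residual):
    """Invert `decompose`: shifted keyframe plus residual."""
    return warp(key, motion) + residual
```

When consecutive frames are close to shifted copies of the keyframe, the residual is near zero and the motion component is only two integers per frame, which is the kind of inter-frame redundancy the abstract argues that direct RGB encoding leaves unexploited. In DeCo-VAE each component is then passed to its own learned encoder; that part is not sketched here.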

