S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

2508.04016v2 cs.CV 2025-08-09

Authors:

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

Summary

Video diffusion models (V-DMs) process very long token sequences, which causes high variance across calibration data and makes post-training quantization difficult. We propose S$^2$Q-VDiT, a quantization approach for V-DMs that calibrates on a high-quality dataset selected with both diffusion and quantization characteristics in mind. We also develop a token distillation method, based on an analysis of the sparse attention patterns in V-DMs, to improve the accuracy of the quantized model. Our experiments show that S$^2$Q-VDiT delivers $3.9\times$ model compression and $1.3\times$ faster inference while preserving accuracy. The approach combines high accuracy with resource savings for quantized video diffusion models.

Abstract

Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
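The abstract names two ingredients that a small example can make concrete: low-bit quantization of weights and activations (W4A6), and a distillation loss that weights tokens by attention. The NumPy sketch below is illustrative only and is not the paper's formulation. The function names, the per-tensor min-max quantizer, and the weighting rule (mean attention each token receives) are all assumptions made for this example.

```python
import numpy as np


def fake_quantize(x, n_bits):
    """Quantize-dequantize x on a uniform per-tensor min-max grid.

    Assumption: a plain asymmetric min-max quantizer, chosen for
    simplicity; the paper's quantizer may differ.
    """
    qmax = 2 ** n_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo


def token_weights(attn):
    """Turn an attention map into per-token importance weights.

    attn has shape (heads, queries, keys) with softmax-normalized rows.
    Assumption: importance of token j is the mean attention it
    receives over all heads and queries, normalized to sum to 1.
    """
    received = attn.mean(axis=(0, 1))
    return received / received.sum()


def weighted_distill_loss(fp_out, q_out, weights):
    """Per-token MSE between full-precision and quantized outputs,
    combined with the given token weights.

    fp_out and q_out have shape (tokens, dim); weights has shape (tokens,).
    """
    per_token = ((fp_out - q_out) ** 2).mean(axis=1)
    return float((weights * per_token).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, dim, heads = 8, 16, 2

    x = rng.normal(size=(tokens, dim))   # activations
    w = rng.normal(size=(dim, dim))      # layer weights

    fp_out = x @ w
    # W4A6: 4-bit weights, 6-bit activations.
    q_out = fake_quantize(x, 6) @ fake_quantize(w, 4)

    logits = rng.normal(size=(heads, tokens, tokens))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    loss = weighted_distill_loss(fp_out, q_out, token_weights(attn))
    print("attention-weighted distillation loss:", loss)
```

With uniform weights the loss reduces to the ordinary mean squared error, so the attention weights only shift emphasis toward the tokens that receive more attention.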
