CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas
2510.13669v1
cs.CV, cs.AI, cs.LG
2025-10-17
Authors:
Zian Li, Muhan Zhang
Abstract
Masked autoregressive models (MAR) have recently emerged as a powerful
paradigm for image and video generation, combining the flexibility of masked
modeling with the potential of continuous tokenizers. However, video MAR models
suffer from two major limitations: the slow-start problem, caused by the lack
of a structured global prior at early sampling stages, and error accumulation
across the autoregression in both spatial and temporal dimensions. In this
work, we propose CanvasMAR, a novel video MAR model that mitigates these issues
by introducing a canvas mechanism--a blurred, global prediction of the next
frame, used as the starting point for masked generation. The canvas provides
global structure early in sampling, enabling faster and more coherent frame
synthesis. Furthermore, we introduce compositional classifier-free guidance
that jointly enlarges spatial (canvas) and temporal conditioning, and employ
noise-based canvas augmentation to enhance robustness. Experiments on the BAIR
and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality
videos with fewer autoregressive steps. Our approach achieves remarkable
performance among autoregressive models on the Kinetics-600 dataset and rivals
diffusion-based methods.
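
The abstract does not state the formula for compositional classifier-free guidance, so the following is only a minimal sketch of one common two-condition form, not the authors' method. It assumes three model outputs per step (unconditional, temporal-only, and temporal plus canvas) and two guidance scales; the function name, argument names, and the nesting order of the conditions are all hypothetical.

```python
def compositional_cfg(out_uncond, out_temporal, out_full, w_temporal, w_canvas):
    """Combine three model outputs with two guidance scales.

    Assumed (hypothetical) form, applied elementwise:
        out = uncond
              + w_temporal * (temporal - uncond)   # push toward temporal context
              + w_canvas   * (full - temporal)     # push toward canvas (spatial) prior
    """
    return [
        u + w_temporal * (t - u) + w_canvas * (f - t)
        for u, t, f in zip(out_uncond, out_temporal, out_full)
    ]


# Toy vectors standing in for per-token model outputs (illustrative values only).
uncond = [0.0, 1.0]
temporal = [0.5, 1.5]
full = [1.0, 1.0]

# With both scales at 1.0 the combination reduces to the fully conditioned output.
same_as_full = compositional_cfg(uncond, temporal, full, 1.0, 1.0)

# Scales above 1.0 extrapolate past the conditioned outputs, which is what
# "enlarging" a conditioning signal means in classifier-free guidance.
amplified = compositional_cfg(uncond, temporal, full, 2.0, 2.0)
```

Keeping separate scales for the temporal and canvas terms lets each conditioning signal be strengthened independently; setting both to zero falls back to the unconditional output.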