FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
2509.25401v1
cs.LG, cs.AI, cs.PF
2025-10-02
Authors:
Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, Hong An
Abstract
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional
capabilities in visual synthesis, yet their deployment remains constrained by
substantial computational demands. To alleviate this bottleneck, many
sparsity-based acceleration methods have been proposed. However, their diverse
sparsity patterns often require customized kernels for high-performance
inference, limiting universality. We propose FlashOmni, a unified sparse
attention engine compatible with arbitrary DiT architectures. FlashOmni
introduces flexible sparse symbols to standardize the representation of a wide
range of sparsity strategies, such as feature caching and block-sparse
skipping. This unified abstraction enables the execution of diverse sparse
computations within a single attention kernel. In addition, FlashOmni designs
optimized sparse GEMMs for attention blocks, leveraging sparse symbols to
eliminate redundant computations and further improve efficiency. Experiments
demonstrate that FlashOmni delivers near-linear speedup that closely matches
the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves
2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of
the theoretical limit). Applied with a multi-granularity sparsity strategy, it
enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end
acceleration without degrading visual quality.
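
To make the block-sparse skipping idea concrete, here is a minimal pure-Python sketch, not FlashOmni's kernel or its actual symbol format. It assumes a hypothetical per-tile flag table `keep` (a stand-in for the "sparse symbols" mentioned in the abstract) and a hypothetical helper `block_sparse_attention`. Tiles whose flag is 0 are skipped entirely, so the number of computed tiles shrinks in proportion to the sparsity ratio.

```python
import math

def block_sparse_attention(q, k, v, block, keep):
    """Single-head attention that skips (query-block, key-block) tiles.

    q, k, v are lists of rows (lists of floats). keep[i][j] is a
    hypothetical per-tile flag: 1 means compute the tile, 0 means skip it.
    Returns (output rows, number of tiles actually computed).
    """
    n, d, dv = len(q), len(q[0]), len(v[0])
    scale = 1.0 / math.sqrt(d)
    out, computed = [], 0
    for qb in range(0, n, block):
        rows = range(qb, min(qb + block, n))
        scores = {i: {} for i in rows}  # kept scores per query row
        for kb in range(0, len(k), block):
            if not keep[qb // block][kb // block]:
                continue  # skipped tile: no dot products are evaluated
            computed += 1
            for i in rows:
                for j in range(kb, min(kb + block, len(k))):
                    scores[i][j] = scale * sum(a * b for a, b in zip(q[i], k[j]))
        for i in rows:
            if not scores[i]:  # every key tile skipped for this row
                out.append([0.0] * dv)
                continue
            m = max(scores[i].values())  # subtract max for a stable softmax
            w = {j: math.exp(s - m) for j, s in scores[i].items()}
            z = sum(w.values())
            out.append([sum(w[j] * v[j][c] for j in w) / z for c in range(dv)])
    return out, computed

# Toy input: 4 tokens, block size 2, so a 2x2 grid of tiles.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = q
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 1.0]]
keep = [[1, 0], [1, 1]]  # skip the tile (query block 0, key block 1)
out, computed = block_sparse_attention(q, k, v, block=2, keep=keep)
_, dense_tiles = block_sparse_attention(q, k, v, block=2, keep=[[1, 1], [1, 1]])
```

In this toy case 3 of 4 tiles are evaluated. A real kernel would make the skip decision per thread block on the GPU rather than in a Python loop, and the feature-caching path described in the abstract is not modeled here.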