MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
2510.04220v1
cs.CV, cs.AI, cs.LG
2025-10-08
Авторы:
Lixuan He, Shikang Zheng, Linfeng Zhang
Abstract
Autoregressive (AR) models have shown great promise in image generation, yet
they face a fundamental inefficiency stemming from their core component: a
vast, unstructured vocabulary of visual tokens. This conventional approach
treats tokens as a flat vocabulary, disregarding the intrinsic structure of the
token embedding space where proximity often correlates with semantic
similarity. This oversight results in a highly complex prediction task, which
hinders training efficiency and limits final generation quality. To resolve
this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled
framework that constructs a hierarchical semantic tree directly from the
codebook's intrinsic structure. MASC employs a novel geometry-aware distance
metric and a density-driven agglomerative construction to model the underlying
manifold of the token embeddings. By transforming the flat, high-dimensional
prediction task into a structured, hierarchical one, MASC introduces a
beneficial inductive bias that significantly simplifies the learning problem
for the AR model. MASC is designed as a plug-and-play module, and our extensive
experiments validate its effectiveness: it accelerates training by up to 57%
and significantly improves generation quality, reducing the FID of LlamaGen-XL
from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly
competitive with state-of-the-art methods, establishing that structuring the
prediction space is as crucial as architectural innovation for scalable
generative modeling.
Ссылки и действия
Дополнительные ресурсы: