NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
2510.13721v2
cs.CL, cs.AI, cs.CV, cs.MM
2025-10-17
Authors:
Run Luo, Xiaobo Xia, Lu Wang, Longze Chen, Renke Shan, Jing Luo, Min Yang, Tat-Seng Chua
Abstract
Next-generation multimodal foundation models capable of any-to-any
cross-modal generation and multi-turn interaction will serve as core components
of artificial general intelligence systems, playing a pivotal role in
human-machine interaction. However, most existing multimodal models remain
constrained by autoregressive architectures, whose inherent limitations prevent
a balanced integration of understanding and generation capabilities. Although
hybrid and decoupling strategies have been explored to handle understanding and
generation separately within a unified framework, their redundant,
non-integrated designs limit their applicability to broader scenarios, such as
cross-modal retrieval.
In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model
that achieves unified modeling through discrete flow paradigms. By leveraging
metric-induced probability paths and kinetic optimal velocities, NExT-OMNI
natively supports any-to-any understanding and generation with enhanced
response efficiency, while enabling broader application scenarios through
concise unified representations rather than task-decoupled designs. Trained on
large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers
competitive performance on multimodal generation and understanding benchmarks,
while outperforming prior unified models in multi-turn multimodal interaction
and cross-modal retrieval, highlighting its architectural advantages as a
next-generation multimodal foundation model. To advance further research, we
release the training details and data protocols, and open-source both the code
and model checkpoints.