Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models
2510.04146v1
cs.LG, cs.AI, cs.CL
2025-10-08
Authors:
Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance on a
broad range of Natural Language Processing (NLP) tasks, including document
processing and coding. Autoregressive Language Models (ARMs), which generate
tokens sequentially conditioned on all previous tokens, have been the
predominant paradigm for LLMs. Although these networks achieve high accuracy
across a range of downstream tasks, they exhibit low arithmetic intensity
because of the sequential dependency inherent in next-token prediction.
Recently, Diffusion Language Models (DLMs) have emerged as a promising
alternative architecture. DLMs generate output text in parallel, breaking the
limitations of sequential dependency. However, the performance implications of
DLMs relative to commonly deployed ARMs are not fully understood. In this work,
we present a comprehensive study of the performance characteristics of ARMs
and DLMs, using both theoretical analysis and profiling data to characterize
the trade-offs between these approaches. We show that although DLMs exhibit
higher arithmetic intensity than ARMs because they can exploit parallelism
across the sequence length, they fail to scale effectively to longer
contexts. We then explore DLMs with block-wise
decoding, outlining how this approach allows for increased arithmetic
intensity, while still scaling well to long contexts (similar to ARMs). We also
show interesting trade-offs for batched inference, where we find that ARMs
exhibit superior throughput, as they benefit more from parallelism across
sequences in the batch. Finally, we highlight opportunities for accelerating
DLM inference; in particular, reducing the number of sampling steps is key to
enabling open-source DLMs to achieve lower latency than ARMs.
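
The arithmetic-intensity contrast described in the abstract can be illustrated with a rough roofline-style estimate for a single linear layer. This is a minimal sketch under simplifying assumptions, not the paper's analysis: it counts only one weight matrix, ignores attention and any KV cache, and assumes 2-byte (FP16/BF16) elements. The function name `linear_arithmetic_intensity`, the hidden size of 4096, and the 256-token sequence are hypothetical values chosen for illustration.

```python
def linear_arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """Estimate FLOPs per byte for y = x @ W with x of shape (n_tokens, d_in).

    Assumption: weights, inputs, and outputs each cross the memory
    interface exactly once, and every element takes bytes_per_elem bytes.
    """
    # One multiply and one add per (token, input, output) triple.
    flops = 2 * n_tokens * d_in * d_out
    # Bytes for the weight matrix, the input activations, and the outputs.
    bytes_moved = bytes_per_elem * (
        d_in * d_out + n_tokens * d_in + n_tokens * d_out
    )
    return flops / bytes_moved


# Hypothetical setting: hidden size 4096, batch size 1.
# An ARM decode step processes a single new token per forward pass.
arm_decode = linear_arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)
# A DLM denoising step processes every position of a 256-token sequence at once.
dlm_step = linear_arithmetic_intensity(n_tokens=256, d_in=4096, d_out=4096)

print(f"ARM decode step: {arm_decode:.2f} FLOPs/byte")
print(f"DLM step (256 tokens): {dlm_step:.2f} FLOPs/byte")
```

Under these assumptions the single-token case stays near 1 FLOP per byte, because loading the weights dominates memory traffic, while the multi-token case reuses each loaded weight across all positions. Higher intensity per step does not by itself imply lower end-to-end latency, since a DLM repeats this full-sequence computation at every sampling step.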