Reject Only Critical Tokens: Pivot-Aware Speculative Decoding
2511.00351v1
cs.LG, cs.CL
2025-11-06
Authors:
Amir Ziashahabi, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Mostafa El-Khamy, Sai Praneeth Karimireddy, Salman Avestimehr
Abstract
Speculative Decoding (SD) ensures that the output matches the target model's
distribution exactly. However, we argue that this distribution matching
requirement is too stringent and results in unnecessarily low acceptance rates,
limiting potential speedups. Instead, we advocate a reformulation of the
decoding objective: the proposed decoding strategy should match the expected
utility, i.e., the task-specific performance, of the target model. This
perspective also aligns better with real-world use cases of LLMs, where utility
(e.g., code correctness, factual accuracy) is often more important than
sampling distribution. Based on this reformulation, we propose a novel decoding
strategy: Pivot-Aware Speculative Decoding, which rejects only those tokens
that would lead to a utility drop in the final output. We refer to these
critical tokens as pivot tokens. We propose a method for labeling tokens as
pivotal or non-pivotal and train a lightweight classifier to detect them. This
method can be viewed as a relaxed version of standard SD that offers a much
higher acceptance rate while preserving utility. We evaluate our method across
various datasets, demonstrating that we can achieve a speedup of up to
$2.5\times$ with comparable utility. Source code is available at
https://github.com/amir-zsh/PAD.
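
To make the acceptance rule concrete, the following is a minimal sketch of one verification step, not the authors' implementation. It assumes greedy verification, where the target model's token at each position is compared with the draft token, and it treats the pivot classifier as an opaque callable. The names `pivot_aware_accept` and `is_pivot`, and the classifier's arguments, are hypothetical; the abstract does not specify the classifier's inputs or the exact acceptance rule.

```python
from typing import Callable, List, Sequence


def pivot_aware_accept(
    draft_tokens: Sequence[int],
    target_tokens: Sequence[int],
    is_pivot: Callable[[int, int, int], bool],
) -> List[int]:
    """Return the tokens kept from one draft-and-verify step.

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the target model's own choice at each drafted position
                   (assumed greedy, conditioned on the drafted prefix).
    is_pivot:      hypothetical classifier taking (position, draft_token,
                   target_token) and returning True when the mismatch is
                   predicted to hurt final utility.
    """
    if len(target_tokens) < len(draft_tokens):
        raise ValueError("need one target token per draft token")

    kept: List[int] = []
    for pos, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft == target:
            # Agreement: accept, exactly as in standard greedy SD.
            kept.append(draft)
        elif not is_pivot(pos, draft, target):
            # Mismatch judged non-pivotal: keep the draft token anyway.
            kept.append(draft)
        else:
            # Pivot token: reject the draft token, emit the target token,
            # and discard the rest of the draft.
            kept.append(target)
            break
    return kept


# Toy example: the draft disagrees with the target at positions 1 and 3.
draft = [1, 2, 3, 4]
target = [1, 9, 3, 7]

# Treating every mismatch as pivotal recovers strict greedy verification.
strict = pivot_aware_accept(draft, target, lambda pos, d, t: True)

# A classifier that flags only position 3 lets the mismatch at 1 through.
relaxed = pivot_aware_accept(draft, target, lambda pos, d, t: pos == 3)
```

In this toy run the strict rule keeps two tokens and stops at the first mismatch, while the relaxed rule keeps all four positions and replaces only the flagged one. That is the sense in which rejecting only pivot tokens raises the acceptance rate.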