TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
2511.02017v1
cs.LG, cs.CL
2025-11-06
Авторы:
Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
Abstract
Speculative decoding accelerates LLMs by using a lightweight draft model to
generate tokens autoregressively before verifying them in parallel with a
larger target model. However, determining the optimal number of tokens to draft
remains a key challenge limiting the approach's effectiveness. Dynamic
speculative decoding aims to intelligently decide how many tokens to draft to
achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive
thresholds (e.g., token entropy), which are costly to set and generalize poorly
across models and domains. We propose TapOut, an online, training-free,
plug-and-play algorithm for dynamic speculation policy selection using
multi-armed bandits. Our approach employs a meta-algorithm that selects among
multiple parameter-free dynamic speculation strategies based on past reward and
exploration. We conduct extensive experiments across diverse model pairs and
datasets, showing that TapOut achieves competitive or superior speedups
compared to well-established dynamic speculation baselines without any
hyperparameter tuning.
Ссылки и действия
Дополнительные ресурсы: