HiSpec: Hierarchical Speculative Decoding for LLMs
2510.01336v1
cs.CL, cs.AI, cs.LG
2025-10-04
Авторы:
Avinash Kumar, Sujay Sanghavi, Poulami Das
Abstract
Speculative decoding accelerates LLM inference by using a smaller draft model
to speculate tokens that a larger target model verifies. Verification is often
the bottleneck (e.g. verification is $4\times$ slower than token generation
when a 3B model speculates for a 70B target model), but most prior works focus
only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces
verification time by discarding inaccurate draft tokens early, but existing
methods incur substantial training overheads in incorporating the intermediate
verifier, increase the memory footprint to orchestrate the intermediate
verification step, and compromise accuracy by relying on approximate
heuristics.
We propose $\underline{\textit{Hi}}\textit{erarchical
}\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for
high-throughput speculative decoding that exploits $\textit{early-exit (EE)
models}$ for low-overhead intermediate verification. EE models allow tokens to
exit early by skipping layer traversal and are explicitly trained so that
hidden states at selected layers can be interpreted, making them uniquely
suited for intermediate verification without drastically increasing compute and
memory overheads. To improve resource-efficiency even further, we design a
methodology that enables HiSpec to re-use key-value caches and hidden states
between the draft, intermediate verifier, and target models. To maintain
accuracy, HiSpec periodically validates the draft tokens accepted by the
intermediate verifier against the target model. Our evaluations using various
representative benchmarks and models show that HiSpec improves throughput by
1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline
single-layer speculation without compromising accuracy.
Ссылки и действия
Дополнительные ресурсы: