Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
2511.01695v2
cs.LG, eess.SP
2025-11-06
Authors:
Jungyeon Koh, Hyun Jong Yang
Abstract
The growing demand for on-device large language model (LLM) inference
highlights the need for efficient mobile edge computing (MEC) solutions,
especially in resource-constrained settings. Speculative decoding offers a
promising solution by partitioning token generation between a lightweight draft
model on mobile devices and a powerful target model on edge servers, but
suffers from communication overhead and asynchronous delays. This paper is the
first to propose a unified framework that jointly optimizes user association
and resource allocation (UARA) to support efficient parallel speculative
decoding. We solve the UARA problem using a multi-agent deep reinforcement
learning algorithm. To evaluate our approach under realistic conditions, we
conduct experiments using the Sionna simulator. Results show that our method
reduces end-to-end latency by up to 28.0% (23.7% on average) without
compromising inference accuracy, enabling scalable, low-latency LLM services
in MEC systems.
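
For readers unfamiliar with the draft/target split the abstract refers to, the following is a minimal sketch of generic greedy speculative decoding. It is not the paper's parallel scheme and does not model UARA, the wireless link, or latency. All names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical, and the two "models" are toy next-token functions assumed purely for illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # Draft phase: the lightweight model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # Verify phase: the target checks each proposed position in order.
    # (A real target model would score all k positions in one batched pass.)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run verify rounds until max_new tokens exist; returns (tokens, rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except right after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


def target_only(prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: plain greedy decoding with the target model alone."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(toy_target(out))
    return out


tokens, rounds = generate([0], toy_draft, toy_target, k=4, max_new=8)
```

In this greedy variant the output matches target-only decoding token for token, which is the sense in which speculative decoding need not cost accuracy; the saving comes from needing fewer target rounds than generated tokens. In an MEC deployment each round also incurs an uplink and downlink exchange, which is the communication overhead the abstract mentions.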