When to Reason: Semantic Router for vLLM
2510.08731v1
cs.ET, cs.AI, cs.CL, cs.SY, eess.SY
2025-10-14
Authors:
Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen
Abstract
Large Language Models (LLMs) demonstrate substantial accuracy gains when
augmented with reasoning modes such as chain-of-thought and inference-time
scaling. However, reasoning also incurs significant costs in inference latency
and token usage, with environmental and financial impacts, and these costs are
unnecessary for many simple prompts. We present a semantic router that
classifies queries based on their reasoning requirements and selectively
applies reasoning only when beneficial. Our approach achieves a 10.2 percentage
point improvement in accuracy on the MMLU-Pro benchmark while reducing response
latency by 47.1% and token consumption by 48.5% compared to direct inference
with vLLM. These results demonstrate that semantic routing offers an effective
mechanism for striking a balance between accuracy and efficiency in open-source
LLM serving systems.
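
The routing idea described in the abstract (classify each query by how much reasoning it needs, then enable reasoning only when it is likely to help) can be illustrated with a minimal sketch. The abstract does not specify how the classifier works, so everything below is an assumption made for illustration: the keyword list `REASONING_CUES`, the scoring function, the `threshold` value, and the `RoutingDecision` type are all hypothetical stand-ins for the semantic classifier, and no vLLM call is shown.

```python
from dataclasses import dataclass

# Hypothetical cue phrases. This keyword heuristic is only a stand-in for
# the semantic classifier mentioned in the abstract, which is not specified.
REASONING_CUES = ("prove", "derive", "calculate", "step by step", "how many", "why")


@dataclass
class RoutingDecision:
    use_reasoning: bool  # True if the prompt should be served with reasoning enabled
    score: float         # fraction of cue phrases found in the prompt


def score_reasoning_need(prompt: str) -> float:
    """Return the fraction of cue phrases that occur in the prompt."""
    text = prompt.lower()
    hits = sum(1 for cue in REASONING_CUES if cue in text)
    return hits / len(REASONING_CUES)


def route(prompt: str, threshold: float = 0.1) -> RoutingDecision:
    """Enable reasoning only when the score reaches the (assumed) threshold."""
    score = score_reasoning_need(prompt)
    return RoutingDecision(use_reasoning=score >= threshold, score=score)


if __name__ == "__main__":
    for p in ("What is the capital of France?",
              "Prove that sqrt(2) is irrational, step by step."):
        print(p, route(p))
```

In a serving stack, the returned decision would determine whether the request is sent with a reasoning mode enabled or as a direct, non-reasoning request; how that switch is made depends on the model and server configuration and is not covered here.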