BeLLMan: Controlling LLM Congestion
2510.15330v1
cs.DC, cs.AI, cs.CL, cs.NI
2025-10-21
Авторы:
Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon, Rohan Gandhi, Anjaly Parayil, Debopam Bhattacherjee
Abstract
Large language model (LLM) applications are blindfolded to the infrastructure
underneath and generate tokens autoregressively, indifferent to the system
load, thus risking inferencing latency inflation and poor user experience. Our
first-cut controller, named beLLMan, enables the LLM infrastructure to actively
and progressively signal the first-party LLM application to adjust the output
length in response to changing system load. On a real testbed with H100 GPUs,
beLLMan helps keep inferencing latency under control (upto 8X lower end-to-end
latency) and reduces energy consumption by 25% (while serving 19% more
requests) during periods of congestion for a summarization workload.