Accelerating LLM Inference with Precomputed Query Storage
2509.25919v1
cs.DC, cs.AI
2025-10-02
Authors:
Jay H. Park, Youngju Cho, Choungsol Lee, Moonwook Oh, Euiseong Seo
Abstract
Large language model (LLM) inference often suffers from high latency,
particularly in resource-constrained environments such as on-device or edge
deployments. To address this challenge, we present StorInfer, a novel
storage-assisted LLM inference system that reduces response time by
precomputing and storing predictable query-response pairs offline. When a user
query semantically matches a precomputed query, StorInfer bypasses expensive
GPU inference and instantly returns the stored response, significantly reducing
latency and compute costs. To maximize coverage and effectiveness, StorInfer
employs an LLM-driven generator that adaptively produces diverse and
deduplicated queries based on a given knowledge base. This is achieved via two
techniques: adaptive query masking, which prevents regeneration of similar
queries, and adaptive sampling, which dynamically tunes generation parameters
to promote semantic diversity. The resulting query-response pairs are embedded
and indexed using a disk-backed vector database to enable fast,
similarity-based retrieval at runtime. Using this approach, we generated 150K
unique precomputed pairs (taking up to 830 MB of storage space), achieving up
to 17.3% latency reduction with no loss in response quality. Our evaluation
across multiple QA datasets demonstrates the practicality and scalability of
storage-assisted inference, especially in scenarios with predictable query
distributions. StorInfer highlights a promising direction in leveraging storage
as a primary enabler for efficient, low-latency LLM deployment.
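
The runtime path described in the abstract (match the incoming query against precomputed queries, return the stored response on a hit, otherwise fall back to inference) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not StorInfer's actual implementation: the bag-of-words `embed` function stands in for a real sentence encoder, the in-memory list stands in for the disk-backed vector database, and the names `PrecomputedStore`, `answer`, `run_llm`, and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PrecomputedStore:
    """In-memory stand-in for a disk-backed vector index (hypothetical)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (embedding, stored response)

    def add(self, query, response):
        self.entries.append((embed(query), response))

    def lookup(self, query):
        """Return the best stored response, or None if below the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def answer(query, store, run_llm):
    """Serve from storage on a hit; otherwise call the (expensive) model."""
    cached = store.lookup(query)
    if cached is not None:
        return cached, True
    return run_llm(query), False
```

A caller would populate the store offline with `store.add(query, response)` for each generated pair, then route every user query through `answer`, passing whatever inference function it already uses as `run_llm`. The second element of the returned tuple reports whether the response came from storage.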