KV Cache Transform Coding for Compact Storage in LLM Inference
2511.01815v1
cs.CL, cs.AI, cs.LG
2025-11-06
Авторы:
Konrad Staniszewski, Adrian Łańcucki
Abstract
Serving large language models (LLMs) at scale necessitates efficient
key-value (KV) cache management. KV caches can be reused across conversation
turns via shared-prefix prompts that are common in iterative code editing and
chat. However, stale caches consume scarce GPU memory, require offloading, or
force recomputation. We present KVTC, a lightweight transform coder that
compresses KV caches for compact on-GPU and off-GPU storage. Drawing on
classical media compression, KVTC combines PCA-based feature decorrelation,
adaptive quantization, and entropy coding. It requires only a brief initial
calibration and leaves model parameters unchanged. By exploiting redundancies
in KV caches, KVTC achieves up to 20$\times$ compression while maintaining
reasoning and long-context accuracy, and 40$\times$ or higher for specific use
cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across
benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and
MATH-500. It consistently outperforms inference-time baselines such as token
eviction, quantization, and SVD-based methods, while achieving higher
compression ratios. These results support KVTC as a practical building block
for memory-efficient LLM serving with reusable KV caches.
Ссылки и действия
Дополнительные ресурсы: