The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)
2510.22207v1
cs.LG, cs.CL, cs.IT, math.IT, 94A08, 68P30, 68T50, E.4; I.2.7; I.2.7
2025-10-29
Авторы:
Nnamdi Aghanya, Jun Li, Kewei Wang
Abstract
Large Language Models (LLMs) can achieve near-optimal lossless compression by
acting as powerful probability models. We investigate their use in the lossy
domain, where reconstruction fidelity is traded for higher compression ratios.
This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec
that leverages a Masked Language Model (MLM) as a decompressor. Instead of
storing a subset of original tokens, EPC allows the model to predict masked
content and stores minimal, rank-based corrections only when the model's top
prediction is incorrect. This creates a residual channel that offers continuous
rate-distortion control. We compare EPC to a simpler Predictive Masking (PM)
baseline and a transform-based Vector Quantisation with a Residual Patch
(VQ+RE) approach. Through an evaluation that includes precise bit accounting
and rate-distortion analysis, we demonstrate that EPC consistently dominates
PM, offering superior fidelity at a significantly lower bit rate by more
efficiently utilising the model's intrinsic knowledge.