Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach
2510.19528v1
stat.ML, cs.LG, math.ST, stat.TH
2025-10-24
Authors:
Sebastian Reboul, Hélène Halconruy, Randal Douc
Abstract
We investigate the fundamental problem of leveraging offline data to
accelerate online reinforcement learning, a direction with strong potential
but limited theoretical grounding. Our study centers on how to learn and apply
value envelopes within this context. To this end, we introduce a principled
two-stage framework: the first stage uses offline data to derive upper and
lower bounds on value functions, while the second incorporates these learned
bounds into online algorithms. Our method extends prior work by decoupling the
upper and lower bounds, enabling more flexible and tighter approximations. In
contrast to approaches that rely on fixed shaping functions, our envelopes are
data-driven and explicitly modeled as random variables, with a filtration
argument ensuring independence across phases. The analysis establishes
high-probability regret bounds determined by two interpretable quantities,
thereby providing a formal bridge between offline pre-training and online
fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret
reductions compared with both UCBVI and prior methods.
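
To illustrate the second stage, here is a minimal sketch of how precomputed envelopes could tighten a UCBVI-style optimistic backward induction on a tabular, finite-horizon MDP. It is not the authors' algorithm. The function name, the argument layout, and the choice to clip the optimistic estimate into the interval [lower, upper] are assumptions made for this example, and the envelopes are taken as given rather than learned from offline data.

```python
def enveloped_optimistic_values(rewards, transitions, bonus, lower, upper, horizon):
    """Backward induction with optimistic bonuses, clipped to a value envelope.

    All names and shapes are hypothetical, chosen for this sketch:
      rewards[s][a]         -- estimated reward for action a in state s
      transitions[s][a][s2] -- estimated probability of moving from s to s2
      bonus[s][a]           -- exploration bonus (UCBVI-style)
      lower[h][s]           -- assumed lower envelope on the value at step h
      upper[h][s]           -- assumed upper envelope on the value at step h
    Returns values[h][s] for h = 0..horizon, with values[horizon][s] = 0.
    """
    n_states = len(rewards)
    values = [[0.0] * n_states for _ in range(horizon + 1)]
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            # Optimistic one-step lookahead over all actions.
            best = max(
                rewards[s][a]
                + bonus[s][a]
                + sum(p * values[h + 1][s2]
                      for s2, p in enumerate(transitions[s][a]))
                for a in range(len(rewards[s]))
            )
            # Clip the optimistic estimate into the envelope [lower, upper].
            values[h][s] = min(upper[h][s], max(lower[h][s], best))
    return values


# Toy example: one state, one action, self-loop, horizon 2.
toy_values = enveloped_optimistic_values(
    rewards=[[1.0]],
    transitions=[[[1.0]]],
    bonus=[[0.5]],
    lower=[[0.0], [0.0]],
    upper=[[2.0], [1.0]],
    horizon=2,
)
```

In the toy example the unclipped optimistic value at the first step would be 3.0, while the assumed upper envelope caps it at 2.0. This is the kind of tightening that an informative envelope is meant to provide.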