Expressive Value Learning for Scalable Offline Reinforcement Learning
2510.08218v1
cs.LG, cs.AI, I.2.6
2025-10-11
Authors:
Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun
Abstract
Reinforcement learning (RL) is a powerful paradigm for learning to make
sequences of decisions. However, RL has yet to be fully leveraged in robotics,
principally due to its lack of scalability. Offline RL offers a promising
avenue by training agents on large, diverse datasets, avoiding the costly
real-world interactions of online RL. Scaling offline RL to increasingly
complex datasets requires expressive generative models such as diffusion and
flow matching. However, existing methods typically depend on either
backpropagation through time (BPTT), which is computationally prohibitive, or
policy distillation, which introduces compounding errors and limits scalability
to larger base policies. In this paper, we consider the question of how to
develop a scalable offline RL approach without relying on distillation or
backpropagation through time. We introduce Expressive Value Learning for
Offline Reinforcement Learning (EVOR): a scalable offline RL approach that
integrates both expressive policies and expressive value functions. EVOR learns
an optimal, regularized Q-function via flow matching during training. At
inference time, EVOR extracts a policy via rejection sampling against the
expressive value function, enabling efficient optimization, regularization,
and compute-scalable search without retraining.
Empirically, we show that EVOR outperforms baselines on a diverse set of
offline RL tasks, demonstrating the benefit of integrating expressive value
learning into offline RL.
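
The abstract does not spell out the acceptance rule used at inference time, so the following is only a minimal sketch of the general idea, not the authors' implementation. It uses best-of-N selection (with optional softmax weighting) as a stand-in for rejection sampling: draw several candidate actions from a base policy, score each with a learned value function, and keep one. The names `extract_action`, `toy_sampler`, `toy_q`, `num_candidates`, and `temperature` are hypothetical and introduced purely for illustration.

```python
import math
import random
from typing import Any, Callable, Optional


def extract_action(
    state: Any,
    sample_action: Callable[[Any, random.Random], Any],  # hypothetical base-policy sampler
    q_value: Callable[[Any, Any], float],                # hypothetical learned Q-function
    num_candidates: int = 8,
    temperature: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Any:
    """Pick one action out of `num_candidates` base-policy samples, guided by Q."""
    rng = rng or random.Random()
    # Candidates come only from the base policy, so the chosen action
    # stays within the policy's support (an implicit regularizer).
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    scores = [q_value(state, action) for action in candidates]

    if temperature <= 0.0:
        # Greedy best-of-N: keep the highest-scoring candidate.
        best = max(range(num_candidates), key=scores.__getitem__)
        return candidates[best]

    # Softmax weighting: higher-scoring candidates are more likely to be kept.
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy stand-ins (assumptions): a Gaussian "policy" and a quadratic "Q-function".
def toy_sampler(state: float, rng: random.Random) -> float:
    return rng.gauss(0.0, 1.0)


def toy_q(state: float, action: float) -> float:
    return -(action - state) ** 2


greedy_action = extract_action(
    1.0, toy_sampler, toy_q, num_candidates=64, rng=random.Random(0)
)
```

In this sketch, `num_candidates` is the knob that trades inference compute for action quality, and neither the sampler nor the scorer is retrained when it changes.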