Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains
2510.25514v1
stat.ML, cs.LG
2025-10-31
Авторы:
Maik Overmars, Jasper Goseling, Richard Boucherie
Abstract
We study the convergence of off-policy TD(0) with linear function
approximation when used to approximate the expected discounted reward in a
Markov chain. It is well known that the combination of off-policy learning and
function approximation can lead to divergence of the algorithm. Existing
results for this setting modify the algorithm, for instance by reweighing the
updates using importance sampling. This establishes convergence at the expense
of additional complexity. In contrast, our approach is to analyse the standard
algorithm, but to restrict our attention to the class of reversible Markov
chains. We demonstrate convergence under this mild reversibility condition on
the structure of the chain, which in many applications can be assumed using
domain knowledge. In particular, we establish a convergence guarantee under an
upper bound on the discount factor in terms of the difference between the
on-policy and off-policy process. This improves upon known results in the
literature that state that convergence holds for a sufficiently small discount
factor by establishing an explicit bound. Convergence is with probability one
and achieves projected Bellman error equal to zero. To obtain these results, we
adapt the stochastic approximation framework that was used by Tsitsiklis and
Van Roy [1997 for the on-policy case, to the off-policy case. We illustrate our
results using different types of reversible Markov chains, such as
one-dimensional random walks and random walks on a weighted graph.
Ссылки и действия
Дополнительные ресурсы: