Achieve Performatively Optimal Policy for Performative Reinforcement Learning
2510.04430v1
cs.LG, math.OC
2025-10-08
Authors:
Ziyi Chen, Heng Huang
Abstract
Performative reinforcement learning is an emerging dynamic decision-making
framework, which extends reinforcement learning to the common applications
where the agent's policy can change the environmental dynamics. Existing works
on performative reinforcement learning only aim at a performatively stable (PS)
policy that maximizes an approximate value function. However, there is a
provably positive constant gap between the PS policy and the desired
performatively optimal (PO) policy that maximizes the original value function.
In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm,
which uses a zeroth-order approximation of the performative policy gradient
in the Frank-Wolfe framework, and obtains the first polynomial-time
convergence to the desired PO policy under the standard regularizer dominance
condition. For the convergence analysis, we prove two important properties of
the nonconvex value function. First, when the policy regularizer dominates the
environmental shift, the value function satisfies a certain gradient dominance
property, so that any stationary point (not PS) of the value function is a
desired PO. Second, though the value function has unbounded gradient, we prove
that all the sufficiently stationary points lie in a convex and compact policy
subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound
$\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous.
Experimental results also demonstrate that our 0-FW algorithm is more effective
than the existing algorithms in finding the desired PO policy.
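
To make the idea of a zeroth-order Frank-Wolfe update concrete, the following is a minimal sketch on a toy problem, not the paper's 0-FW algorithm. It assumes a single-state, entropy-regularized objective with fixed rewards (no performative shift), a coordinate-wise central-difference gradient estimate (the paper's estimator may differ), and a shrunken simplex whose entries are at least `delta`, loosely mirroring $\Pi_{\Delta}$. All function names and parameter values are hypothetical.

```python
import math

def objective(p, rewards, lam):
    """Toy entropy-regularized value: sum_i p_i*r_i - lam * sum_i p_i*log(p_i)."""
    return sum(pi * ri - lam * pi * math.log(pi) for pi, ri in zip(p, rewards))

def zeroth_order_gradient(f, p, h):
    """Estimate the gradient of f at p from function values only.

    Uses central differences along each coordinate (an assumption of this
    sketch). Requires h < min(p) so every query point stays positive.
    """
    grad = []
    for i in range(len(p)):
        up = list(p)
        up[i] += h
        down = list(p)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def linear_maximizer(grad, delta):
    """Return argmax of <grad, s> over {s : s_i >= delta, sum_i s_i = 1}.

    The maximizer is a vertex: delta everywhere, with the remaining mass
    placed on the coordinate with the largest gradient entry.
    """
    n = len(grad)
    best = max(range(n), key=lambda i: grad[i])
    s = [delta] * n
    s[best] += 1.0 - n * delta
    return s

def zeroth_order_frank_wolfe(f, n, delta, steps, h):
    """Maximize f over the shrunken simplex using estimated gradients."""
    p = [1.0 / n] * n  # start from the uniform distribution
    for t in range(steps):
        grad = zeroth_order_gradient(f, p, h)
        s = linear_maximizer(grad, delta)
        gamma = 2.0 / (t + 2)  # classical Frank-Wolfe step size
        p = [pi + gamma * (si - pi) for pi, si in zip(p, s)]
    return p

# Hypothetical toy instance: three actions with fixed rewards.
rewards = [1.0, 2.0, 3.0]
lam = 1.0
f = lambda p: objective(p, rewards, lam)
p_hat = zeroth_order_frank_wolfe(f, n=3, delta=0.05, steps=2000, h=1e-4)

# For this toy objective the unconstrained maximum over the simplex is
# lam * log(sum_i exp(r_i / lam)), attained at softmax(rewards / lam).
optimal_value = lam * math.log(sum(math.exp(r / lam) for r in rewards))
print(p_hat, optimal_value - f(p_hat))
```

Keeping every coordinate at least `delta` is what keeps the `log` term, and hence the gradient, bounded in this toy; it is meant only as an analogue of the role the abstract attributes to $\Pi_{\Delta}$.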