Reinforcement Learning in POMDP's via Direct Gradient Ascent

2512.02383v1 cs.LG 2025-12-04

Authors:

Jonathan Baxter, Peter L. Bartlett

Abstract

This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $\beta \in [0,1)$, which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
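
The abstract does not spell out the update rules, so the following is only a minimal sketch of the kind of single-sample-path estimator it describes. It assumes the commonly stated form of the updates: an eligibility trace `z` discounted by $\beta$ and a running average `delta` of reward times trace. The toy environment (`toy_step`), the softmax policy, and all function names are hypothetical illustrations, not taken from the paper. The policy here sees only one constant observation, so it never uses the underlying state.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_step(action, rng):
    """Hypothetical 2-state chain (an assumption for illustration only).

    The next state equals the chosen action with probability 0.9,
    otherwise the other state. Reward is 1 in state 1 and 0 in state 0.
    """
    next_state = action if rng.random() < 0.9 else 1 - action
    return next_state, float(next_state)


def gpomdp(theta, beta, num_steps, seed=0):
    """Sketch of a GPOMDP-style gradient estimate from one sample path.

    theta     -- action preferences of a softmax policy (one observation)
    beta      -- discount in [0, 1) controlling the bias-variance trade-off
    num_steps -- length of the single sample path
    """
    rng = random.Random(seed)
    n = len(theta)
    z = [0.0] * n      # eligibility trace
    delta = [0.0] * n  # running gradient estimate
    for t in range(num_steps):
        probs = softmax(theta)
        action = rng.choices(range(n), weights=probs)[0]
        _, reward = toy_step(action, rng)
        # Gradient of log softmax probability of the chosen action.
        score = [(1.0 if k == action else 0.0) - probs[k] for k in range(n)]
        # Discounted trace of past score vectors.
        z = [beta * zk + sk for zk, sk in zip(z, score)]
        # Incremental average of reward * trace.
        delta = [dk + (reward * zk - dk) / (t + 1) for dk, zk in zip(delta, z)]
    return delta


if __name__ == "__main__":
    print(gpomdp([0.0, 0.0], beta=0.5, num_steps=20000))
```

In this toy chain the reward depends only on the most recent action, so the estimate should point toward raising the preference for action 1 for any $\beta$. In problems with longer-range dependence, a larger $\beta$ would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract refers to.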
