Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
2510.17103v1
cs.LG, stat.ML
2025-10-22
Авторы:
Shinji Ito, Kevin Jamieson, Haipeng Luo, Arnab Maiti, Taira Tsuchiya
Abstract
We study online learning in finite-horizon episodic Markov decision processes
(MDPs) under the challenging aggregate bandit feedback model, where the learner
observes only the cumulative loss incurred in each episode, rather than
individual losses at each state-action pair. While prior work in this setting
has focused exclusively on worst-case analysis, we initiate the study of
best-of-both-worlds (BOBW) algorithms that achieve low regret in both
stochastic and adversarial environments. We propose the first BOBW algorithms
for episodic tabular MDPs with aggregate bandit feedback. In the case of known
transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings
and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish
matching lower bounds, showing the optimality of our algorithms in this
setting. We further extend our approach to unknown-transition settings by
incorporating confidence-based techniques. Our results rely on a combination of
FTRL over occupancy measures, self-bounding techniques, and new loss estimators
inspired by recent advances in online shortest path problems. Along the way, we
also provide the first individual-gap-dependent lower bounds and demonstrate
near-optimal BOBW algorithms for shortest path problems with bandit feedback.
Ссылки и действия
Дополнительные ресурсы: