Online Decision Making with Generative Action Sets
2509.25777v1
cs.LG, stat.ML, 68W20, 90B50, 91B06, 68T01, 68Q32, I.2.6
2025-10-02
Авторы:
Jianyu Xu, Vidhi Jain, Bryan Wilder, Aarti Singh
Abstract
With advances in generative AI, decision-making agents can now dynamically
create new actions during online learning, but action generation typically
incurs costs that must be balanced against potential benefits. We study an
online learning problem where an agent can generate new actions at any time
step by paying a one-time cost, with these actions becoming permanently
available for future use. The challenge lies in learning the optimal sequence
of two-fold decisions: which action to take and when to generate new ones,
further complicated by the triangular tradeoffs among exploitation, exploration
and $\textit{creation}$. To solve this problem, we propose a doubly-optimistic
algorithm that employs Lower Confidence Bounds (LCB) for action selection and
Upper Confidence Bounds (UCB) for action generation. Empirical evaluation on
healthcare question-answering datasets demonstrates that our approach achieves
favorable generation-quality tradeoffs compared to baseline strategies. From
theoretical perspectives, we prove that our algorithm achieves the optimal
regret of $O(T^{\frac{d}{d+2}}d^{\frac{d}{d+2}} + d\sqrt{T\log T})$, providing
the first sublinear regret bound for online learning with expanding action
spaces.