Power Constrained Nonstationary Bandits with Habituation and Recovery Dynamics

2511.02944v1 cs.LG, cs.AI, math.OC, stat.ML 2025-11-07

Authors:

Fengxu Li, Stephanie M. Carpenter, Matthew P. Buman, Yonatan Mintz

Abstract

A common challenge for decision makers is selecting actions whose rewards are unknown and evolve over time based on prior policies. For instance, repeated use may reduce an action's effectiveness (habituation), while inactivity may restore it (recovery). These nonstationarities are captured by the Reducing or Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world settings such as behavioral health interventions. While existing algorithms can compute sublinear-regret policies for these settings, they may overemphasize exploitation and explore too little, which limits the ability to estimate population-level effects. This challenge is of particular interest in micro-randomized trials (MRTs), which help researchers develop just-in-time adaptive interventions that have population-level effects while still providing personalized recommendations to individuals. In this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored to the ROGUE framework, and provide theoretical guarantees of sublinear regret. We then introduce a probability clipping procedure to balance personalization and population-level learning, with a quantified trade-off between regret and the minimum exploration probability. Validation on two MRT datasets concerning physical activity promotion and bipolar disorder treatment shows that our methods achieve lower regret than existing approaches and, through the clipping procedure, maintain high statistical power without significantly increasing regret. This enables reliable detection of treatment effects while accounting for individual behavioral dynamics. For researchers designing MRTs, our framework offers practical guidance on balancing personalization with statistical validity.
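
The probability clipping idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's ROGUE-TS algorithm: it assumes a two-action setting (treat vs. do not treat), independent Gaussian posteriors over each action's reward, and hypothetical names and bounds (`clip_probability`, `ts_treatment_probability`, `p_min`, `p_max`) chosen only for illustration. The sketch estimates the probability that Thompson Sampling would pick the treatment, then clips it so that neither action's selection probability falls below a floor.

```python
import random


def clip_probability(p, p_min=0.1, p_max=0.9):
    """Clip a treatment probability into [p_min, p_max].

    The bounds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= p_min <= p_max <= 1.0:
        raise ValueError("need 0 <= p_min <= p_max <= 1")
    return min(max(p, p_min), p_max)


def ts_treatment_probability(mean_treat, sd_treat, mean_ctrl, sd_ctrl,
                             n_draws=5000, seed=0):
    """Monte Carlo estimate of the chance that a posterior draw for the
    treatment reward exceeds a posterior draw for the control reward.

    Assumes independent Gaussian posteriors (a simplification).
    """
    rng = random.Random(seed)
    wins = sum(
        rng.gauss(mean_treat, sd_treat) > rng.gauss(mean_ctrl, sd_ctrl)
        for _ in range(n_draws)
    )
    return wins / n_draws


# A posterior that strongly favors treatment: unclipped Thompson Sampling
# would almost never select the control action.
raw = ts_treatment_probability(1.0, 0.1, 0.0, 0.1)
clipped = clip_probability(raw)

# The action is then randomized with the clipped probability.
treat = random.Random(1).random() < clipped
```

With the floor in place, both actions keep being sampled even when the posterior is confident, which is what preserves the ability to estimate a population-level treatment effect; the price is some extra regret whenever the clipped probability differs from the raw one.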

