Power Constrained Nonstationary Bandits with Habituation and Recovery Dynamics
2511.02944v1
cs.LG, cs.AI, math.OC, stat.ML
2025-11-07
Авторы:
Fengxu Li, Stephanie M. Carpenter, Matthew P. Buman, Yonatan Mintz
Abstract
A common challenge for decision makers is selecting actions whose rewards are
unknown and evolve over time based on prior policies. For instance, repeated
use may reduce an action's effectiveness (habituation), while inactivity may
restore it (recovery). These nonstationarities are captured by the Reducing or
Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world
settings such as behavioral health interventions. While existing algorithms can
compute sublinear regret policies to optimize these settings, they may not
provide sufficient exploration due to overemphasis on exploitation, limiting
the ability to estimate population-level effects. This is a challenge of
particular interest in micro-randomized trials (MRTs) that aid researchers in
developing just-in-time adaptive interventions that have population-level
effects while still providing personalized recommendations to individuals. In
this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored
to the ROGUE framework, and provide theoretical guarantees of sublinear regret.
We then introduce a probability clipping procedure to balance personalization
and population-level learning, with quantified trade-off that balances regret
and minimum exploration probability. Validation on two MRT datasets concerning
physical activity promotion and bipolar disorder treatment shows that our
methods both achieve lower regret than existing approaches and maintain high
statistical power through the clipping procedure without significantly
increasing regret. This enables reliable detection of treatment effects while
accounting for individual behavioral dynamics. For researchers designing MRTs,
our framework offers practical guidance on balancing personalization with
statistical validity.