Foresighted Online Policy Optimization with Interference
2510.15273v1
stat.ML, cs.LG, math.ST, stat.ME, stat.TH
2025-10-21
Authors:
Liner Xiang, Jiayi Wang, Hengrui Cai
Abstract
Contextual bandits, which leverage the baseline features of sequentially
arriving individuals to optimize cumulative rewards while balancing exploration
and exploitation, are critical for online decision-making. Existing approaches
typically assume no interference, where each individual's action affects only
their own reward. This assumption can be violated in many practical scenarios,
and ignoring interference can yield short-sighted policies that maximize only
each individual's immediate outcome, leading to suboptimal decisions and
potentially higher regret over time. To address this gap, we introduce the
foresighted online policy with interference (FRONT), which accounts for the
long-term impact of the current decision on subsequent decisions and rewards. The
proposed FRONT method employs a sequence of exploratory and exploitative
strategies to manage the intricacies of interference, ensuring robust parameter
inference and regret minimization. Theoretically, we establish a tail bound for
the online estimator and derive the asymptotic distribution of the parameters
of interest under suitable conditions on the interference network. We further
show that FRONT attains sublinear regret under two distinct definitions,
capturing both the immediate and consequential impacts of decisions, and we
establish these results with and without statistical inference. Extensive
simulations and a real-world application to urban hotel profits demonstrate
the effectiveness of FRONT.