A Frequency-Domain Analysis of the Multi-Armed Bandit Problem: A New Perspective on the Exploration-Exploitation Trade-off
2510.08908v1
cs.LG, cs.AI, cs.IT, math.IT, math.OC, stat.ML, 68T05, 62L05, 94A12, I.2.6; G.3
2025-10-14
Authors:
Di Zhang
Abstract
The stochastic multi-armed bandit (MAB) problem is one of the most
fundamental models in sequential decision-making; its core challenge is the
trade-off between exploration and exploitation. Although algorithms such as
Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret
theories, are well established, existing analyses work primarily in the time
domain and in terms of cumulative regret, and they struggle to characterize
the dynamics of the learning process. This paper proposes a novel
frequency-domain analysis framework, reformulating the bandit process as a
signal processing problem. Within this framework, the reward estimate of each
arm is viewed as a spectral component, with its uncertainty corresponding to
the component's frequency, and the bandit algorithm is interpreted as an
adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove
the main theorem: the confidence bound term in the UCB algorithm is equivalent
in the frequency domain to a time-varying gain applied to uncertain spectral
components, a gain inversely proportional to the square root of the visit
count. Building on this result, we derive finite-time dynamic bounds on the
decay of the exploration rate. This theory not only provides a novel and intuitive
physical interpretation for classical algorithms but also lays a rigorous
theoretical foundation for designing next-generation algorithms with adaptive
parameter adjustment.
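
The gain described in the main theorem can be illustrated with the standard UCB1 exploration bonus. The sketch below is not the paper's Frequency-Domain Bandit Model, which the abstract does not spell out. It assumes the textbook bonus sqrt(c ln t / n), an assumed constant c = 2, and hypothetical helper names (`ucb_bonus`, `ucb_index`), and it only shows how the bonus scales inversely with the square root of the visit count.

```python
import math


def ucb_bonus(t: int, n: int, c: float = 2.0) -> float:
    """Textbook UCB1 exploration bonus sqrt(c * ln(t) / n).

    t: total number of rounds played so far (assumed >= 1)
    n: number of times this arm has been pulled (assumed >= 1)
    c: exploration constant (assumption: 2.0, as in classical UCB1)
    """
    if t < 1 or n < 1:
        raise ValueError("t and n must both be at least 1")
    return math.sqrt(c * math.log(t) / n)


def ucb_index(mean_estimate: float, t: int, n: int, c: float = 2.0) -> float:
    """UCB index: empirical mean plus the exploration bonus."""
    return mean_estimate + ucb_bonus(t, n, c)


# At a fixed round t, compare the bonus of an arm pulled 10 times
# with the bonus of an arm pulled 40 times.
bonus_few_visits = ucb_bonus(t=1000, n=10)
bonus_many_visits = ucb_bonus(t=1000, n=40)
ratio = bonus_few_visits / bonus_many_visits
print(f"bonus ratio (n=10 vs n=40): {ratio:.3f}")
```

At a fixed round, quadrupling an arm's visit count halves its bonus, which is the inverse-square-root behaviour that the abstract interprets as a time-varying gain.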