Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
2510.18713v1
cs.LG, cs.AI, stat.ML
2025-10-23
Авторы:
Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
Abstract
We study online preference-based reinforcement learning (PbRL) with the goal
of improving sample efficiency. While a growing body of theoretical work has
emerged-motivated by PbRL's recent empirical success, particularly in aligning
large language models (LLMs)-most existing studies focus only on pairwise
comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024,
Thekumparampil et al., 2024) have explored using multiple comparisons and
ranking feedback, but their performance guarantees fail to improve-and can even
deteriorate-as the feedback length increases, despite the richer information
available. To address this gap, we adopt the Plackett-Luce (PL) model for
ranking feedback over action subsets and propose M-AUPO, an algorithm that
selects multiple actions by maximizing the average uncertainty within the
offered subset. We prove that M-AUPO achieves a suboptimality gap of
$\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}}
\right)$, where $T$ is the total number of rounds, $d$ is the feature
dimension, and $|S_t|$ is the size of the subset at round $t$. This result
shows that larger subsets directly lead to improved performance and, notably,
the bound avoids the exponential dependence on the unknown parameter's norm,
which was a fundamental limitation in most previous works. Moreover, we
establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}}
\right)$, where $K$ is the maximum subset size. To the best of our knowledge,
this is the first theoretical result in PbRL with ranking feedback that
explicitly shows improved sample efficiency as a function of the subset size.
Ссылки и действия
Дополнительные ресурсы: