Optimal Stopping vs Best-of-$N$ for Inference Time Optimization
2510.01394v1
cs.LG, cs.CL
2025-10-04
Авторы:
Yusuf Kalayci, Vinod Raman, Shaddin Dughmi
Abstract
Large language model (LLM) generation often requires balancing output quality
against inference cost, especially when using multiple generations. We
introduce a new framework for inference-time optimization based on the
classical Pandora's Box problem. Viewing each generation as opening a costly
"box" with random reward, we develop algorithms that decide when to stop
generating without knowing the underlying reward distribution. Our first
contribution is a UCB-style Pandora's Box algorithm, which achieves performance
that is provably close to Weitzman's algorithm, the optimal strategy when the
distribution is known. We further adapt this method to practical LLM settings
by addressing reward scaling across prompts via a Bradley-Terry inspired
transformation. This leads to an adaptive inference-time optimization method
that normalizes rewards and learns stopping thresholds on the fly. Experiments
on the AlpacaFarm and HH-RLHF datasets, using multiple LLM-reward model pairs,
show that our adaptive strategy can obtain the same performance as non-adaptive
Best-of-N sampling while requiring 15-35 percent fewer generations on average.
Our results establish a principled bridge between optimal stopping theory and
inference-time scaling, providing both theoretical performance bounds and
practical efficiency gains for LLM deployment.
Ссылки и действия
Дополнительные ресурсы: