Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
2510.04354v1
cs.RO, cs.AI, cs.SY, eess.SY
2025-10-08
Авторы:
Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar
Abstract
Rapid progress in imitation learning, foundation models, and large-scale
datasets has led to robot manipulation policies that generalize to a wide-range
of tasks and environments. However, rigorous evaluation of these policies
remains a challenge. Typically in practice, robot policies are often evaluated
on a small number of hardware trials without any statistical assurances. We
present SureSim, a framework to augment large-scale simulation with relatively
small-scale real-world testing to provide reliable inferences on the real-world
performance of a policy. Our key idea is to formalize the problem of combining
real and simulation evaluations as a prediction-powered inference problem, in
which a small number of paired real and simulation evaluations are used to
rectify bias in large-scale simulation. We then leverage non-asymptotic mean
estimation algorithms to provide confidence intervals on mean policy
performance. Using physics-based simulation, we evaluate both diffusion policy
and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and
initial conditions, and find that our approach saves over \(20-25\%\) of
hardware evaluation effort to achieve similar bounds on policy performance.