Sandbagging in a Simple Survival Bandit Problem
2509.26239v1
cs.LG, cs.AI, stat.ML
2025-10-02
Authors:
Joel Dyer, Daniel Jarne Ornia, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge
Abstract
Evaluating the safety of frontier AI systems is an increasingly important
concern, helping to measure the capabilities of such models and identify risks
before deployment. However, it has been recognised that if AI agents are aware
that they are being evaluated, such agents may deliberately hide dangerous
capabilities or intentionally demonstrate suboptimal performance in
safety-related tasks in order to be released and to avoid being deactivated or
retrained. Such strategic deception, often known as "sandbagging", threatens
to undermine the integrity of safety evaluations. For this reason, it is of
value to identify methods that enable us to distinguish behavioural patterns
that demonstrate a true lack of capability from behavioural patterns that are
consistent with sandbagging. In this paper, we develop a simple model of
strategic deception in sequential decision-making tasks, inspired by the
recently developed survival bandit framework. We demonstrate theoretically that
this problem induces sandbagging behaviour in optimal rational agents, and
construct a statistical test to distinguish between sandbagging and
incompetence from sequences of test scores. In simulation experiments, we
investigate the reliability of this test in allowing us to distinguish between
such behaviours in bandit models. This work aims to establish a potential
avenue for developing robust statistical procedures for use in the science of
frontier model evaluations.