Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
2510.19851v1
cs.CR, cs.AI
2025-10-26
Авторы:
Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner
Abstract
Recent findings suggest that misaligned models may exhibit deceptive
behavior, raising concerns about output trustworthiness. Chain-of-thought (CoT)
is a promising tool for alignment monitoring: when models articulate their
reasoning faithfully, monitors can detect and mitigate harmful behaviors before
undesirable outcomes occur. However, a key uncertainty is: Can models obfuscate
their CoT in order to pursue hidden adversarial objectives while evading
detection? To answer this question and thus stress-test CoT monitorability, we
develop a composable and quantifiable taxonomy of prompts to elicit CoT
obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT
(prompted reasoning in outputs) using toy tasks and more realistic environments
in SHADE-Arena. We show that: (i) CoT monitoring performs accurately and
efficiently without obfuscation pressure. (ii) Under strong obfuscation
pressure, some models successfully complete adversarial tasks while evading
detection. (iii) Models do not obfuscate their internal CoT as much as their
external CoT (under prompt pressure). These results suggest that while CoT
provides valuable oversight in benign settings, robust deployment requires
model-specific stress-testing of monitorability.
Ссылки и действия
Дополнительные ресурсы: