Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

2510.06036v1 cs.AI, cs.CR 2025-10-09

Authors:

Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu

Abstract

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon that we term the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
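The selection idea behind Cliff-as-a-Judge can be illustrated with a minimal sketch. The abstract does not give the exact cliff metric, so the definition below is an assumption: the cliff is taken as the mean probe score over the thinking tokens minus the mean over the last `n_final` tokens. The names `refusal_cliff`, `select_by_cliff`, `n_final`, and the toy scores are hypothetical, and the per-token refusal scores are assumed to come from an already trained linear probe.

```python
from statistics import mean


def refusal_cliff(scores, n_final=2):
    """Assumed metric: mean score before the last n_final tokens minus the mean over them."""
    if len(scores) <= n_final:
        raise ValueError("need more token positions than n_final")
    plateau = mean(scores[:-n_final])  # refusal intention during thinking
    tail = mean(scores[-n_final:])     # refusal intention just before the output
    return plateau - tail


def select_by_cliff(examples, fraction=0.017, n_final=2):
    """Keep the given fraction of examples with the largest cliff (at least one)."""
    ranked = sorted(
        examples,
        key=lambda ex: refusal_cliff(ex["scores"], n_final),
        reverse=True,
    )
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


# Toy per-token probe scores (hypothetical data, not from the paper).
examples = [
    {"id": "a", "scores": [0.9, 0.9, 0.9, 0.1, 0.1]},  # strong refusal, then a sharp drop
    {"id": "b", "scores": [0.8, 0.8, 0.8, 0.8, 0.8]},  # refusal maintained throughout
    {"id": "c", "scores": [0.2, 0.2, 0.2, 0.1, 0.1]},  # weak refusal from the start
]
selected = select_by_cliff(examples, fraction=0.34)
print([ex["id"] for ex in selected])
```

The default `fraction=0.017` only mirrors the 1.7% figure quoted in the abstract; the toy call uses a larger fraction so that a three-example list still yields one selection.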


