SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
2510.06871v2
cs.LG, cs.CV
2025-10-10
Авторы:
Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu
Abstract
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal
reasoning but often amplify safety risks under adversarial or unsafe prompts, a
phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at
the output level and do not constrain the reasoning process, leaving models
exposed to implicit risks. In this paper, we propose SaFeR-VLM, a
safety-aligned reinforcement learning framework that embeds safety directly
into multimodal reasoning. The framework integrates four components: (I)
QI-Safe-10K, a curated dataset emphasizing safety-critical and
reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations
undergo reflection and correction instead of being discarded; (III) structured
reward modeling with multi-dimensional weighted criteria and explicit penalties
for hallucinations and contradictions; and (IV) GRPO optimization, which
reinforces both safe and corrected trajectories. This unified design shifts
safety from a passive safeguard to an active driver of reasoning, enabling
scalable and generalizable safety-aware reasoning. SaFeR-VLM further
demonstrates robustness against both explicit and implicit risks, supporting
dynamic and interpretable safety decisions beyond surface-level filtering.
SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and
helpfulness across six benchmarks, surpassing both same-scale and $>10\times$
larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B.
Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass
GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points
respectively on safety metrics, achieving this improvement without any
degradation in helpfulness performance. Our codes are available at
https://github.com/HarveyYi/SaFeR-VLM.
Ссылки и действия
Дополнительные ресурсы: