UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models
2510.02194v1
cs.AI, cs.CR, cs.LG
2025-10-04
Authors:
Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
Abstract
Large Language Models (LLMs) have achieved remarkable progress across a wide
range of tasks, but remain vulnerable to safety risks such as harmful content
generation and jailbreak attacks. Existing safety techniques -- including
external guardrails, inference-time guidance, and post-training alignment --
each face limitations in balancing safety, utility, and controllability. In
this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM
safety through safety-aware upcycling. Our approach first identifies
safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE)
structure, where the router acts as a soft guardrail that selectively activates
original MLPs and added safety experts. We further introduce a two-stage SFT
strategy to strengthen safety discrimination while preserving general
capabilities. To enable flexible control at inference time, we introduce a
safety temperature mechanism, allowing dynamic adjustment of the trade-off
between safety and utility. Experiments across multiple benchmarks, base models,
and model scales demonstrate that UpSafe$^\circ$C achieves robust safety
improvements against harmful and jailbreak inputs, while maintaining
competitive performance on general tasks. Moreover, analysis shows that safety
temperature provides fine-grained inference-time control that achieves the
Pareto-optimal frontier between utility and safety. Our results highlight a new
direction for LLM safety: moving from static alignment toward dynamic, modular,
and inference-aware control.
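
The abstract does not specify how the router or the safety temperature is computed, so the following is only a minimal illustrative sketch of the general idea: a router softly mixes the original MLP output with an added safety expert, and an inference-time knob shifts that mix. The names `route`, `mix`, and `safety_bias`, and the choice to apply the knob as an additive bias on the safety expert's router logit, are assumptions made for illustration, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, safety_bias=0.0):
    """Return mixing weights [original_mlp, safety_expert].

    ASSUMPTION: the inference-time control is modeled here as an additive
    bias on the safety expert's logit. The paper's actual safety
    temperature mechanism may differ.
    """
    original_logit, safety_logit = router_logits
    return softmax([original_logit, safety_logit + safety_bias])


def mix(original_out, safety_out, weights):
    """Blend the two expert outputs element-wise using the router weights."""
    w_orig, w_safe = weights
    return [w_orig * o + w_safe * s for o, s in zip(original_out, safety_out)]


# Hypothetical usage: the same router logits, with and without the knob.
default_weights = route([1.0, 0.0])
safer_weights = route([1.0, 0.0], safety_bias=3.0)
blended = mix([1.0, 2.0], [3.0, 4.0], safer_weights)
```

In this sketch, raising `safety_bias` moves weight toward the safety expert without retraining, which mirrors the inference-time trade-off between safety and utility described in the abstract.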