SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

2511.16743v1 cs.CV, cs.AI, cs.LG 2025-11-24

Авторы:

Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah

Abstract

Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

Авторы:

Abstract

Ссылки и действия

Связанные статьи

PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispec...

ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fra...

GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy...

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video...

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Навигация