SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
2510.23960v1
cs.CV, cs.AI, cs.CR
2025-10-30
Authors:
Peiyang Xu, Minzhou Pan, Zhaorun Chen, Shuang Yang, Chaowei Xiao, Bo Li
Abstract
With the rapid proliferation of digital media, the need for efficient and
transparent safeguards against unsafe content is more critical than ever.
Traditional image guardrail models, constrained by predefined categories, often
misclassify content due to their pure feature-based learning without semantic
reasoning. Moreover, these models struggle to adapt to emerging threats,
requiring costly retraining each time one appears. To address these limitations, we
introduce SafeVision, a novel image guardrail that integrates human-like
reasoning to enhance adaptability and transparency. Our approach incorporates
an effective data collection and generation framework, a policy-following
training pipeline, and a customized loss function. We also propose a diverse QA
generation and training strategy to enhance learning effectiveness. SafeVision
dynamically aligns with evolving safety policies at inference time, eliminating
the need for retraining while ensuring precise risk assessments and
explanations. Recognizing the limitations of existing unsafe image benchmarks,
which either lack granularity or cover limited risks, we introduce VisionHarm,
a high-quality dataset comprising two subsets: VisionHarm Third-party
(VisionHarm-T) and VisionHarm Comprehensive (VisionHarm-C), spanning diverse
harmful categories. Through extensive experiments, we show that SafeVision
achieves state-of-the-art performance on different benchmarks. SafeVision
outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while
being over 16x faster. SafeVision establishes a comprehensive, policy-following, and
explainable image guardrail with dynamic adaptation to emerging threats.