Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
2511.04834v1
cs.LG, cs.AI, cs.CV
2025-11-11
Авторы:
Jiwoo Shin, Byeonghu Na, Mina Kang, Wonhyeok Choi, Il-chul Moon
Abstract
Recent advances in text-to-image generative models have raised concerns about
their potential to produce harmful content when provided with malicious input
text prompts. To address this issue, two main approaches have emerged: (1)
fine-tuning the model to unlearn harmful concepts and (2) training-free
guidance methods that leverage negative prompts. However, we observe that
combining these two orthogonal approaches often leads to marginal or even
degraded defense performance. This observation indicates a critical
incompatibility between two paradigms, which hinders their combined
effectiveness. In this work, we address this issue by proposing a conceptually
simple yet experimentally robust method: replacing the negative prompts used in
training-free methods with implicit negative embeddings obtained through
concept inversion. Our method requires no modification to either approach and
can be easily integrated into existing pipelines. We experimentally validate
its effectiveness on nudity and violence benchmarks, demonstrating consistent
improvements in defense success rate while preserving the core semantics of
input prompts.
Ссылки и действия
Дополнительные ресурсы: