PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs
2510.23891v1
cs.CR, cs.AI, cs.LG
2025-10-30
Авторы:
Jiaqi Xue, Yifei Zhao, Mansour Al Ghanim, Shangqian Gao, Ruimin Sun, Qian Lou, Mengxin Zheng
Abstract
Text watermarking for large language models (LLMs) enables model owners to
verify text origin and protect intellectual property. While watermarking
methods for closed-source LLMs are relatively mature, extending them to
open-source models remains challenging, as developers cannot control the
decoding process. Consequently, owners of open-source LLMs lack practical means
to verify whether text was generated by their models. A core difficulty lies in
embedding watermarks directly into model weights without hurting detectability.
A promising idea is to distill watermarks from a closed-source model into an
open one, but this suffers from (i) poor detectability due to mismatch between
learned and predefined patterns, and (ii) fragility to downstream modifications
such as fine-tuning or model merging. To overcome these limitations, we propose
PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO
jointly trains a watermark policy model with the LLM, producing patterns that
are easier for the model to learn and more consistent with detection criteria.
A regularization term further simulates downstream perturbations and penalizes
degradation in watermark detectability, ensuring robustness under model edits.
Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO
substantially improves both watermark detectability and resilience to model
modifications.
Ссылки и действия
Дополнительные ресурсы: