Large Language Models Are Effective Code Watermarkers
2510.11251v1
cs.CR, cs.AI, cs.LG
2025-10-15
Authors:
Rui Xu, Jiawei Chen, Zhaoxia Yin, Cong Kong, Xinpeng Zhang
Abstract
The widespread use of large language models (LLMs) and open-source code has
raised ethical and security concerns regarding the distribution and attribution
of source code, including unauthorized redistribution, license violations, and
misuse of code for malicious purposes. Watermarking has emerged as a promising
solution for source attribution, but existing techniques rely heavily on
hand-crafted transformation rules, abstract syntax tree (AST) manipulation, or
task-specific training, limiting their scalability and generality across
languages. Moreover, their robustness against attacks remains limited. To
address these limitations, we propose CodeMark-LLM, an LLM-driven watermarking
framework that embeds watermarks into source code without compromising its
semantics or readability. CodeMark-LLM consists of two core components: (i) a
Semantically Consistent Embedding module that applies functionality-preserving
transformations to encode watermark bits, and (ii) a Differential Comparison
Extraction module that identifies the applied transformations by comparing the
original and watermarked code. Leveraging the cross-lingual generalization
ability of LLMs, CodeMark-LLM avoids language-specific engineering and training
pipelines. Extensive experiments across diverse programming languages and
attack scenarios demonstrate its robustness, effectiveness, and scalability.
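
The embed-then-compare idea behind the two modules can be illustrated with a toy sketch. This is not the authors' implementation: CodeMark-LLM relies on an LLM to choose and recognize transformations, whereas the sketch below substitutes a single hand-written rule (rewriting `v += e` as `v = v + (e)`, which is assumed to be applied only to numeric variables). The names `AUGMENTED`, `embed`, and `extract` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical carrier rule: a line of the form "v += e".
# Assumption: v is numeric, so "v = v + (e)" has the same effect.
AUGMENTED = re.compile(r"^(\s*)(\w+) \+= (.+)$")


def embed(source: str, bits: list[int]) -> str:
    """Rewrite the i-th carrier line to its expanded form when bits[i] == 1."""
    out, remaining = [], list(bits)
    for line in source.splitlines():
        m = AUGMENTED.match(line)
        if m and remaining:
            if remaining.pop(0) == 1:
                indent, var, expr = m.groups()
                # Parentheses keep the meaning intact for compound expressions.
                line = f"{indent}{var} = {var} + ({expr})"
        out.append(line)
    if remaining:
        raise ValueError("not enough carrier lines for the watermark")
    return "\n".join(out)


def extract(original: str, watermarked: str, n_bits: int) -> list[int]:
    """Differential comparison: a changed carrier line reads as 1, an unchanged one as 0."""
    bits = []
    for before, after in zip(original.splitlines(), watermarked.splitlines()):
        if AUGMENTED.match(before) and len(bits) < n_bits:
            bits.append(int(before != after))
    return bits


original = "total = 0\ntotal += 5\ncount = 1\ncount += 2\nresult = total * count"
marked = embed(original, [1, 0])
recovered = extract(original, marked, 2)
```

Here `extract` needs the original code alongside the watermarked copy, mirroring the differential comparison described in the abstract. A fixed rule like this one is exactly the kind of hand-crafted, language-specific engineering the paper aims to avoid, so the sketch only conveys the bit-encoding principle.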