DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
2510.18904v1
cs.CL, cs.AI, cs.IR, cs.LG
2025-10-24
Authors:
Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu, Sunishchal Dev, Vasu Sharma
Abstract
The growing use of Large Language Models (LLMs) to generate multilingual
text and source code has heightened the need for machine-generated
content detectors that are accurate and efficient across domains. Current
detectors, which rely predominantly on zero-shot methods such as Fast-DetectGPT or
GPTZero, either incur high computational cost or lack sufficient accuracy,
often trading one for the other, which leaves room for improvement.
To address these gaps, we propose fine-tuning encoder-only Small
Language Models (SLMs), specifically the pre-trained RoBERTa and
CodeBERTa models, on specialized datasets of source code and natural-language
text, and show that for binary classification these SLMs outperform LLMs by a
large margin while using a fraction of the compute. Our encoders achieve AUROC $=
0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by
$8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under
cross-generator shifts and adversarial transformations (paraphrase,
back-translation; code formatting/renaming), performance retains $\geq 92\%$ of
clean AUROC. We release training and evaluation scripts with seeds and configs;
a reproducibility checklist is also included.
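
The abstract reports three kinds of numbers: AUROC, macro-F1, and the fraction of clean AUROC retained under shift. As a minimal sketch of how such quantities are commonly computed for a binary detector (this is not the authors' evaluation script; the function names and the toy labels and scores are hypothetical and chosen only for illustration):

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the rank-sum (Mann-Whitney U) form, using average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one example of each class")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def macro_f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    f1s = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def auroc_retention(clean_auroc: float, shifted_auroc: float) -> float:
    """Fraction of the clean AUROC that survives a shift or transformation."""
    return shifted_auroc / clean_auroc


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = machine-generated, 0 = human-written.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    y_pred = [int(s >= 0.5) for s in y_score]  # assumed 0.5 decision threshold
    print("AUROC:", auroc(y_true, y_score))
    print("macro-F1:", macro_f1(y_true, y_pred))
```

Under this reading, the reported robustness figure corresponds to `auroc_retention(clean, shifted) >= 0.92`, where `shifted` is the AUROC measured on paraphrased, back-translated, or reformatted inputs. The 0.5 threshold is an assumption; the abstract does not state how predictions were thresholded for macro-F1.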