AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu
2510.16573v1
cs.CL, cs.AI, cs.LG
2025-10-22
Авторы:
Muhammad Ammar, Hadiya Murad Hadi, Usman Majeed Butt
Abstract
Large Language Models (LLMs) are now capable of generating text that closely
resembles human writing, making them powerful tools for content creation, but
this growing ability has also made it harder to tell whether a piece of text
was written by a human or by a machine. This challenge becomes even more
serious for languages like Urdu, where there are very few tools available to
detect AI-generated text. To address this gap, we propose a novel AI-generated
text detection framework tailored for the Urdu language. A balanced dataset
comprising 1,800 humans authored, and 1,800 AI generated texts, sourced from
models such as Gemini, GPT-4o-mini, and Kimi AI was developed. Detailed
linguistic and statistical analysis was conducted, focusing on features such as
character and word counts, vocabulary richness (Type Token Ratio), and N-gram
patterns, with significance evaluated through t-tests and MannWhitney U tests.
Three state-of-the-art multilingual transformer models such as
mdeberta-v3-base, distilbert-base-multilingualcased, and xlm-roberta-base were
fine-tuned on this dataset. The mDeBERTa-v3-base achieved the highest
performance, with an F1-score 91.29 and accuracy of 91.26% on the test set.
This research advances efforts in contesting misinformation and academic
misconduct in Urdu-speaking communities and contributes to the broader
development of NLP tools for low resource languages.
Ссылки и действия
Дополнительные ресурсы: