SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language
2510.22160v1
cs.CL, cs.CV
2025-10-29
Авторы:
Rahul Ranjan, Mahendra Kumar Gurve, Anuj, Nitin, Yamuna Prasad
Abstract
Developing benchmark datasets for low-resource languages poses significant
challenges, primarily due to the limited availability of native linguistic
experts and the substantial time and cost involved in annotation. Given these
challenges, Maithili is still underrepresented in natural language processing
research. It is an Indo-Aryan language spoken by more than 13 million people in
the Purvanchal region of India, valued for its rich linguistic structure and
cultural significance. While sentiment analysis has achieved remarkable
progress in high-resource languages, resources for low-resource languages, such
as Maithili, remain scarce, often restricted to coarse-grained annotations and
lacking interpretability mechanisms. To address this limitation, we introduce a
novel dataset comprising 3,221 Maithili sentences annotated for sentiment
polarity and accompanied by natural language justifications. Moreover, the
dataset is carefully curated and validated by linguistic experts to ensure both
label reliability and contextual fidelity. Notably, the justifications are
written in Maithili, thereby promoting culturally grounded interpretation and
enhancing the explainability of sentiment models. Furthermore, extensive
experiments using both classical machine learning and state-of-the-art
transformer architectures demonstrate the dataset's effectiveness for
interpretable sentiment analysis. Ultimately, this work establishes the first
benchmark for explainable affective computing in Maithili, thus contributing a
valuable resource to the broader advancement of multilingual NLP and
explainable AI.
Ссылки и действия
Дополнительные ресурсы: