Qomhra: A Bilingual Irish-English Large Language Model
2510.17652v1
cs.CL, I.2.7
2025-10-22
Авторы:
Joseph McInerney
Abstract
This paper introduces Qomhr\'a, a bilingual Irish-English large language
model (LLM), developed under low-resource constraints presenting a complete
pipeline spanning bilingual continued pre-training, instruction tuning, and
alignment from human preferences. Newly accessible Irish corpora and English
text are mixed and curated to improve Irish performance while preserving
English ability. 6 closed-weight LLMs are judged for their Irish text
generation by a native speaker, a learner and other LLMs. Google's
Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise
instruction tuning and human preference datasets. Two datasets are contributed
leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning
dataset and a 1K human preference dataset, generating accepted and rejected
responses that show near perfect alignment with a native Irish speaker.
Qomhr\'a is comprehensively evaluated across benchmarks testing translation,
gender understanding, topic identification and world knowledge with gains of up
to 29% in Irish and 44% in English. Qomhr\'a also undergoes instruction tuning
and demonstrates clear progress in instruction following, crucial for chatbot
functionality.
Ссылки и действия
Дополнительные ресурсы: