BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation
2511.03498v1
cs.CL, cs.LG
2025-11-07
Авторы:
Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan
Abstract
Large language models work well for technical problem solving in English but
perform poorly when the same questions are asked in Bangla. A simple solution
would be to translate Bangla questions into English first and then use these
models. However, existing Bangla-English translation systems struggle with
technical terms. They often mistranslate specialized vocabulary, which changes
the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a
dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM
fields including computer science, mathematics, physics, chemistry, and
biology. We generated over 12,000 translations using language models and then
used human evaluators to select the highest quality pairs that preserve
technical terminology correctly. We train a T5-based translation model on
BanglaSTEM and test it on two tasks: generating code and solving math problems.
Our results show significant improvements in translation accuracy for technical
content, making it easier for Bangla speakers to use English-focused language
models effectively. Both the BanglaSTEM dataset and the trained translation
model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
Ссылки и действия
Дополнительные ресурсы: