Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
2510.27254v1
cs.CL, cs.AI, cs.LG
2025-11-04
Authors:
Rajan Agarwal, Aarush Gupta
Abstract
Instruction-tuned Large Language Models (LLMs) underperform on low-resource,
non-Latin scripts due to tokenizer fragmentation and weak cross-lingual
coupling. We present LLINK (Latent Language Injection for Non-English
Knowledge), a compute-efficient language-as-modality method that conditions an
instruction-tuned decoder without changing the tokenizer or retraining the
decoder. First, we align sentence embeddings from a frozen multilingual encoder
to the decoder's latent embedding space at a reserved position via a
lightweight contrastive projector. Second, the vector is expanded into K soft
slots and trained with minimal adapters so the frozen decoder consumes the
signal. LLINK substantially improves bilingual retrieval and achieves 81.3%
preference over the base model and 63.6% over direct fine-tuning in LLM-judged
Q&A evaluations. We further find that the improvements can be attributed to reduced
tokenization inflation and stronger cross-lingual alignment, although the
model retains residual weaknesses in numeric fidelity. Treating low-resource
languages as a modality offers a practical path to stronger cross-lingual
alignment in lightweight LLMs.
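
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, the projector architecture, or how the K slots are produced, so everything below is an assumption: an InfoNCE-style contrastive loss stands in for the "contrastive projector", one linear map per slot stands in for the slot expansion, and the slots replace a single reserved placeholder position in the decoder's input embeddings. All function names and dimensions are hypothetical and chosen only for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [dot(row, x) for row in weights]

def info_nce(projected, targets, temperature=0.07):
    """Assumed stage-1 loss: each projected encoder vector should be closest
    (by cosine similarity) to its own decoder-space target within the batch."""
    p = [normalize(v) for v in projected]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, p_i in enumerate(p):
        logits = [dot(p_i, t_j) / temperature for t_j in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(p)

def expand_to_slots(vector, slot_weights):
    """Assumed stage-2 expansion: one linear map per slot, giving K vectors."""
    return [linear(w, vector) for w in slot_weights]

def inject(token_embeddings, slots, position):
    """Replace the reserved placeholder at `position` with the K slot vectors."""
    return token_embeddings[:position] + slots + token_embeddings[position + 1:]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy dimensions (hypothetical): encoder width 4, decoder width 3, K = 2 slots.
rng = random.Random(0)
d_enc, d_dec, k = 4, 3, 2
projector = random_matrix(d_dec, d_enc, rng)
slot_weights = [random_matrix(d_dec, d_dec, rng) for _ in range(k)]

sentence_embedding = [rng.gauss(0.0, 1.0) for _ in range(d_enc)]
slots = expand_to_slots(linear(projector, sentence_embedding), slot_weights)

prompt = [[0.0] * d_dec for _ in range(5)]  # stand-in decoder input embeddings
conditioned = inject(prompt, slots, position=2)
```

In this sketch the decoder and the multilingual encoder stay frozen; only the projector and slot weights (plus whatever lightweight adapters are used) would be trained. The injected sequence grows by K - 1 positions, since the K slots replace one placeholder.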