POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation

2511.09232v1 cs.CL, cs.SD 2025-11-15

Авторы:

Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Jin Li, Yizhou Peng, Longbiao Wang, Jianwu Dang, Nyima Tashi

Abstract

Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose \textbf{POTSA} (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

A new kid on the block: Distributional semantics predicts the word-specific tone...

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokk...

CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

MLMA: Towards Multilingual ASR With Mamba-based Architectures

Навигация