SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
2510.25178v1
cs.SD, cs.AI, eess.AS, I.2.7; H.5.5
2025-11-01
Authors:
Dharma Teja Donepudi
Abstract
Intra-sentence multilingual speech synthesis (code-switching TTS) remains a
major challenge due to abrupt language shifts, varied scripts, and mismatched
prosody between languages. Conventional TTS systems are typically monolingual
and fail to produce natural, intelligible speech in mixed-language contexts. We
introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution
(SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched
speech generation. SFMS-ALR segments input text by Unicode script, applies
adaptive language identification to determine each segment's language and
locale, and normalizes prosody using sentiment-aware adjustments to preserve
expressive continuity across languages. The algorithm generates a unified SSML
representation with appropriate "lang" or "voice" spans and synthesizes the
utterance in a single TTS request. Unlike end-to-end multilingual models,
SFMS-ALR requires no retraining and integrates seamlessly with existing voices
from Google, Apple, Amazon, and other providers. Comparative analysis with
data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's
flexibility, interpretability, and immediate deployability. The framework
establishes a modular baseline for high-quality, engine-independent
multilingual TTS and outlines evaluation strategies for intelligibility,
naturalness, and user preference.
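
The script-first step described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the `DEFAULT_LOCALE` table, the function names, and the `en-US` base locale are hypothetical, the adaptive language identification is replaced by a fixed script-to-locale lookup, and the sentiment-aware prosody normalization is omitted. The sketch only shows how text might be split by Unicode script and wrapped in SSML `<lang>` spans so the whole utterance goes out in one request.

```python
import unicodedata
from xml.sax.saxutils import escape

# Hypothetical script-to-locale defaults. A fixed table stands in for the
# adaptive locale resolution described in the abstract.
DEFAULT_LOCALE = {
    "LATIN": "en-US",
    "DEVANAGARI": "hi-IN",
    "CYRILLIC": "ru-RU",
    "TELUGU": "te-IN",
}


def script_of(ch):
    """Return the first word of the character's Unicode name as a rough
    script label, or None for non-letters (spaces, digits, marks, punctuation)."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def segment_by_script(text):
    """Split text into (script, chunk) runs. Script-neutral characters are
    attached to the run they follow."""
    segments = []  # each entry is [script, chunk]
    for ch in text:
        script = script_of(ch)
        if segments and (script is None or script == segments[-1][0]):
            segments[-1][1] += ch
        elif segments and segments[-1][0] is None:
            # A leading neutral run adopts the first real script it meets.
            segments[-1][0] = script
            segments[-1][1] += ch
        else:
            segments.append([script, ch])
    return [(script, chunk) for script, chunk in segments]


def to_ssml(text, base_locale="en-US"):
    """Build one SSML document, wrapping runs whose locale differs from the
    base locale in <lang> spans."""
    parts = []
    for script, chunk in segment_by_script(text):
        locale = DEFAULT_LOCALE.get(script, base_locale)
        body = escape(chunk)
        if locale == base_locale:
            parts.append(body)
        else:
            parts.append(f'<lang xml:lang="{locale}">{body}</lang>')
    return f'<speak xml:lang="{base_locale}">' + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("I love नमस्ते friends"))
```

Taking the first word of the Unicode character name is a simplification of the Unicode Script property, and a script alone does not determine a language (Latin covers many), which is the gap the abstract's language identification step is meant to close.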