Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
2510.01722v1
cs.SD, cs.AI, eess.AS
2025-10-04
Authors:
Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
Abstract
Current emotional Text-To-Speech (TTS) and style transfer methods rely on
reference encoders to control global style or emotion vectors, but do not
capture nuanced acoustic details of the reference speech. To address this, we
propose a novel emotional TTS method that predicts fine-grained phoneme-level
emotion embeddings while disentangling intrinsic attributes of the
reference speech. The proposed method uses a style disentanglement scheme to
guide two feature extractors, reducing the mutual information between timbre and
emotion features and thereby separating distinct style components from the
reference speech. Experimental results demonstrate that our method outperforms
baseline TTS systems in generating natural and emotionally rich speech. This
work highlights the potential of disentangled and fine-grained representations
in advancing the quality and flexibility of emotional TTS systems.
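
The abstract does not say which mutual-information estimator guides the two feature extractors. As a rough illustration of the quantity being reduced, the sketch below estimates the mutual information between two sets of feature vectors under a joint-Gaussian assumption. The function name, the synthetic "timbre" and "emotion" features, and the Gaussian closed form are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def gaussian_mutual_information(x, y, eps=1e-6):
    """Estimate I(X; Y) in nats, assuming X and Y are jointly Gaussian.

    x has shape (n_samples, dx) and y has shape (n_samples, dy).
    Hypothetical helper for illustration; not the paper's estimator.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x.shape[1]
    joint = np.hstack([x, y])
    # Small diagonal term keeps the covariance matrix positive definite.
    cov = np.cov(joint, rowvar=False) + eps * np.eye(joint.shape[1])
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_joint = np.linalg.slogdet(cov)
    # I(X; Y) = 0.5 * (log|Cov_x| + log|Cov_y| - log|Cov_joint|)
    return 0.5 * (logdet_x + logdet_y - logdet_joint)


# Synthetic stand-ins for extractor outputs (assumed, not real speech features).
rng = np.random.default_rng(0)
timbre = rng.normal(size=(2000, 4))
entangled_emotion = 0.8 * timbre + 0.2 * rng.normal(size=(2000, 4))
independent_emotion = rng.normal(size=(2000, 4))

mi_entangled = gaussian_mutual_information(timbre, entangled_emotion)
mi_independent = gaussian_mutual_information(timbre, independent_emotion)
print(f"entangled MI:   {mi_entangled:.3f} nats")
print(f"independent MI: {mi_independent:.3f} nats")
```

In this toy setting, emotion features that leak timbre give a large estimate, while independent features give an estimate near zero. A disentanglement objective would push the extractors toward the second case, typically with a differentiable neural estimator, not this closed form.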