Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement

2510.01722v1 cs.SD, cs.AI, eess.AS 2025-10-04

Авторы:

Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari

Abstract

Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement

Авторы:

Abstract

Ссылки и действия

Связанные статьи

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup an...

Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup ...

Aligning Generative Music AI with Human Preferences: Methods and Challenges

Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Featu...

Навигация