Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
2510.09072v1
cs.SD, cs.AI, cs.HC, cs.LG, eess.AS
2025-10-14
Authors:
Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
Abstract
The effectiveness of speech emotion recognition in real-world scenarios is often
hindered by noisy environments and variability across datasets. This paper
introduces a two-step approach to enhance the robustness and generalization of
speech emotion recognition models through improved representation learning.
First, our model employs EDRL (Emotion-Disentangled Representation Learning) to
extract class-specific discriminative features while preserving shared
similarities across emotion categories. Next, MEA (Multiblock Embedding
Alignment) refines these representations by projecting them into a joint
discriminative latent subspace that maximizes covariance with the original
speech input. The learned EDRL-MEA embeddings are subsequently used to train an
emotion classifier using clean samples from publicly available datasets, and
are evaluated on unseen noisy and cross-corpus speech samples. Improved
performance under these challenging conditions demonstrates the effectiveness
of the proposed method.
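
The abstract does not specify how MEA computes its projection. As a rough illustration of what "maximizing covariance with the original speech input" can mean, the sketch below uses a generic PLS-style construction: it takes the leading singular vectors of the cross-covariance between concatenated embedding blocks and input features. This is an assumption for illustration, not the authors' algorithm; the function name, the block count, the dimensions, and the synthetic data are all hypothetical.

```python
import numpy as np

def covariance_maximizing_projection(Z, X, k):
    """Hypothetical helper: find k directions in embedding space whose
    projections have maximal covariance with directions in input space."""
    Zc = Z - Z.mean(axis=0)            # center the embeddings
    Xc = X - X.mean(axis=0)            # center the input features
    C = Zc.T @ Xc / (len(Z) - 1)       # cross-covariance, shape (d_z, d_x)
    # The leading singular vectors of C maximize u^T C v under unit norms.
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))         # stand-in for input speech features
# Three synthetic "embedding blocks" derived from X plus noise (assumption).
blocks = [X @ rng.normal(size=(12, 8)) + 0.1 * rng.normal(size=(200, 8))
          for _ in range(3)]
Z = np.hstack(blocks)                  # concatenated blocks, shape (200, 24)

W, s = covariance_maximizing_projection(Z, X, k=4)
aligned = (Z - Z.mean(axis=0)) @ W     # projected embeddings, shape (200, 4)
```

In a pipeline like the one described, a classifier would then be trained on such projected embeddings from clean speech and evaluated on noisy or cross-corpus speech; the paper's actual alignment objective may differ from this sketch.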