Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

2512.04551v1 cs.SD, cs.AI, eess.AS 2025-12-06

Authors:

Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao, Ruimin Wang, Chunfeng Wang, Hao Li, Ya Li, Wei Chen

Abstract

Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive losses to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results show that our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
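The abstract does not spell out how the SNR-based mixup works, so the following is only a minimal sketch of the general idea, not the authors' EAM implementation. It assumes two equal-length waveforms are mixed at a chosen signal-to-noise ratio and that the soft-label weight follows the energy share of the primary signal. The function names (`energy`, `mix_at_snr`, `mix_labels`) and the energy-proportional label rule are hypothetical choices made for illustration.

```python
import math


def energy(x):
    """Mean squared amplitude of a waveform given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(primary, secondary, snr_db):
    """Mix two waveforms so the primary-to-secondary energy ratio is snr_db.

    Returns (mixed, lam). lam is the energy share of the primary signal,
    used here as a soft-label weight (an assumption, not from the paper).
    """
    if len(primary) != len(secondary):
        raise ValueError("waveforms must have equal length")
    e_p, e_s = energy(primary), energy(secondary)
    if e_p == 0.0 or e_s == 0.0:
        raise ValueError("waveforms must have non-zero energy")
    ratio = 10.0 ** (snr_db / 10.0)  # linear energy ratio primary/secondary
    # Choose gain so that e_p / (gain**2 * e_s) == ratio.
    gain = math.sqrt(e_p / (e_s * ratio))
    mixed = [p + gain * s for p, s in zip(primary, secondary)]
    lam = ratio / (1.0 + ratio)
    return mixed, lam


def mix_labels(y_primary, y_secondary, lam):
    """Blend two label distributions with weight lam on the primary."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(y_primary, y_secondary)]


# Usage: mix a "happy" and a "sad" utterance at 0 dB (equal energy).
mixed, lam = mix_at_snr([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0], 0.0)
soft_label = mix_labels([1.0, 0.0], [0.0, 1.0], lam)
```

A soft label produced this way is a probability distribution rather than a one-hot vector, which is one plausible reason a Kullback-Leibler divergence term appears among the losses; the abstract itself does not state this connection.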

Related articles

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup ...

Aligning Generative Music AI with Human Preferences: Methods and Challenges

Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Featu...

FoleyBench: A Benchmark For Video-to-Audio Models
