MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
2510.11579v1
cs.CV, cs.LG
2025-10-15
Авторы:
Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang
Abstract
Multimodal Sentiment Analysis (MSA) aims to identify and interpret human
emotions by integrating information from heterogeneous data sources such as
text, video, and audio. While deep learning models have advanced in network
architecture design, they remain heavily limited by scarce multimodal annotated
data. Although Mixup-based augmentation improves generalization in unimodal
tasks, its direct application to MSA introduces critical challenges: random
mixing often amplifies label ambiguity and semantic inconsistency due to the
lack of emotion-aware mixing mechanisms. To overcome these issues, we propose
MS-Mix, an adaptive, emotion-sensitive augmentation framework that
automatically optimizes sample mixing in multimodal settings. The key
components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS)
strategy that effectively prevents semantic confusion caused by mixing samples
with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module
using multi-head self-attention to compute modality-specific mixing ratios
dynamically based on their respective emotional intensities. (3) a Sentiment
Alignment Loss (SAL) that aligns the prediction distributions across
modalities, and incorporates the Kullback-Leibler-based loss as an additional
regularization term to train the emotion intensity predictor and the backbone
network jointly. Extensive experiments on three benchmark datasets with six
state-of-the-art backbones confirm that MS-Mix consistently outperforms
existing methods, establishing a new standard for robust multimodal sentiment
augmentation. The source code is available at:
https://github.com/HongyuZhu-s/MS-Mix.
Ссылки и действия
Дополнительные ресурсы: