AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement
2510.05295v1
cs.SD, cs.AI, cs.MM
2025-10-09
Authors:
M. Sajid, Deepanshu Gupta, Yash Modi, Sanskriti Jain, Harshith Jai Surya Ganji, A. Rahaman, Harshvardhan Choudhary, Nasir Saleem, Amir Hussain, M. Tanveer
Abstract
In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation
Exchange Architecture with Cross-Attention and Squeezeformer for Speech
Enhancement), a progressive bimodal framework tailored for audio-visual speech
enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual
cues by employing a U-Net-based 1D convolutional encoder for audio and a Swin
Transformer V2 for efficient and expressive visual feature extraction. Central
to the architecture is a novel bidirectional cross-attention mechanism, which
facilitates deep contextual fusion between modalities, enabling rich and
complementary representation learning. To capture temporal dependencies within
the fused embeddings, a stack of lightweight Squeezeformer blocks combining
convolutional and attention modules is introduced. The enhanced embeddings are
then decoded via a U-Net-style decoder for direct waveform reconstruction,
ensuring perceptually consistent and intelligible speech output. In experimental
evaluations, AUREXA-SE achieves significant performance improvements over noisy
baselines, with a STOI of 0.516, a PESQ of 1.323, and an SI-SDR of -4.322 dB.
The source code of AUREXA-SE is available at
https://github.com/mtanveer1/AVSEC-4-Challenge-2025.
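To make the reported SI-SDR figure easier to interpret, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio, written in plain Python. It is not taken from the AUREXA-SE repository; the function name `si_sdr`, the `eps` stabilizer, and the zero-mean normalization step are assumptions made for illustration, and the toy signals are synthetic.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length waveforms.

    Illustrative helper (hypothetical name), following the usual definition:
    project the estimate onto the reference, then compare the energy of the
    projection with the energy of what is left over.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Assumption: remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Target component and residual distortion.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Synthetic example: a clean tone plus an orthogonal interfering tone
# with twice the amplitude (four times the energy).
n_samples = 100
reference = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
noise = [math.cos(2 * math.pi * 13 * n / n_samples) for n in range(n_samples)]
estimate = [r + 2.0 * v for r, v in zip(reference, noise)]

score = si_sdr(estimate, reference)
print(f"SI-SDR: {score:.2f} dB")  # about -6.02 dB
```

In this toy case the residual carries four times the energy of the target component, so the score is 10·log10(1/4), roughly -6.02 dB. A negative SI-SDR, such as the -4.322 dB reported above, likewise indicates that the residual energy still exceeds the energy of the scaled target. Rescaling the estimate by any nonzero constant leaves the score unchanged, which is the property the "scale-invariant" name refers to.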