Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
2510.18036v1
cs.SD, cs.LG, eess.AS
2025-10-23
Authors:
Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, Savvas Neofytou, Shashwat Raman, James Myles, Eiman Kanjo
Abstract
Deploying emotion recognition systems in real-world environments where
devices must be small, low-power, and private remains a significant challenge.
This is especially relevant for applications such as tension monitoring,
conflict de-escalation, and responsive wearables, where cloud-based solutions
are impractical. Multimodal emotion recognition has advanced through deep
learning, but most systems remain unsuitable for deployment on
ultra-constrained edge devices. Prior work typically relies on powerful
hardware, lacks real-time performance, or uses unimodal input. This paper
addresses that gap by presenting a hardware-aware emotion recognition system
that combines acoustic and linguistic features using a late-fusion architecture
optimised for Edge TPU. The design integrates a quantised transformer-based
acoustic model with frozen keyword embeddings from a DSResNet-SE network,
enabling real-time inference within a 1.8 MB memory budget at 21-23 ms latency.
The pipeline ensures spectrogram alignment between training and deployment
using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP
samples captured through the Coral Dev Board Micro microphone shows a 6.3%
macro F1 improvement over unimodal baselines. This work demonstrates that
accurate, real-time multimodal emotion inference is achievable on
microcontroller-class edge platforms through task-specific fusion and
hardware-guided model design.
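
As an illustration of the late-fusion idea described above, the following minimal sketch combines an acoustic embedding with a frozen keyword embedding and classifies the result. The abstract does not state how the two streams are fused, so concatenation followed by a single linear softmax layer is an assumption made here for illustration only. The function names, embedding sizes, weights, and the three-class output are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(
    acoustic_emb: Sequence[float],
    keyword_emb: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Fuse two independently computed embeddings and classify them.

    Assumption: fusion is plain concatenation followed by one linear
    layer. The paper's actual fusion head may differ.
    """
    # Late fusion: each modality is encoded separately, and the
    # resulting vectors are only joined at the classifier input.
    fused = list(acoustic_emb) + list(keyword_emb)
    if len(weights) != len(bias):
        raise ValueError("weights and bias must cover the same classes")
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused vector length")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy inputs (hypothetical values, chosen only to make the sketch runnable).
acoustic = [0.2, -0.1, 0.5]   # stand-in for the acoustic model output
keyword = [1.0, 0.0]          # stand-in for the frozen keyword embedding
W = [
    [0.1, 0.0, 0.3, 0.5, -0.2],
    [-0.2, 0.4, 0.0, -0.5, 0.1],
    [0.0, 0.1, -0.1, 0.2, 0.3],
]
b = [0.0, 0.0, 0.0]

probs = late_fusion_predict(acoustic, keyword, W, b)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

Because the keyword branch is frozen in this kind of design, only the fusion weights (here `W` and `b`) would need training in the sketch, which keeps the trainable part small.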