Shared Multi-modal Embedding Space for Face-Voice Association

2512.04814v1 cs.SD, cs.CV 2025-12-05

Authors:

Christopher Simic, Korbinian Riedhammer, Tobias Bocklet

Abstract

The FAME 2026 challenge poses two demanding tasks: learning face-voice associations, and doing so in a multilingual setting that includes testing on languages the model was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal Error Rate (EER) of 23.99%.
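To make the pipeline described in the abstract more concrete, the following is a minimal NumPy sketch of its three ingredients: projecting uni-modal features into a shared space, an angular-margin classification loss, and the EER metric. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the adaptive margin formulation, so the sketch substitutes a fixed additive angular margin (ArcFace-style). All function names, feature dimensions, and the `margin` and `scale` values are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def project(features, weight):
    """Hypothetical linear projection head: map uni-modal features of shape
    (n, d_in) into the shared space (n, d_shared) and L2-normalize them."""
    return l2_normalize(features @ weight)


def angular_margin_logits(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Fixed additive angular margin (an assumption, standing in for the
    adaptive variant): the target-class logit becomes scale * cos(theta + margin),
    all other logits stay scale * cos(theta).
    Production implementations usually also guard the case theta + margin > pi."""
    cos = np.clip(embeddings @ l2_normalize(class_weights, axis=0), -1.0, 1.0)
    theta = np.arccos(cos)
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    return scale * np.where(target, np.cos(theta + margin), cos)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def equal_error_rate(scores, is_match):
    """EER: the operating point where the false-accept rate (non-matching
    pairs scored at or above the threshold) is closest to the false-reject
    rate (matching pairs scored below it)."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    thresholds = np.unique(scores)  # sorted unique candidate thresholds
    far = np.array([(scores[~is_match] >= t).mean() for t in thresholds])
    frr = np.array([(scores[is_match] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

In a verification setting of this kind, a face embedding and a voice embedding would each be passed through their own projection head, and the cosine similarity of the two normalized vectors would serve as the score given to `equal_error_rate`.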

