Segmenting Collision Sound Sources in Egocentric Videos

2511.13863v2 cs.CV, cs.SD, eess.AS 2025-11-21
Authors:

Kranti Kumar Parida, Omar Emara, Hazel Doughty, Dima Damen

Abstract

Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
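The results above are reported in mIoU (mean intersection over union). The abstract does not describe the evaluation protocol, so the following is only a minimal sketch of the standard metric, not the authors' evaluation code. The function names `mask_iou` and `mean_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average the per-sample IoU over a set of predictions."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred_a = [[1, 1], [0, 0]]
    gt_a = [[1, 0], [0, 0]]   # overlap of 1 pixel, union of 2 pixels
    pred_b = [[0, 1], [0, 1]]
    gt_b = [[0, 1], [0, 1]]   # identical masks
    print(mean_iou([pred_a, pred_b], [gt_a, gt_b]))
```

Under this definition, a $3\times$ improvement means the average overlap ratio between predicted and ground-truth masks is three times that of the baseline, whatever averaging scheme the benchmarks actually use.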
