LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
2510.23320v1
eess.AS, cs.CL, cs.SD
2025-10-29
Authors:
Máté Gedeon, Péter Mihajlik
Abstract
We introduce LibriConvo, a simulated multi-speaker conversational dataset
based on speaker-aware conversation simulation (SASC), designed to support
training and evaluation of speaker diarization and automatic speech recognition
(ASR) systems. Unlike prior resources that mostly rely on semantically
disconnected utterances and implausible temporal gaps, LibriConvo ensures
semantic coherence and realistic conversational timing. Our pipeline leverages
CallHome with external VAD for reliable boundaries, applies compression to
reduce unnaturally long silences, and organizes LibriTTS utterances by book to
maintain contextual consistency. Acoustic realism is enhanced via a novel room
impulse response selection procedure that ranks speaker-microphone
configurations by spatial plausibility, balancing realism and diversity. The
dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers,
split in a speaker-disjoint manner for robust evaluation. Baselines show that
the Sortformer model outperforms the pyannote pipeline in diarization, while a
fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves
7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides
a valuable resource for advancing multi-speaker speech processing research with
realistic conversational dynamics and controlled experimental conditions.
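
To make the silence-compression step concrete, here is a minimal sketch of how overly long gaps between consecutive turns could be shortened while keeping utterance durations and overlaps intact. The abstract does not specify the compression function; the logarithmic squash, the 1.0 s threshold, and the names compress_gap and compress_timeline are all assumptions made for illustration, not the authors' method.

```python
import math

def compress_gap(gap, threshold=1.0):
    """Keep gaps up to `threshold` seconds; shrink longer ones logarithmically.

    Assumption: the log squash is illustrative only, not taken from the paper.
    """
    if gap <= threshold:
        return gap
    return threshold + math.log1p(gap - threshold)

def compress_timeline(segments, threshold=1.0):
    """Retime (start, end, speaker) segments, sorted by start time.

    Positive gaps (silences) are compressed; non-positive gaps (overlaps)
    are left unchanged so overlapping speech keeps its relative timing.
    """
    out = []
    last_end_orig = None  # latest end time seen on the original timeline
    last_end_new = None   # the same point on the compressed timeline
    for start, end, speaker in segments:
        duration = end - start
        if last_end_orig is None:
            new_start = start
        else:
            gap = start - last_end_orig
            new_gap = compress_gap(gap, threshold) if gap > 0 else gap
            new_start = last_end_new + new_gap
        new_end = new_start + duration
        out.append((new_start, new_end, speaker))
        if last_end_orig is None or end > last_end_orig:
            last_end_orig, last_end_new = end, new_end
    return out

segments = [(0.0, 2.0, "A"), (2.5, 4.0, "B"), (9.0, 10.0, "A")]
retimed = compress_timeline(segments)
```

In this example the 0.5 s pause between the first two turns is untouched, while the 5 s silence before the third turn is shortened. Any monotonic function that caps long pauses would serve the same purpose.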