ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
2510.10774v2
cs.SD, cs.AI, cs.HC, cs.LG
2025-10-16
Authors:
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Abstract
Existing Persian speech datasets are typically smaller than their English
counterparts, which creates a key limitation for developing Persian speech
technologies. We address this gap by introducing ParsVoice, the largest Persian
speech corpus designed specifically for text-to-speech (TTS) applications. We
created an automated pipeline that transforms raw audiobook content into
TTS-ready data, incorporating components such as a BERT-based sentence
completion detector, a binary search boundary optimization method for precise
audio-text alignment, and audio-text quality assessment frameworks tailored to
Persian. The pipeline processed 2,000 audiobooks, yielding 3,526 hours of clean
speech, which was further filtered into a 1,804-hour high-quality subset
suitable for TTS, featuring more than 470 speakers. To validate the dataset, we
fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS)
of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5,
demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems.
ParsVoice is the largest high-quality Persian speech dataset, offering speaker
diversity and audio quality comparable to major English corpora. To accelerate
the development of Persian speech technologies, the complete dataset is
publicly available at:
https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
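
The figures quoted in the abstract imply a few derived quantities that help put the corpus size in context. The short calculation below is only an illustrative sketch: it uses the numbers stated above, the variable names are our own, and it is not part of the authors' pipeline. Because the abstract reports "more than 470 speakers", the per-speaker figure is an upper bound on the average rather than an exact value.

```python
# Figures taken directly from the abstract.
audiobooks = 2000        # audiobooks processed by the pipeline
clean_hours = 3526       # hours of clean speech produced
tts_hours = 1804         # hours kept in the high-quality TTS subset
min_speakers = 470       # "more than 470", so this is a lower bound

# Share of the clean speech that survives the TTS quality filter.
retained_fraction = tts_hours / clean_hours

# Average amount of clean speech obtained per audiobook.
clean_hours_per_book = clean_hours / audiobooks

# Upper bound on the average TTS-subset hours per speaker
# (the true average is lower, since there are more than 470 speakers).
max_avg_hours_per_speaker = tts_hours / min_speakers

print(f"Retained for TTS: {retained_fraction:.1%}")
print(f"Clean hours per audiobook: {clean_hours_per_book:.2f}")
print(f"Average TTS hours per speaker: under {max_avg_hours_per_speaker:.2f}")
```

Roughly half of the clean speech is retained in the TTS subset. The per-speaker value is a mean only; the abstract does not state how the hours are distributed across speakers.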