📄 Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets

2025-11-19

Authors:

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Tekla Etelka Gráczi, Anna Kohári, Katalin Mády

Annotation:

The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speak...

ID: 2511.13529v1 cs.CL, cs.AI, cs.SD, eess.AS

📄 VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

2025-11-15

Authors:

Yuhao Wang, Ziyang Cheng, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Annotation:

Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, t...

ID: 2511.10232v1 cs.CL, cs.AI, cs.SD

📄 Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

2025-10-22

Authors:

Fu-An Chao, Bi-Cheng Yan, Berlin Chen

Annotation:

In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and ...

ID: 2510.16387v1 cs.CL, cs.AI, cs.SD, eess.AS

📄 Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

2025-10-21

Authors:

Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul

Annotation:

Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positio...

ID: 2510.15231v1 cs.CL, cs.AI, cs.SD, eess.AS

📄 A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

2025-10-17

Authors:

Mohammed Hilal Al-Kharusi, Khizar Hayat, Khalil Bader Al Ruqeishi, Haroon Rashid Lone

Annotation:

The sacred practice of Quranic recitation (Tajweed), governed by precise phonetic, prosodic, and theological rules, faces significant pedagogical challenges in the modern era. While digital technologies promise unprecedented access to education, automated tools for recitation evaluation have failed to achieve widespread adoption or pedagogical efficacy. This literature review investigates this critical gap, conducting a comprehensive analysis of academic research, web platforms, and commercial a...

ID: 2510.12858v1 cs.CL, cs.AI, cs.SD

📄 Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

2025-10-10

Authors:

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi

Annotation:

Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For En...

ID: 2510.06961v2 cs.CL, cs.AI, cs.SD, eess.AS

📄 Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

2025-10-09

Authors:

Rikuto Kotoge, Yuichi Sasaki

Annotation:

Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study,...

ID: 2510.05799v1 cs.CL, cs.AI, cs.SD

📄 SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation

2025-10-04

Authors:

Sangmin Lee, Woongjib Choi, Jihyun Kim, Hong-Goo Kang

Annotation:

In this paper, we present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data. By jointly leveraging these two components, our method overcomes the limitations of conventional approaches in data scarcity and architecture optimization, and generalizes effectively to real-wor...

ID: 2510.00582v1 cs.CL, cs.AI, cs.SD

📄 Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization

2025-10-02

Authors:

Jiacheng Shi, Hongfei Du, Yangfan He, Y. Alicia Hong, Ye Gao

Annotation:

Emotional text-to-speech seeks to convey affect while preserving intelligibility and prosody, yet existing methods rely on coarse labels or proxy classifiers and receive only utterance-level feedback. We introduce Emotion-Aware Stepwise Preference Optimization (EASPO), a post-training framework that aligns diffusion TTS with fine-grained emotional preferences at intermediate denoising steps. Central to our approach is EASPM, a time-conditioned model that scores noisy intermediate speech states a...

ID: 2509.25416v1 cs.CL, cs.AI, cs.SD

📄 SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data

2025-09-25

Authors:

Erik Božík, Marek Šuppa

## Context

Slovak is a low-resource language for automatic speech recognition (ASR). The limited amount of available data and full-text corpora poses serious problems for researchers building effective ASR systems. These problems are especially pressing in conversational AI, where high recognition accuracy is required. Large, high-quality corpora are a key factor for the field, yet such corpora are currently rare for Slovak. We present SloPalSpeech, currently the largest ASR dataset for Slovak, containing 2,806 hours of speech obtained from parliamentary proceedings. The corpus is a significant improvement in size and data quality over previous datasets.

## Method

SloPalSpeech was built with robust data-processing methods. Long recordings of parliamentary proceedings were aligned and split into clean 30-second audio-transcript pairs, yielding a high-quality dataset for training ASR systems. We built a segmentation and alignment pipeline that provides high accuracy and substantially reduces noise in the data, then applied it to produce SloPalSpeech. The dataset was split into training and test sets to support the development and evaluation of ASR systems.

## Results

We ran a series of experiments on SloPalSpeech using OpenAI Whisper models. Fine-tuning Whisper-small, Whisper-medium, and Whisper-large-v3 on our dataset substantially improves speech recognition. The most notable result is a reduction in word error rate (WER) of up to 70% relative to the base model on some standard benchmarks, such as Common Voice and FLEURS. We showed that SloPalSpeech can be used effectively to train ASR systems, even for low-resource languages such as Slovak.

## Significance

Our work matters for the development of ASR systems for low-resource languages. We release SloPalSpeech together with the fully formatted transcripts, more than 60 million words, so that other researchers can continue developing and optimizing ASR systems. The corpus can also be applied in other areas, such as dialogue systems and translation, where speech recognition accuracy is critical.

## Conclusions

We presented SloPalSpeech, the largest ASR dataset for Slovak. Our work showed that the dataset can be used effectively to fine-tune ASR systems, even for low-resource languages. ...

Annotation:

Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medi...

ID: 2509.19270v1 cs.CL, cs.AI, cs.SD
