📊 Digest statistics

Total digests: 34022 | Added today: 82

Last updated: today

📄 Towards Audio Token Compression in Large Audio Language Models

2025-11-27

Authors:

Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Annotation:

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation, un...

ID: 2511.20973v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 InstructAudio: Unified speech and music generation with natural language instruction

2025-11-25

Authors:

Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang, Cheng Gong, Chen Zhang, Longbiao Wang, Jianwu Dang

Annotation:

Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing...

ID: 2511.18487v1 eess.AS, cs.AI, cs.CL, cs.SD

arXiv PDF

📄 Unifying Model and Layer Fusion for Speech Foundation Models

2025-11-15

Authors:

Yi-Jen Shih, David Harwath

Annotation:

Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised...

ID: 2511.08389v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models

2025-11-06

Authors:

Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, Eng Siong Chng

Annotation:

Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first benchmark explicitly designed to evaluate SDMs in multi-turn interactive dialogue with an emphasis on emotional intelligence. Multi-Bench employs a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emoti...

ID: 2511.00850v1 eess.AS, cs.AI, cs.CL, cs.SD

arXiv PDF

📄 Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

2025-10-31

Authors:

Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, Éva Székely

Annotation:

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how li...

ID: 2510.25577v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 A Neural Model for Contextual Biasing Score Learning and Filtering

2025-10-30

Authors:

Wanting Huang, Weiran Wang

Annotation:

Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for gro...

ID: 2510.23849v1 eess.AS, cs.AI, cs.CL, cs.SD

arXiv PDF

📄 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

2025-10-24

Authors:

Qianheng Xu

Annotation:

Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-w...

ID: 2510.18938v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 TokenChain: A Discrete Speech Chain via Semantic Token Modeling

2025-10-09

Authors:

Mingxuan Wang, Satoshi Nakamura

Annotation:

Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynam...

ID: 2510.06201v1 eess.AS, cs.AI, cs.CL, cs.SD

arXiv PDF

📄 Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

2025-10-02

Authors:

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass

Annotation:

Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities...

ID: 2509.26388v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

2025-10-01

Authors:

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

#### Context

Video-conditioned sound and speech generation (VSS) is a key direction in artificial intelligence, covering the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks. Existing approaches usually treat these tasks separately, without making them work together. This leads to inefficiency, extra resource requirements, and more complicated training, so combining the two tasks in a single model remains an open problem. Our motivation is to develop a model that effectively unifies V2S and VisualTTS, reducing complexity and improving the quality of the generated output.

#### Method

We propose VSSFlow, a model built on the flow-matching framework. It handles both tasks in a single process, aiming for more effective integration of conditions. The main innovation is a condition aggregation mechanism that processes different input types, such as video and speech transcripts. Different network layers (cross-attention and self-attention) were found to exhibit different inductive biases when conditions are introduced. We exploit these properties by using cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. We also refute the belief that making a model more complex in order to unify tasks degrades quality: with a single training cycle, VSSFlow shows more stable results and faster convergence.

#### Results

We ran experiments on the V2S and VisualTTS tasks using standard datasets. The results show that VSSFlow outperforms existing specialized models and sets new quality records. Particular attention is given to the benefits of a general audio prior, which speeds up training, gives closer adherence to the conditions, and makes generation more stable. The experiments also confirm that the proposed approach considerably simplifies training and improves the quality of the generated output without additional training stages.

#### Significance

VSSFlow has a wide range of applications, including home assistants, entertainment applications, the medical industry, and AI content generators. Our approach is unique in that it combines two previously separate tasks into a single solution, reducing resource costs and improving quality. Its advantages are ease of deployment, improved stability, and improv...

Annotation:

Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, are conventionally addressed as separate tasks, with limited exploration to unify them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we ...

ID: 2509.24773v2 eess.AS, cs.AI, cs.CL, cs.CV, cs.SD

arXiv PDF

Showing 1 - 10 of 24 entries