📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 Schrödinger Bridge Mamba for One-Step Speech Enhancement

2025-10-22

Авторы:

Jing Yang, Sirui Wang, Chao Wu, Fan Fan

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We propose Schr\"odinger Bridge Mamba (SBM), a new concept of training-inference framework motivated by the inherent compatibility between Schr\"odinger Bridge (SB) training paradigm and selective state-space model Mamba. We exemplify the concept of SBM with an implementation for generative speech enhancement. Experiments on a joint denoising and dereverberation task using four benchmark datasets demonstrate that SBM, with only 1-step inference, outperforms strong baselines with 1-step or iterat...

ID: 2510.16834v1 cs.SD, cs.AI, cs.LG, eess.AS

arXiv PDF

📄 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

2025-10-22

Авторы:

Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang, Chih-Kai Yang, Szu-Wei Fu, Zhehuai Chen, Ke-Han Lu, Sung-Feng Huang, Chao-Han Huck Yang, Yu-Chiang Frank Wang, Yun-Nung Chen, Hung-yi Lee

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several s...

ID: 2510.16893v1 cs.SD, cs.AI, cs.CL, eess.AS

arXiv PDF

📄 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

2025-10-22

Авторы:

Chih-Kai Yang, Yen-Ting Piao, Tzu-Wen Hsu, Szu-Wei Fu, Zhehuai Chen, Ke-Han Lu, Sung-Feng Huang, Chao-Han Huck Yang, Yu-Chiang Frank Wang, Yun-Nung Chen, Hung-yi Lee

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editi...

ID: 2510.16917v1 cs.SD, cs.AI, cs.CL, eess.AS

arXiv PDF

📄 DDSC: Dynamic Dual-Signal Curriculum for Data-Efficient Acoustic Scene Classification under Domain Shift

2025-10-22

Авторы:

Peihong Zhang, Yuxuan Liu, Rui Sang, Zhixin Li, Yiqiang Cai, Yizhou Tan, Shengchen Li

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Acoustic scene classification (ASC) suffers from device-induced domain shift, especially when labels are limited. Prior work focuses on curriculum-based training schedules that structure data presentation by ordering or reweighting training examples from easy-to-hard to facilitate learning; however, existing curricula are static, fixing the ordering or the weights before training and ignoring that example difficulty and marginal utility evolve with the learned representation. To overcome this li...

ID: 2510.17345v1 cs.SD, cs.AI

arXiv PDF

📄 TopSeg: A Multi-Scale Topological Framework for Data-Efficient Heart Sound Segmentation

2025-10-22

Авторы:

Peihong Zhang, Zhixin Li, Yuxuan Liu, Rui Sang, Yiqiang Cai, Yizhou Tan, Shengchen Li

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Deep learning approaches for heart-sound (PCG) segmentation built on time--frequency features can be accurate but often rely on large expert-labeled datasets, limiting robustness and deployment. We present TopSeg, a topological representation-centric framework that encodes PCG dynamics with multi-scale topological features and decodes them using a lightweight temporal convolutional network (TCN) with an order- and duration-constrained inference step. To evaluate data efficiency and generalizatio...

ID: 2510.17346v1 cs.SD, cs.AI

arXiv PDF

📄 SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models

2025-10-21

Авторы:

Rachmad Vidya Wicaksana Putra, Aadithyan Rajesh Nair, Muhammad Shafique

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Speech disorders can significantly affect the patients capability to communicate, learn, and socialize. However, existing speech therapy solutions (e.g., therapist or tools) are still limited and costly, hence such solutions remain inadequate for serving millions of patients worldwide. To address this, state-of-the-art methods employ neural network (NN) algorithms to help accurately detecting speech disorders. However, these methods do not provide therapy recommendation as feedback, hence provid...

ID: 2510.15566v1 cs.SD, cs.AI, cs.NE

arXiv PDF

📄 Beat Tracking as Object Detection

2025-10-20

Авторы:

Jaehoon Ahn, Moon-Ryul Jung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confiden...

ID: 2510.14391v2 cs.SD, cs.AI, cs.LG

arXiv PDF

📄 Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?

2025-10-18

Авторы:

Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval,text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspo...

ID: 2510.14249v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Beat Detection as Object Detection

2025-10-18

Авторы:

Jaehoon Ahn, Moon-Ryul Jung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

ID: 2510.14391v1 cs.SD, cs.AI, cs.LG

arXiv PDF

📄 Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare

2025-10-18

Авторы:

Mayuri Kate, Suresh Neethirajan

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain informed aug...

ID: 2510.14443v1 cs.SD, cs.AI, eess.AS

arXiv PDF

Показано 71 - 80 из 274 записей