📊 Digest statistics

Total digests: 34022
Added today: 82

Last updated: today

📄 Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

2025-10-30

Authors:

Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

Annotation:

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual...

ID: 2510.24103v1 cs.SD, cs.AI, cs.MM, eess.AS

arXiv PDF

📄 Online neural fusion of distortionless differential beamformers for robust speech enhancement

2025-10-30

Authors:

Yuanhang Qian, Kunlong Zhao, Jilu Jin, Xueqin Luo, Gongping Huang, Jingdong Chen, Jacob Benesty

Annotation:

Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non...

ID: 2510.24497v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient

2025-10-30

Authors:

Rinku Sebastian, Simon O'Keefe, Martin Trefzer

Annotation:

Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. But the main drawback of this feature is that it provides only the frequency information of the signal but does not provide the information about at what time which freq...

ID: 2510.24519v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

2025-10-29

Authors:

Ali Vosoughi, Yongyi Zang, Qihui Yang, Nathan Peak, Randal Leistikow, Chenliang Xu

Annotation:

Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIR...

ID: 2510.22439v1 cs.SD, cs.AI, I.2.6, H.5.5

arXiv PDF

📄 Evaluating Multimodal Large Language Models on Core Music Perception Tasks

2025-10-29

Authors:

Brandon James Carone, Iran R. Roman, Pablo Ripollés

Annotation:

Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) ...

ID: 2510.22455v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

2025-10-29

Authors:

Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal

Annotation:

Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with ou...

ID: 2510.23530v1 cs.SD, cs.AI, cs.LG, eess.AS

arXiv PDF

📄 UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

2025-10-25

Authors:

Haoyin Yan, Chengwei Liu, Shaofei Xue, Xiaotao Liang, Zheng Xue

Annotation:

The development of neural audio codecs (NACs) has largely promoted applications of language models (LMs) to speech processing and understanding. However, there lacks the verification on the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE). In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input s...

ID: 2510.20441v1 cs.SD, cs.AI

arXiv PDF

📄 Resounding Acoustic Fields with Reciprocity

2025-10-25

Authors:

Zitong Lan, Yiduo Hao, Mingmin Zhao

Annotation:

Achieving immersive auditory experiences in virtual environments requires flexible sound modeling that supports dynamic source positions. In this paper, we introduce a task called resounding, which aims to estimate room impulse responses at arbitrary emitter location from a sparse set of measured emitter positions, analogous to the relighting problem in vision. We leverage the reciprocity property and introduce Versa, a physics-inspired approach to facilitating acoustic field learning. Our metho...

ID: 2510.20602v1 cs.SD, cs.AI, eess.AS, eess.SP

arXiv PDF

📄 R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion

2025-10-25

Authors:

Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen

Annotation:

In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and express...

ID: 2510.20677v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

2025-10-22

Authors:

Jingyue Huang, Zachary Novack, Phillip Long, Yupeng Hou, Ke Chen, Taylor Berg-Kirkpatrick, Julian McAuley

Annotation:

Discrete representation learning has shown promising results across various domains, including generation and understanding in image, speech and language. Inspired by these advances, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok employs the residual vector quantized-variational autoencoder (RQ-VAE) on bar-wise music segments within a Transformer-based encoder-decoder framework, producing m...

ID: 2510.16273v1 cs.SD, cs.AI, eess.AS

arXiv PDF

Showing 61 - 70 of 274 entries