📄 SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

2025-11-01

Authors:

Dharma Teja Donepudi

Annotation:

Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segme...

ID: 2510.25178v1 cs.SD, cs.AI, eess.AS, I.2.7; H.5.5

arXiv PDF

📄 Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient

2025-10-31

Authors:

Rinku Sebastian, Simon O'Keefe, Martin Trefzer

Annotation:

Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. But the main drawback of this feature is that it provides only the frequency information of the signal but does not provide the information about at what time which freq...

ID: 2510.24519v2 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Online neural fusion of distortionless differential beamformers for robust speech enhancement

2025-10-30

Authors:

Yuanhang Qian, Kunlong Zhao, Jilu Jin, Xueqin Luo, Gongping Huang, Jingdong Chen, Jacob Benesty

Annotation:

Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non...

ID: 2510.24497v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient

2025-10-30

Authors:

Rinku Sebastian, Simon O'Keefe, Martin Trefzer

ID: 2510.24519v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Evaluating Multimodal Large Language Models on Core Music Perception Tasks

2025-10-29

Authors:

Brandon James Carone, Iran R. Roman, Pablo Ripollés

Annotation:

Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) ...

ID: 2510.22455v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Resounding Acoustic Fields with Reciprocity

2025-10-25

Authors:

Zitong Lan, Yiduo Hao, Mingmin Zhao

Annotation:

Achieving immersive auditory experiences in virtual environments requires flexible sound modeling that supports dynamic source positions. In this paper, we introduce a task called resounding, which aims to estimate room impulse responses at arbitrary emitter locations from a sparse set of measured emitter positions, analogous to the relighting problem in vision. We leverage the reciprocity property and introduce Versa, a physics-inspired approach to facilitating acoustic field learning. Our metho...

ID: 2510.20602v1 cs.SD, cs.AI, eess.AS, eess.SP

arXiv PDF

📄 R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion

2025-10-25

Authors:

Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen

Annotation:

In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and express...

ID: 2510.20677v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

2025-10-22

Authors:

Jingyue Huang, Zachary Novack, Phillip Long, Yupeng Hou, Ke Chen, Taylor Berg-Kirkpatrick, Julian McAuley

Annotation:

Discrete representation learning has shown promising results across various domains, including generation and understanding in image, speech and language. Inspired by these advances, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok employs the residual vector quantized-variational autoencoder (RQ-VAE) on bar-wise music segments within a Transformer-based encoder-decoder framework, producing m...

ID: 2510.16273v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?

2025-10-18

Authors:

Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas

Annotation:

Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspo...

ID: 2510.14249v1 cs.SD, cs.AI, eess.AS

arXiv PDF

📄 Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare

2025-10-18

Authors:

Mayuri Kate, Suresh Neethirajan

Annotation:

The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain informed aug...

ID: 2510.14443v1 cs.SD, cs.AI, eess.AS

arXiv PDF
