📊 Digest statistics
Total digests: 34022 Added today: 82
Last updated: today
Authors:
Dharma Teja Donepudi
Annotation:
Intra-sentence multilingual speech synthesis (code-switching TTS) remains a
major challenge due to abrupt language shifts, varied scripts, and mismatched
prosody between languages. Conventional TTS systems are typically monolingual
and fail to produce natural, intelligible speech in mixed-language contexts. We
introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution
(SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched
speech generation. SFMS-ALR segme...
Authors:
Rinku Sebastian, Simon O'Keefe, Martin Trefzer
Annotation:
Extracting features from the speech is the most critical process in speech
signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most
widely used features in the majority of the speaker and speech recognition
applications, as the filtering in this feature is similar to the filtering
taking place in the human ear. But the main drawback of this feature is that it
provides only the frequency information of the signal but does not provide the
information about at what time which freq...
📄 Online neural fusion of distortionless differential beamformers for robust speech enhancement
2025-10-30
Authors:
Yuanhang Qian, Kunlong Zhao, Jilu Jin, Xueqin Luo, Gongping Huang, Jingdong Chen, Jacob Benesty
Annotation:
Fixed beamforming is widely used in practice since it does not depend on the
estimation of noise statistics and provides relatively stable performance.
However, a single beamformer cannot adapt to varying acoustic conditions, which
limits its interference suppression capability. To address this, adaptive
convex combination (ACC) algorithms have been introduced, where the outputs of
multiple fixed beamformers are linearly combined to improve robustness.
Nevertheless, ACC often fails in highly non...
Authors:
Brandon James Carone, Iran R. Roman, Pablo Ripollés
Annotation:
Multimodal Large Language Models (LLMs) claim "musical understanding" via
evaluations that conflate listening with score reading. We benchmark three SOTA
LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core
music skills: Syncopation Scoring, Transposition Detection, and Chord Quality
Identification. Moreover, we separate three sources of variability: (i)
perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples
(zero- vs. few-shot manipulations), and (iii) ...
Authors:
Zitong Lan, Yiduo Hao, Mingmin Zhao
Annotation:
Achieving immersive auditory experiences in virtual environments requires
flexible sound modeling that supports dynamic source positions. In this paper,
we introduce a task called resounding, which aims to estimate room impulse
responses at arbitrary emitter locations from a sparse set of measured emitter
positions, analogous to the relighting problem in vision. We leverage the
reciprocity property and introduce Versa, a physics-inspired approach to
facilitating acoustic field learning. Our metho...
Authors:
Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen
Annotation:
In real-world singing voice conversion (SVC) applications, environmental
noise and the demand for expressive output pose significant challenges.
Conventional methods, however, are typically designed without accounting for
real deployment scenarios, as both training and inference usually rely on clean
data. This mismatch hinders practical use, given the inevitable presence of
diverse noise sources and artifacts from music separation. To tackle these
issues, we propose R2-SVC, a robust and express...
Authors:
Jingyue Huang, Zachary Novack, Phillip Long, Yupeng Hou, Ke Chen, Taylor Berg-Kirkpatrick, Julian McAuley
Annotation:
Discrete representation learning has shown promising results across various
domains, including generation and understanding in image, speech and language.
Inspired by these advances, we propose MuseTok, a tokenization method for
symbolic music, and investigate its effectiveness in both music generation and
understanding tasks. MuseTok employs the residual vector quantized-variational
autoencoder (RQ-VAE) on bar-wise music segments within a Transformer-based
encoder-decoder framework, producing m...
Authors:
Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas
Annotation:
Understanding and modeling the relationship between language and sound is
critical for applications such as music information retrieval, text-guided music
generation, and audio captioning. Central to these tasks is the use of joint
language-audio embedding spaces, which map textual descriptions and auditory
content into a shared embedding space. While multimodal embedding models such
as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning
language and audio, their correspo...
Authors:
Mayuri Kate, Suresh Neethirajan
Annotation:
The convergence of IoT sensing, edge computing, and machine learning is
transforming precision livestock farming. Yet bioacoustic data streams remain
underused because of computational complexity and ecological validity
challenges. We present one of the most comprehensive bovine vocalization
datasets to date, with 569 curated clips covering 48 behavioral classes,
recorded across three commercial dairy farms using multiple microphone arrays
and expanded to 2900 samples through domain-informed aug...
Showing 11-20 of 69 entries