📊 Digest statistics

Total digests: 34022
Added today: 0

Last updated: today

📄 GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis

2025-12-02

Authors:

Teysir Baoueb, Xiaoyu Bie, Mathieu Fontaine, Gaël Richard

Annotation:

Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness in vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension to the WaveGrad vocoder that integrated the Griffin-Lim algorithm (GL...

ID: 2511.22293v1 cs.SD, cs.LG, eess.AS, eess.SP

📄 Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report

2025-11-21

Authors:

Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior

Annotation:

This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on the Coswara and COUGHVID datasets, evaluating both intra-dataset and cross-dataset generalization. We implemented a strict demographic stratification by age and gender to prevent models from exploiting spurious correlations between demographic characteristics and COVID-19 statu...

ID: 2511.14939v1 cs.SD, cs.LG, eess.AS

📄 Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

2025-10-23

Authors:

Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, Savvas Neofytou, Shashwat Raman, James Myles, Eiman Kanjo

Annotation:

Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on...

ID: 2510.18036v1 cs.SD, cs.LG, eess.AS

📄 ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis

2025-10-16

Authors:

Stephen Ni-Hahn, Chao Péter Yang, Mingchen Ma, Cynthia Rudin, Simon Mak, Yue Jiang

Annotation:

Artificial Intelligence (AI) for music generation is undergoing rapid developments, with recent symbolic models leveraging sophisticated deep learning and diffusion model algorithms. One drawback with existing models is that they lack structural cohesion, particularly on harmonic-melodic structure. Furthermore, such existing models are largely "black-box" in nature and are not musically interpretable. This paper addresses these limitations via a novel generative music framework that incorporates...

ID: 2510.10249v1 cs.SD, cs.LG, eess.AS

📄 BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music

2025-10-12

Authors:

Mingyang Yao, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick

Annotation:

Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced...

ID: 2510.06528v1 cs.SD, cs.LG, eess.AS

📄 Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music

2025-10-09

Authors:

Aleksandr Lukoianov, Anssi Klapuri

Annotation:

Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single "right" rhythmic pattern for a given song section. To create a dataset...

ID: 2510.05756v1 cs.SD, cs.LG, eess.AS

📄 Modulation Discovery with Differentiable Digital Signal Processing

2025-10-09

Authors:

Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss

Annotation:

Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise p...

ID: 2510.06204v1 cs.SD, cs.LG, eess.AS

📄 Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation

2025-10-08

Authors:

Kuang Yuan, Yang Gao, Xilin Li, Xinhao Mei, Syavosh Zadissa, Tarun Pruthi, Saeed Bagheri Sereshki

Annotation:

Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions, lacking the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised cont...

ID: 2510.03728v1 cs.SD, cs.LG, eess.AS, eess.SP

📄 Multi-bit Audio Watermarking

2025-10-04

Authors:

Luca A. Lanzendörfer, Kyle Fearne, Florian Grötschla, Roger Wattenhofer

Annotation:

We present Timbru, a post-hoc audio watermarking model that achieves state-of-the-art robustness and imperceptibility trade-offs without training an embedder-detector model. Given any 44.1 kHz stereo music snippet, our method performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by a combined message and perceptual loss. The watermark can then be extracted using a pretrained CLAP model. We evaluate 16-bit watermarking on...

ID: 2510.01968v1 cs.SD, cs.LG, eess.AS

📄 SoundReactor: Frame-level Online Video-to-Audio Generation

2025-10-04

Authors:

Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

Annotation:

Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, wh...

ID: 2510.02110v1 cs.SD, cs.LG, eess.AS

Showing 1-10 of 30 entries