📊 Digest statistics
Total digests: 34022 Added today: 82
Last updated: today
📄 ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis
2025-10-16
Authors:
Stephen Ni-Hahn, Chao Péter Yang, Mingchen Ma, Cynthia Rudin, Simon Mak, Yue Jiang
Annotation:
Artificial Intelligence (AI) for music generation is undergoing rapid
developments, with recent symbolic models leveraging sophisticated deep
learning and diffusion model algorithms. One drawback with existing models is
that they lack structural cohesion, particularly on harmonic-melodic structure.
Furthermore, such existing models are largely "black-box" in nature and are not
musically interpretable. This paper addresses these limitations via a novel
generative music framework that incorporates...
Authors:
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Annotation:
Video-to-Audio generation has made remarkable strides in automatically
synthesizing sound for video. However, existing evaluation metrics, which focus
on semantic and temporal alignment, overlook a critical failure mode: models
often generate acoustic events, particularly speech and music, that have no
corresponding visual source. We term this phenomenon Insertion Hallucination
and identify it as a systemic risk driven by dataset biases, such as the
prevalence of off-screen sounds, that remains ...
Authors:
Mingyang Yao, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick
Annotation:
Automatic chord recognition (ACR) via deep learning models has gradually
achieved promising recognition accuracy, yet two key challenges remain. First,
prior work has primarily focused on audio-domain ACR, while symbolic music
(e.g., score) ACR has received limited attention due to data scarcity. Second,
existing methods still overlook strategies that are aligned with human music
analytical practices. To address these challenges, we make two contributions:
(1) we introduce POP909-CL, an enhanced...
Authors:
Aleksandr Lukoianov, Anssi Klapuri
Annotation:
Whereas chord transcription has received considerable attention during the
past couple of decades, far less work has been devoted to transcribing and
encoding the rhythmic patterns that occur in a song. The topic is especially
relevant for instruments such as the rhythm guitar, which is typically played
by strumming rhythmic patterns that repeat and vary over time. However, in many
cases one cannot objectively define a single "right" rhythmic pattern for a
given song section. To create a dataset...
Authors:
Akshay Muppidi, Martin Radfar
Annotation:
Speech emotion recognition (SER) is pivotal for enhancing human-machine
interactions. This paper introduces "EmoHRNet", a novel adaptation of
High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is
designed to maintain high-resolution representations from the initial to the
final layers. By transforming audio samples into spectrograms, EmoHRNet
leverages the HRNet architecture to extract high-level features. EmoHRNet's
unique architecture maintains high-resolution representatio...
Authors:
Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
Annotation:
Modulations are a critical part of sound design and music production,
enabling the creation of complex and evolving audio. Modern synthesizers
provide envelopes, low frequency oscillators (LFOs), and more parameter
automation tools that allow users to modulate the output with ease. However,
determining the modulation signals used to create a sound is difficult, and
existing sound-matching / parameter estimation systems are often
uninterpretable black boxes or predict high-dimensional framewise p...
Authors:
Kuang Yuan, Yang Gao, Xilin Li, Xinhao Mei, Syavosh Zadissa, Tarun Pruthi, Saeed Bagheri Sereshki
Annotation:
Acoustic scene classification (ASC) models on edge devices typically operate
under fixed class assumptions, lacking the transferability needed for
real-world applications that require adaptation to new or refined acoustic
categories. We propose ContrastASC, which learns generalizable acoustic scene
representations by structuring the embedding space to preserve semantic
relationships between scenes, enabling adaptation to unseen categories without
retraining. Our approach combines supervised cont...
Authors:
Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang
Annotation:
While language models (LMs) paired with residual vector quantization (RVQ)
tokenizers have shown promise in text-to-audio (T2A) generation, they still lag
behind diffusion-based models by a non-trivial margin. We identify a critical
dilemma underpinning this gap: incorporating more RVQ layers improves audio
reconstruction fidelity but exceeds the generation capacity of conventional
LMs. To address this, we first analyze RVQ dynamics and uncover two key
limitations: 1) orthogonality of features a...
Authors:
Joann Ching, Gerhard Widmer
Annotation:
Music Emotion Recognition (MER) is a task deeply connected to human
perception, relying heavily on subjective annotations collected from
contributors. Prior studies tend to focus on specific musical styles rather
than incorporating a diverse range of genres, such as rock and classical,
within a single framework. In this paper, we address the task of recognizing
emotion from audio content by investigating five datasets with dimensional
emotion annotations -- EmoMusic, DEAM, PMEmo, WTC, and WCMED ...
Showing 21-30 of 80 entries