📊 Digest statistics
Total digests: 34022
Added today: 82
Last updated: today
Authors:
Xuanchen Wang, Heng Wang, Weidong Cai
Annotation:
Music is both an auditory and an embodied phenomenon, closely linked to human
motion and naturally expressed through dance. However, most existing audio
representations neglect this embodied dimension, limiting their ability to
capture rhythmic and structural cues that drive movement. We propose
MotionBeat, a framework for motion-aligned music representation learning.
MotionBeat is trained with two newly proposed objectives: the Embodied
Contrastive Loss (ECL), an enhanced InfoNCE formulation wi...
📄 ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
2025-10-16
Authors:
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Annotation:
Existing Persian speech datasets are typically smaller than their English
counterparts, which creates a key limitation for developing Persian speech
technologies. We address this gap by introducing ParsVoice, the largest Persian
speech corpus designed specifically for text-to-speech (TTS) applications. We
created an automated pipeline that transforms raw audiobook content into
TTS-ready data, incorporating components such as a BERT-based sentence
completion detector, a binary search boundary opti...
Authors:
Xuyao Deng, Yanjie Sun, Yong Dou, Kele Xu
Annotation:
Scaling laws have profoundly shaped our understanding of model performance in
computer vision and natural language processing, yet their application to
general audio representation learning remains underexplored. A key challenge
lies in the multifactorial nature of general audio
representation: representation quality is jointly influenced by variables such
as audio length, embedding dimensionality, model depth, model architecture,
data volume, etc., many of which are difficult to isolate or expre...
Authors:
Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng
Annotation:
Audio-Visual Embodied Navigation aims to enable agents to autonomously
navigate to sound sources in unknown 3D environments using auditory cues. While
current AVN methods excel on in-distribution sound sources, they exhibit poor
cross-source generalization: navigation success rates plummet and search paths
become excessively long when agents encounter unheard sounds or unseen
environments. This limitation stems from the lack of explicit alignment
mechanisms between auditory signals and correspon...
📄 TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction
2025-10-16
Authors:
Youhao Si, Yuan Liao, Qiushi Han, Yuhang Yang, Rui Dai, Liya Huang
Annotation:
The rapid development of auditory attention decoding (AAD) based on
electroencephalography (EEG) signals offers the possibility of EEG-driven target
speaker extraction. However, how to effectively utilize the target-speaker
common information between EEG and speech remains an unresolved problem. In
this paper, we propose a model for brain-controlled speaker extraction, which
utilizes the EEG recorded from the listener to extract the target speech. In
order to effectively extract information from EE...
Authors:
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
Annotation:
Contrastive audio-language pretraining yields powerful joint representations,
yet a persistent audio-text modality gap limits the benefits of coupling
multimodal encoders with large language models (LLMs). We present
Diffusion-Link, a diffusion-based modality-bridging module that generatively
maps audio embeddings into the text-embedding distribution. The module is
trained at the output embedding from the frozen multimodal encoder and
implemented as a lightweight network with three residual MLP ...
Authors:
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee
Annotation:
Recent advancements in large multimodal models (LMMs) have shown strong
capabilities in audio understanding. However, most systems rely solely on
end-to-end reasoning, limiting interpretability and accuracy for tasks that
require structured knowledge or specialized signal analysis. In this work, we
present Audio-Maestro -- a tool-augmented audio reasoning framework that
enables audio-language models to autonomously call external tools and integrate
their timestamped outputs into the reasoning pr...
Authors:
Alain Riou, Joan Serrà, Yuki Mitsufuji
Annotation:
Sampling, the technique of reusing pieces of existing audio tracks to create
new music content, is a very common practice in modern music production. In
this paper, we tackle the challenging task of automatic sample identification,
that is, detecting such sampled content and retrieving the material from which
it originates. To do so, we adopt a self-supervised learning approach that
leverages a multi-track dataset to create positive pairs of artificial mixes,
and design a novel contrastive learn...
Authors:
Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu
Annotation:
Text-to-audio (TTA) generation with fine-grained control signals, e.g.,
precise timing control or intelligible speech content, has been explored in
recent works. However, constrained by data scarcity, their generation
performance at scale is still compromised. In this study, we recast
controllable TTA generation as a multi-task learning problem and introduce a
progressive diffusion modeling approach, ControlAudio. Our method adeptly fits
distributions conditioned on more fine-grained information...
📄 DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
2025-10-14
Authors:
Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
Annotation:
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates
strong expressiveness but remains limited by data scarcity and model
scalability. We introduce a two-stage pipeline: a compact seed set of
human-sung recordings is constructed by pairing fixed melodies with diverse
LLM-generated lyrics, and melody-specific models are trained to synthesize over
500 hours of high-quality Chinese singing data. Building on this corpus, we
propose DiTSinger, a Diffusion Transformer with RoP...
Showing 81-90 of 274 entries