📊 Digest statistics
Total digests: 34022
Added today: 0
Last updated: today
Authors:
Christiaan M. Geldenhuys, Günther Tonitz, Thomas R. Niesler
Russian summary not found
Available fields: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Abstract:
While recent sound event detection (SED) systems can identify baleen whale
calls in marine audio, challenges related to false positive and minority-class
detection persist. We propose the boundary proposal network (BPN), which
extends an existing lightweight SED system. The BPN is inspired by work in
image object detection and aims to reduce the number of false positive
detections. It achieves this by using intermediate latent representations
computed within the backbone classification model to ...
Authors:
Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley
Russian summary not found
Abstract:
Conventional Convolutional Neural Networks (CNNs) in the real domain have
been widely used for audio classification. However, their convolution
operations process multi-channel inputs independently, limiting the ability to
capture correlations among channels. This can lead to suboptimal feature
learning, particularly for complex audio patterns such as multi-channel
spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs)
address this limitation by employing quaternion algebr...
Authors:
Qianheng Xu
Russian summary not found
Abstract:
Over 70 million people worldwide experience stuttering, yet most automatic
speech systems misinterpret disfluent utterances or fail to transcribe them
accurately. Existing methods for stutter correction rely on handcrafted feature
extraction or multi-stage automatic speech recognition (ASR) and text-to-speech
(TTS) pipelines, which separate transcription from audio reconstruction and
often amplify distortions. This work introduces StutterZero and StutterFormer,
the first end-to-end waveform-to-w...
Authors:
Tong Zhang, Yihuan Huang, Yanzhen Ren
Russian summary not found
Abstract:
The growing prevalence of speech deepfakes has raised serious concerns,
particularly in real-world scenarios such as telephone fraud and identity
theft. While many anti-spoofing systems have demonstrated promising performance
on lab-generated synthetic speech, they often fail when confronted with
physical replay attacks, a common and low-cost form of attack used in practical
settings. Our experiments show that models trained on existing datasets exhibit
severe performance degradation, with averag...
Authors:
Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian, Qinsi Wang, Nikos Vlassis, Hai "Helen" Li, Yiran Chen
Russian summary not found
Abstract:
Effective human-AI collaboration on complex reasoning tasks requires that
users understand and interact with the model's process, not just receive an
output. However, the monolithic text from methods like Chain-of-Thought (CoT)
prevents this, as current interfaces lack real-time verbalization and robust
user barge-in. We present AsyncVoice Agent, a system whose asynchronous
architecture decouples a streaming LLM backend from a conversational voice
frontend. This design allows narration and infer...
Authors:
Chitralekha Gupta, Soundarya Ramesh, Praveen Sasikumar, Kian Peen Yeo, Suranga Nanayakkara
Russian summary not found
Abstract:
Unmanned Aerial Vehicles (UAVs), or drones, are increasingly used in search
and rescue missions to detect human presence. Existing systems primarily
leverage vision-based methods which are prone to fail under low-visibility or
occlusion. Drone-based audio perception offers promise but suffers from extreme
ego-noise that masks sounds indicating human presence. Existing datasets are
either limited in diversity or synthetic, lacking real acoustic interactions,
and there are no standardized setups fo...
Authors:
Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia
Russian summary not found
Abstract:
Personalized Voice Activity Detection (PVAD) systems activate only in
response to a specific target speaker by incorporating speaker embeddings from
enrollment utterances. Unlike existing methods that require architectural
changes, such as FiLM layers, our approach employs a hypernetwork to modify the
weights of a few selected layers within a standard voice activity detection
(VAD) model. This enables speaker conditioning without changing the VAD
architecture, allowing the same VAD model to adap...
Authors:
Mingxuan Wang, Satoshi Nakamura
Russian summary not found
Abstract:
Machine Speech Chain, simulating the human perception-production loop, proves
effective in jointly improving ASR and TTS. We propose TokenChain, a fully
discrete speech chain coupling semantic-token ASR with a two-stage TTS: an
autoregressive text-to-semantic model co-trained with ASR and a
masked-generative semantic-to-acoustic model for synthesis only. End-to-end
feedback across the text interface is enabled with straight-through
argmax/Gumbel-Softmax and balanced with supervised ASR via dynam...
Authors:
Satvik Dixit, Soham Deshmukh, Bhiksha Raj
Russian summary not found
Abstract:
Audio Question Answering (AQA) is a key task for evaluating Audio-Language
Models (ALMs), yet assessing open-ended responses remains challenging. Existing
metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from
NLP and audio captioning, rely on surface similarity and fail to account for
question context, reasoning, and partial correctness. To address the gap in
literature, we make three contributions in this work. First, we introduce
AQEval to enable systematic benchmarking ...
Showing 11-20 of 74 records