📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation

2025-10-28

Авторы:

Christiaan M. Geldenhuys, Günther Tonitz, Thomas R. Niesler

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to ...

ID: 2510.21280v1 eess.AS, cs.AI, cs.LG, cs.SD, q-bio.QM

arXiv PDF

📄 WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation

2025-10-28

Авторы:

Christiaan M. Geldenhuys, Günther Tonitz, Thomas R. Niesler

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

ID: 2510.21280v2 eess.AS, cs.AI, cs.LG, cs.SD, q-bio.QM

arXiv PDF

📄 Compressing Quaternion Convolutional Neural Networks for Audio Classification

2025-10-28

Авторы:

Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to suboptimal feature learning, particularly for complex audio patterns such as multi-channel spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs) address this limitation by employing quaternion algebr...

ID: 2510.21388v1 eess.AS, cs.AI, cs.SD, eess.SP

arXiv PDF

📄 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

2025-10-24

Авторы:

Qianheng Xu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-w...

ID: 2510.18938v1 eess.AS, cs.AI, cs.CL

arXiv PDF

📄 EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

2025-10-24

Авторы:

Tong Zhang, Yihuan Huang, Yanzhen Ren

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks-a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with averag...

ID: 2510.19414v1 eess.AS, cs.AI, cs.SD

arXiv PDF

📄 AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning

2025-10-22

Авторы:

Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian, Qinsi Wang, Nikos Vlassis, Hai "Helen" Li, Yiran Chen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Effective human-AI collaboration on complex reasoning tasks requires that users understand and interact with the model's process, not just receive an output. However, the monolithic text from methods like Chain-of-Thought (CoT) prevents this, as current interfaces lack real-time verbalization and robust user barge-in. We present AsyncVoice Agent, a system whose asynchronous architecture decouples a streaming LLM backend from a conversational voice frontend. This design allows narration and infer...

ID: 2510.16156v1 eess.AS, cs.AI, cs.MM

arXiv PDF

📄 DroneAudioset: An Audio Dataset for Drone-based Search and Rescue

2025-10-21

Авторы:

Chitralekha Gupta, Soundarya Ramesh, Praveen Sasikumar, Kian Peen Yeo, Suranga Nanayakkara

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups fo...

ID: 2510.15383v1 eess.AS, cs.AI, cs.SD

arXiv PDF

📄 HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

2025-10-17

Авторы:

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adap...

ID: 2510.12947v1 eess.AS, cs.AI, cs.LG, cs.SD

arXiv PDF

📄 TokenChain: A Discrete Speech Chain via Semantic Token Modeling

2025-10-09

Авторы:

Mingxuan Wang, Satoshi Nakamura

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynam...

ID: 2510.06201v1 eess.AS, cs.AI, cs.CL, cs.SD

arXiv PDF

📄 AURA Score: A Metric For Holistic Audio Question Answering Evaluation

2025-10-08

Авторы:

Satvik Dixit, Soham Deshmukh, Bhiksha Raj

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from NLP and audio captioning, rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address the gap in literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking ...

ID: 2510.04934v1 eess.AS, cs.AI

arXiv PDF

Показано 11 - 20 из 74 записей