📊 Статистика дайджестов

Всего дайджестов: 35039 Добавлено сегодня: 432

Последнее обновление: сегодня

📄 PAVAS: Physics-Aware Video-to-Audio Synthesis

2025-12-10

Авторы:

Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter recei...

ID: 2512.08282v1 cs.CV, cs.MM, cs.SD

arXiv PDF

📄 A Sleep Monitoring System Based on Audio, Video and Depth Information

2025-12-09

Авторы:

Lyn Chao-ling Chen, Kuan-Wen Chen, Yi-Ping Hung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

For quantitative evaluation of sleep disturbances, a noninvasive monitoring system is developed by introducing an event-based method. We observe sleeping in home context and classify the sleep disturbances into three types of events: motion events, light-on/off events and noise events. A device with an infrared depth sensor, a RGB camera, and a four-microphone array is used in sleep monitoring in an environment with barely light sources. One background model is established in depth signals for m...

ID: 2512.06282v1 cs.CV, cs.MM

arXiv PDF

📄 HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

2025-12-04

Авторы:

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, Weili Guan

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the dis...

ID: 2512.02792v1 cs.CV, cs.MM

arXiv PDF

📄 GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models

2025-12-04

Авторы:

Hao Sun, Lei Fan, Donglin Di, Shaohui Liu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts, leveraging diffusion models and hypergraph learning in a three-step process. First, we fine-tune a point cloud generation model to produce a coarse representati...

ID: 2512.03566v1 cs.CV, cs.MM

arXiv PDF

📄 MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning

2025-12-02

Авторы:

Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in ...

ID: 2512.00115v1 cs.SD, cs.CV, cs.MM

arXiv PDF

📄 OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

2025-12-01

Авторы:

Jing Hao, Yuci Liang, Lizhuo Lin, Yuxuan Fan, Wenkai Zhou, Kaixin Guo, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Yanqi Yang, Qiankun Li, Hao Tang, James Kit-Hon Tsoi, Linlin Shen, Kuo Feng Hung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly...

ID: 2511.22055v1 cs.CV, cs.MM

arXiv PDF

📄 Neural B-Frame Coding: Tackling Domain Shift Issues with Lightweight Online Motion Resolution Adaptation

2025-11-26

Авторы:

Sang NguyenQuang, Xiem HoangVan, Wen-Hsiao Peng

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Learned B-frame codecs with hierarchical temporal prediction often encounter the domain-shift issue due to mismatches between the Group-of-Pictures (GOP) sizes for training and testing, leading to inaccurate motion estimates, particularly for large motion. A common solution is to turn large motion into small motion by downsampling video frames during motion estimation. However, determining the optimal downsampling factor typically requires costly rate-distortion optimization. This work introduce...

ID: 2511.18724v1 eess.IV, cs.CV, cs.MM

arXiv PDF

📄 Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

2025-11-26

Авторы:

Wengyi Zhan, Mingbao Lin, Zhihang Lin, Rongrong Ji

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and thousands of visual tokens contributed by high-resolution images. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel V...

ID: 2511.18875v1 cs.CV, cs.MM

arXiv PDF

📄 Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification

2025-11-25

Авторы:

Yangyang Liu, Yuhao Wang, Pingping Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods mainly concentrate on the fusion of multi-modal features, yet neglecting the background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but suffer from multi-modal consistency alignment. To address these issues, we propose a novel selective interaction and global-local align...

ID: 2511.17965v1 cs.CV, cs.MM

arXiv PDF

📄 CPSL: Representing Volumetric Video via Content-Promoted Scene Layers

2025-11-20

Авторы:

Kaiyuan Hu, Yili Jin, Junhua Liu, Xize Duan, Hong Kang, Xue Liu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video rep...

ID: 2511.14927v1 cs.CV, cs.MM

arXiv PDF

Показано 1 - 10 из 130 записей