📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

2025-12-04

Авторы:

Junwon Lee, Juhan Nam, Jiyoung Lee

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fa...

ID: 2512.02650v1 cs.CV, cs.LG, cs.MM, cs.SD, eess.AS

arXiv PDF

📄 From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

2025-12-02

Авторы:

Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image-identifying objects and describing scenes-they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs ...

ID: 2511.22805v1 cs.CV, cs.LG, cs.MM

arXiv PDF

📄 Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment

2025-12-02

Авторы:

Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps....

ID: 2512.00120v1 cs.SD, cs.AI, cs.CV, cs.LG, cs.MM

arXiv PDF

📄 Multimodal Real-Time Anomaly Detection and Industrial Applications

2025-11-26

Авторы:

Aman Verma, Keshav Samdani, Mohd. Samiuddin Shafi

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal...

ID: 2511.18698v1 cs.SD, cs.AI, cs.CV, cs.LG, cs.MM

arXiv PDF

📄 DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

2025-11-20

Авторы:

Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen, Xiaozhen Qiao, Hao Li, Xiaoyan Sun

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated enco...

ID: 2511.14530v1 cs.CV, cs.LG, cs.MM

arXiv PDF

📄 Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar

2025-11-19

Авторы:

Rongsheng Qian, Chi Xu, Xiaoqiang Ma, Hao Fang, Yili Jin, William I. Atlas, Jiangchuan Liu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. Its broader use is constrained by two coupled challenges: highly limited uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) that affect up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumption...

ID: 2511.13922v1 eess.IV, cs.CV, cs.LG, cs.MM

arXiv PDF

📄 Calibrated Multimodal Representation Learning with Missing Modalities

2025-11-18

Авторы:

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviat...

ID: 2511.12034v1 cs.CV, cs.LG, cs.MM

arXiv PDF

📄 Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement

2025-10-29

Авторы:

Yunhong Tao, Wenbing Tao, Xiang Xiang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency domain information or fail to effectively promote the propagation and flow of information, limiting the LLIE performance. In this paper, we develop a novel frequency-spatial inter...

ID: 2510.22154v1 eess.IV, cs.CV, cs.LG, cs.MM, eess.SP

arXiv PDF

📄 Post-surgical Endometriosis Segmentation in Laparoscopic Videos

2025-10-18

Авторы:

Andreas Leibetseder, Klaus Schoeffmann, Jörg Keckstein, Simon Keckstein

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Endometriosis is a common women's condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial impl...

ID: 2510.13899v1 cs.CV, cs.LG, cs.MM

arXiv PDF

📄 MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates

2025-10-15

Авторы:

Binyu Zhao, Wei Zhang, Zhaonian Zou

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, ofte...

ID: 2510.10534v1 cs.CV, cs.LG, cs.MM

arXiv PDF

Показано 1 - 10 из 20 записей