📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

2025-12-04

Авторы:

Fan Yang, Kaihao Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across...

ID: 2512.02906v2 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

2025-12-04

Авторы:

Fan Yang, Kaihao Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

ID: 2512.02906v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

2025-11-26

Авторы:

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it deman...

ID: 2511.19474v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 Tracking and Segmenting Anything in Any Modality

2025-11-26

Авторы:

Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task ...

ID: 2511.19475v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 Decoupled Audio-Visual Dataset Distillation

2025-11-25

Авторы:

Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance of the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) d...

ID: 2511.17890v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

2025-11-25

Авторы:

Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan HouZhihang Zhong, Xiao Sun

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting....

ID: 2511.17045v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment

2025-11-06

Авторы:

Xinyu Mao, Junsi Li, Haoji Zhang, Yu Liang, Ming Sun

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic g...

ID: 2511.01390v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

2025-11-06

Авторы:

Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, Jiebo Luo

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility...

ID: 2511.01775v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models

2025-10-29

Авторы:

Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability ...

ID: 2510.22571v1 cs.CV, cs.AI, cs.MM

arXiv PDF

📄 LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

2025-10-29

Авторы:

Aleksandar Pramov

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features by multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily-tuned ensemble of gradient boo...

ID: 2510.22829v1 cs.CV, cs.AI, cs.MM

arXiv PDF

Показано 1 - 10 из 26 записей