📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 True Self-Supervised Novel View Synthesis is Transferable

2025-10-17

Авторы:

Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geo...

ID: 2510.13063v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

2025-10-17

Авторы:

Zian Li, Muhan Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video ...

ID: 2510.13669v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs

2025-10-17

Авторы:

Mustafa Munir, Alex Zhang, Radu Marculescu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be ga...

ID: 2510.13740v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 RECODE: Reasoning Through Code Generation for Visual Question Answering

2025-10-17

Авторы:

Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then us...

ID: 2510.13756v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Deep Attention-guided Adaptive Subsampling

2025-10-16

Авторы:

Sharath M Shankaranarayana, Soumava Kumar Roy, Prasad Sudhakar, Chandan Aladahalli

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a nondifferentiable operation, poses signif...

ID: 2510.12376v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

2025-10-16

Авторы:

A. Alfarano, L. Venturoli, D. Negueruela del Castillo

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistic...

ID: 2510.12750v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 UniFusion: Vision-Language Model as Unified Encoder in Image Generation

2025-10-16

Авторы:

Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and la...

ID: 2510.12789v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

2025-10-16

Авторы:

Caner Korkmaz, Brighton Nuwagira, Barış Coşkunuzer, Tolga Birdal

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP int...

ID: 2510.12795v1 cs.CV, cs.AI, cs.LG, math.AT, stat.ML

arXiv PDF

📄 MRI Brain Tumor Detection with Computer Vision

2025-10-15

Авторы:

Jack Krolik, Jake Lynn, John Henry Rudden, Dmytro Vremenko

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of ...

ID: 2510.10250v1 cs.CV, cs.AI, cs.LG, 68T07, 68U10, I.2.6; I.2.10; I.4.6

arXiv PDF

📄 Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes

2025-10-15

Авторы:

Zhao-Yang Wang, Jieneng Chen, Jiang Liu, Yuxiang Guo, Rama Chellappa

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce ...

ID: 2510.10406v1 cs.CV, cs.AI, cs.LG

arXiv PDF

Показано 151 - 160 из 358 записей