📊 Digest statistics

Total digests: 34022 Added today: 82

Last updated: today

📄 MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

2025-10-31

Authors:

Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang

Annotation:

Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MME...

ID: 2510.25327v2 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

2025-10-30

Authors:

Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo, Kevin Zhu

Annotation:

Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we intr...

ID: 2510.23907v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks

2025-10-30

Authors:

Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner

Annotation:

Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science...

ID: 2510.24010v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

2025-10-30

Authors:

Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

Annotation:

Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whet...

ID: 2510.24709v1 cs.CV, cs.AI, cs.LG, q-bio.NC

arXiv PDF

📄 Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation

2025-10-29

Authors:

Stefan M. Fischer, Johannes Kiechle, Laura Daza, Lina Felsner, Richard Osuala, Daniel M. Lang, Karim Lekadir, Jan C. Peeken, Julia A. Schnabel

Annotation:

In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs acros...

ID: 2510.23241v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Multitask Multimodal Self-Supervised Learning for Medical Images

2025-10-29

Authors:

Cristian Simionescu

Annotation:

This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and constrained by privacy and legal issues. By focusing on the development of self-supervised learning techniques and domain adaptation methods, this research aims to circumvent these limitations, presenting a novel approach to enhance the utility and efficacy of deep learning in medical imaging. Central to this t...

ID: 2510.23325v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

2025-10-28

Authors:

Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Annotation:

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian so...

ID: 2510.20819v2 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

2025-10-28

Authors:

Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz

Annotation:

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, w...

ID: 2510.20994v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Breakdance Video classification in the age of Generative AI

2025-10-25

Authors:

Sauptik Dhar, Naveen Ramakrishnan, Michelle Munson

Annotation:

Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have targeted a limited subset of popular sports like soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a very niche but hugely popular dance sport: breakdance. Our results show that Video Encoder mo...

ID: 2510.20287v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

2025-10-25

Authors:

Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Annotation:

ID: 2510.20819v1 cs.CV, cs.AI, cs.LG

arXiv PDF

Showing 121 - 130 of 358 entries