📊 Статистика дайджестов

Всего дайджестов: 34123 Добавлено сегодня: 101

Последнее обновление: сегодня

📄 Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

2025-12-04

Авторы:

Agathoklis Georgiou

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propos...

ID: 2512.02660v1 cs.CV, cs.IR

arXiv PDF

📄 SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

2025-12-02

Авторы:

Shun Inadumi, Shohei Tanaka, Tosho Hirasawa, Atsushi Hashimoto, Koichiro Yoshino, Yoshitaka Ushiku

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this g...

ID: 2511.22490v1 cs.CV, cs.IR

arXiv PDF

📄 CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections

2025-12-02

Авторы:

Shisir Shahriar Arif, Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal, and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersectio...

ID: 2511.22873v1 cs.CV, cs.IR

arXiv PDF

📄 HyM-UNet: Synergizing Local Texture and Global Context via Hybrid CNN-Mamba Architecture for Medical Image Segmentation

2025-11-25

Авторы:

Haodong Chen, Xianfei Han, Qwen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Accurate organ and lesion segmentation is a critical prerequisite for computer-aided diagnosis. Convolutional Neural Networks (CNNs), constrained by their local receptive fields, often struggle to capture complex global anatomical structures. To tackle this challenge, this paper proposes a novel hybrid architecture, HyM-UNet, designed to synergize the local feature extraction capabilities of CNNs with the efficient global modeling capabilities of Mamba. Specifically, we design a Hierarchical Enc...

ID: 2511.17988v1 cs.CV, cs.IR

arXiv PDF

📄 A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

2025-11-25

Авторы:

Bulat Khaertdinov, Mirela Popa, Nava Tintarev

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, ...

ID: 2511.17255v1 cs.CV, cs.IR

arXiv PDF

📄 FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

2025-11-19

Авторы:

Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effect...

ID: 2511.13063v1 cs.CV, cs.IR

arXiv PDF

📄 Region-Point Joint Representation for Effective Trajectory Similarity Learning

2025-11-19

Авторы:

Hao Long, Silin Zhou, Lisi Chen, Shuo Shang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbf{RePo}, a novel method that jointly encodes \textbf{Re}gion-wise and \textbf{Po}int-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the...

ID: 2511.13125v1 cs.CV, cs.IR, cs.LG

arXiv PDF

📄 Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

2025-11-19

Авторы:

Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only trans...

ID: 2511.13189v1 cs.CV, cs.IR

arXiv PDF

📄 TabRAG: Tabular Document Retrieval via Structured Language Representations

2025-11-15

Авторы:

Jacob Si, Mike Qu, Michelle Lee, Yingzhen Li

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents for embedding model encoding. The former, while accurate, incurs high computational hardware requirements, while the latter suffers from suboptimal performance when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to tackle table-heavy documents via structured language rep...

ID: 2511.06582v1 cs.CL, cs.AI, cs.CV, cs.IR, cs.LG

arXiv PDF

📄 Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

2025-11-15

Авторы:

Da Li, Yuxiao Luo, Keping Bi, Jiafeng Guo, Wei Yuan, Biao Yang, Yan Wang, Fan Yang, Tingting Gao, Guorui Zhou

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapt...

ID: 2511.08480v1 cs.CV, cs.IR

arXiv PDF

Показано 1 - 10 из 35 записей