📊 Статистика дайджестов
Всего дайджестов: 34022 Добавлено сегодня: 82
Последнее обновление: сегодня
📄 MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
2025-10-08Авторы:
Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Bridging clinical diagnostic reasoning with AI remains a central challenge in
medical imaging. We introduce MedCLM, an automated pipeline that converts
detection datasets into large-scale medical visual question answering (VQA)
data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ
segmentation and structured rationales. These contextual signals enable medical
vision-language models to generate question-answer pairs with step-by-step
reasoning. To utilize this data effective...
Авторы:
Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Academic presentation videos have become an essential medium for research
communication, yet producing them remains highly labor-intensive, often
requiring hours of slide design, recording, and editing for a short 2 to 10
minutes video. Unlike natural video, presentation video generation involves
distinctive challenges: inputs from research papers, dense multi-modal
information (text, figures, tables), and the need to coordinate multiple
aligned channels such as slides, subtitles, speech, and hu...
Авторы:
Zhiting Mei, Ola Shorinwa, Anirudha Majumdar
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Generative video models demonstrate impressive text-to-video capabilities,
spurring widespread adoption in many real-world applications. However, like
large language models (LLMs), video generation models tend to hallucinate,
producing plausible videos even when they are factually wrong. Although
uncertainty quantification (UQ) of LLMs has been extensively studied in prior
work, no UQ method for video models exists, raising critical safety concerns.
To our knowledge, this paper represents the fi...
Авторы:
Jingyuan Deng, Yujiu Yang
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Large vision-language models (LVLMs) have shown remarkable performance in
visual-language understanding for downstream multimodal tasks. While their
capabilities are improving, problems emerge simultaneously. Among those
problems, the hallucinations have attracted much attention, which stands for
the phenomenon where LVLMs generate contradictory content to their input visual
and text contents. Many approaches have been proposed to deal with this issue,
such as contrastive decoding and attention ...
Авторы:
Divy Kala, Eshika Khandelwal, Makarand Tapaswi
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Audio descriptions (ADs) narrate important visual details in movies, enabling
Blind and Low Vision (BLV) users to understand narratives and appreciate visual
details. Existing works in automatic AD generation mostly focus on few-second
trimmed clips, and evaluate them by comparing against a single ground-truth
reference AD. However, writing ADs is inherently subjective. Through alignment
and analysis of two independent AD tracks for the same movies, we quantify the
subjectivity in when and wheth...
Авторы:
Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi, Zhen Dong, Sirui Han, Shanghang Zhang
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Trained on internet-scale video data, generative world models are
increasingly recognized as powerful world simulators that can generate
consistent and plausible dynamics over structure, motion, and physics. This
raises a natural question: with the advent of strong video foundational models,
might they supplant conventional vision encoder paradigms for general-purpose
multimodal understanding? While recent studies have begun to explore the
potential of world models on common vision tasks, these ...
📄 Authentic Discrete Diffusion Model
2025-10-04Авторы:
Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, Guangrun Wang
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
We propose an Authentic Discrete Diffusion (ADD) framework that fundamentally
redefines prior pseudo-discrete approaches by preserving core diffusion
characteristics directly in the one-hot space through a suite of coordinated
mechanisms. Unlike conventional "pseudo" discrete diffusion (PDD) methods, ADD
reformulates the diffusion input by directly using float-encoded one-hot class
data, without relying on diffusing in the continuous latent spaces or masking
policies. At its core, a timestep-con...
Авторы:
Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
While recent generative models advance pixel-space video synthesis, they
remain limited in producing professional educational videos, which demand
disciplinary knowledge, precise visual structures, and coherent transitions,
limiting their applicability in educational scenarios. Intuitively, such
requirements are better addressed through the manipulation of a renderable
environment, which can be explicitly controlled via logical commands (e.g.,
code). In this work, we propose Code2Video, a code-c...
Авторы:
Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Analysis of multi-modal content can be tricky, computationally expensive, and
require a significant amount of engineering efforts. Lots of work with
pre-trained models on static data is out there, yet fusing these opensource
models and methods with complex data such as videos is relatively challenging.
In this paper, we present a framework that enables efficiently prototyping
pipelines for multi-modal content analysis. We craft a candidate recipe for a
pipeline, marrying a set of pre-trained mod...
Авторы:
Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Spatial intelligence spans a rich suite of abilities, including visualising
and transforming shapes, mentally rotating objects, judging relational
positions and containment, and estimating numerosity. However, it still remains
a critical unresolved challenge for Multimodal Large Language Models (MLLMs).To
fill this gap, we propose to treat Euclidean geometry problem-solving as a
surrogate task. Specifically, we meticulously constructed a curated multimodal
dataset, called Euclid30K, comprising a...
Показано 81 -
90
из 161 записей