📊 Digest statistics

Total digests: 34022 Added today: 82

Last updated: today

📄 Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

2025-10-09

Authors:

Radha Gulhane, Sathish Reddy Indurthi

Annotation:

Aligning multimodal large language models (MLLMs) with human preferences often relies on single-signal, model-based reward methods. Such monolithic rewards often lack confidence calibration across domain-specific tasks, fail to capture diverse aspects of human preferences, and require extensive data annotation and reward model training. In this work, we propose a hybrid reward modeling framework that integrates complementary reward paradigms: (i) model-based rewards, where a learned reward model...

ID: 2510.05283v1 cs.AI, cs.CL, cs.CV

arXiv PDF

📄 Bridging the Gap Between Multimodal Foundation Models and World Models

2025-10-08

Authors:

Xuehai He

Annotation:

Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities, such as the ability to perform counterfactual reasoning, simulate dynamics, understand the spatiotemporal ...

ID: 2510.03727v1 cs.AI, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

2025-10-08

Authors:

Wenyuan Zhao, Adithya Balachandran, Chao Tian, Paul Pu Liang

Annotation:

The study of multimodality has garnered significant interest in fields where the analysis of interactions among multiple information sources can enhance predictive modeling, data fusion, and interpretability. Partial information decomposition (PID) has emerged as a useful information-theoretic framework to quantify the degree to which individual modalities independently, redundantly, or synergistically convey information about a target variable. However, existing PID methods depend on optimizing...

ID: 2510.04417v1 cs.LG, cs.AI, cs.CL, cs.CV, cs.IT, math.IT

arXiv PDF

📄 SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting

2025-10-07

Authors:

Sung-Yeon Park, Adam Lee, Juanwu Lu, Can Cui, Luyang Jiang, Rohit Gupta, Kyungtae Han, Ahmadreza Moradipari, Ziran Wang

Annotation:

Driving scene manipulation with sensor data is emerging as a promising alternative to traditional virtual driving simulators. However, existing frameworks struggle to generate realistic scenarios efficiently due to limited editing capabilities. To address these challenges, we present SIMSplat, a predictive driving scene editor with language-aligned Gaussian splatting. As a language-controlled editor, SIMSplat enables intuitive manipulation using natural language prompts. By aligning language wit...

ID: 2510.02469v1 cs.RO, cs.AI, cs.CL, cs.CV

arXiv PDF

📄 The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models

2025-10-04

Authors:

Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan

Annotation:

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems activel...

ID: 2510.02230v1 cs.AI, cs.CL, cs.CV

arXiv PDF

📄 The Unreasonable Effectiveness of Scaling Agents for Computer Use

2025-10-04

Authors:

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang

Annotation:

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSW...

ID: 2510.02250v1 cs.AI, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 IRIS: Intrinsic Reward Image Synthesis

2025-10-02

Authors:

Yihang Chen, Yuanhao Ban, Yunqi Hong, Cho-Jui Hsieh

Annotation:

Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings in text generation, we show that maximizing self-uncertainty, rather than self-certainty, improves ...

ID: 2509.25562v1 cs.AI, cs.CL, cs.CV, cs.LG

arXiv PDF

📄 Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

2025-10-02

Authors:

Max Hartman, Vidhata Jayaraman, Moulik Choraria, Akhil Bhimaraju, Lav R. Varshney

Annotation:

Vision-language models (VLMs) achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work shows that selectively skipping VLM layers can improve efficiency with minimal performance loss or even performance improvements. However, this technique remains underused due to the limited understanding of when layer skipping is beneficial. In this paper, we develop a framework that uses information and learning theory to characterize the condition...

ID: 2509.25584v1 cs.AI, cs.CL, cs.CV, cs.IT, cs.LG, math.IT

arXiv PDF

📄 NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

2025-10-02

Authors:

Danial Kamali, Parisa Kordjamshidi

Annotation:

Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning, the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybri...

ID: 2509.25757v1 cs.AI, cs.CL, cs.CV, cs.SC

arXiv PDF

📄 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

2025-10-01

Authors:

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

#### Context

Video-conditioned sound and speech generation (VSS) is a key direction in artificial intelligence that covers the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks. Existing approaches, however, usually treat these tasks separately and do not bring them together coherently. This leads to inefficiency, extra resource requirements, and more complicated training, so unifying the two tasks in a single model remains an open problem. Our motivation is to develop a model that effectively combines V2S and VisualTTS, reducing complexity and improving the quality of the generated data.

#### Method

We propose VSSFlow, a model built on a flow-matching framework. The model merges both tasks into a single process, aiming for more effective integration of conditions. The main innovation is a condition aggregation mechanism that handles different types of input, such as video and speech transcripts. Different layers of the network (cross-attention and self-attention) were found to exhibit different inductive biases when conditions are introduced. We exploit these properties: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Moreover, we refute the belief that making a model more complex in order to unify tasks degrades quality: with a single training cycle, VSSFlow shows more stable results and faster convergence.

#### Results

We ran experiments on the V2S and VisualTTS tasks using standard datasets. Our results show that VSSFlow outperforms existing specialized models, setting new quality records. Particular attention is paid to the benefits of a shared audio primitive, which speeds up training, gives closer adherence to the conditions, and makes generation more stable. The experiments also confirm that the proposed approach considerably simplifies training and improves the quality of the generated data without additional training stages.

#### Significance

VSSFlow has a wide range of applications, including home assistants, entertainment applications, the medical industry, and AI content generators. The approach is distinctive in that it combines two previously separate tasks into one solution, reducing resource costs and improving quality. Its advantages are ease of deployment, improved stability, and ...

Annotation:

Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally addressed as separate tasks, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we ...

ID: 2509.24773v2 eess.AS, cs.AI, cs.CL, cs.CV, cs.SD

arXiv PDF

Showing 31 - 40 of 64 entries