📊 Digest statistics

Total digests: 34022
Added today: 0

Last updated: today

📄 Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

2025-12-02

Authors:

Chancharik Mitra, Yusen Luo, Raj Saravanan, Dantong Niu, Anirudh Pai, Jesse Thomason, Trevor Darrell, Abrar Anwar, Deva Ramanan, Roei Herzig

Annotation:

Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require fine-tuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. In...

ID: 2511.22697v1 cs.RO, cs.CL, cs.CV

arXiv PDF

📄 Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM

2025-11-25

Authors:

Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

Annotation:

Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task comp...

ID: 2511.17335v1 cs.RO, cs.CL, cs.CV, cs.SD, eess.AS

arXiv PDF

📄 SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

2025-11-21

Authors:

Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu

Annotation:

Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low...

ID: 2511.15605v1 cs.RO, cs.CL, cs.CV

arXiv PDF

📄 MiMo-Embodied: X-Embodied Foundation Model Technical Report

2025-11-21

Authors:

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing Wang, Haiyang Sun, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Chaofan Zhang, Wenbo Ding, Kun Ma, Guang Chen, Rui Cai, Diyun Xiang, Heng Qu, Fuli Luo, Hangjun Ye, Long Chen

Annotation:

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperf...

ID: 2511.16518v1 cs.RO, cs.CL, cs.CV

arXiv PDF

📄 RoboOmni: Proactive Robot Manipulation in Omni-modal Context

2025-10-30

Authors:

Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yugang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu

Annotation:

Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is d...

ID: 2510.23763v2 cs.RO, cs.CL, cs.CV

arXiv PDF

📄 UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles

2025-10-17

Authors:

Neel P. Bhatt, Po-han Li, Kushagra Gupta, Rohan Siva, Daniel Milan, Alexander T. Hogue, Sandeep P. Chinchali, David Fridovich-Keil, Zhangyang Wang, Ufuk Topcu

Annotation:

Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a visio...

ID: 2510.12992v1 cs.RO, cs.CL, cs.CV, cs.MA

arXiv PDF

📄 LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

2025-10-17

Authors:

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, Xipeng Qiu

Annotation:

Vision-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consisten...

ID: 2510.13626v1 cs.RO, cs.CL, cs.CV

arXiv PDF

📄 The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

2025-09-18

Authors:

Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang

## Context

Vision-Language-Action (VLA) models are powerful tools for carrying out complex real-world tasks, particularly in robotics. Their efficiency, however, suffers from the heavy cost of attention-based computation over large sets of visual tokens. The problem is especially acute when deploying on resource-constrained platforms such as mobile devices or robots with limited compute. Addressing it requires methods that cut the computational load without degrading results. Our work focuses on developing such an approach, one that delivers real-time efficiency while preserving high task accuracy.

## Method

We propose LightVLA, a simple yet effective differentiable token pruning method for VLA models. The core idea is to adaptively drop unneeded tokens as the model runs, reducing computation without losing accuracy. LightVLA estimates token importance with dynamically generated queries and applies Gumbel-softmax to choose among tokens in a differentiable way. This lets the model learn on its own to keep only the tokens that matter most for the task. The procedure adds no trainable parameters and can be integrated with modern inference frameworks.

## Results

We ran experiments on the LIBERO benchmark, comparing LightVLA with other VLA models and existing token pruning methods. LightVLA not only raises task success rates but also substantially reduces computation (FLOPs) and latency. Specifically, it cuts FLOPs and latency by 59.1% and 38.2% respectively while improving task success rate by 2.9%. These results indicate a favorable balance between efficiency and accuracy. We also analyzed LightVLA*, a learnable query-based pruning variant, which likewise proved highly effective.

## Significance

LightVLA opens new opportunities for real-time use of VLA models, especially on resource-constrained platforms. It offers substantial gains in reducing compute requirements and improving efficiency during task execution. This may lead to broader...
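
The token selection described in the summary above can be illustrated with a small forward-pass sketch. It is not the paper's implementation: it assumes dot-product scores between queries and visual tokens and one pick per query, and the names `gumbel_softmax` and `select_tokens` are hypothetical. It also omits gradients; in training, an autodiff framework would backpropagate through the relaxed sample.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `logits` (forward pass only)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u clamped inside (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(queries, tokens, tau=1.0, rng=random):
    """Keep every token that is the top pick of at least one query (assumed rule)."""
    kept = set()
    for q in queries:
        # Assumed scoring: dot product between the query and each visual token.
        scores = [sum(a * b for a, b in zip(q, t)) for t in tokens]
        probs = gumbel_softmax(scores, tau, rng)
        kept.add(max(range(len(tokens)), key=probs.__getitem__))
    return sorted(kept)
```

Under these assumptions, tokens that no query picks are pruned, so the number of kept tokens never exceeds the number of queries. Lowering `tau` pushes the relaxed sample closer to a one-hot vector.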

Annotation:

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate v...

ID: 2509.12594v1 cs.RO, cs.CL, cs.CV

arXiv PDF