DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment

2510.17148v2 cs.RO, cs.CV 2025-10-22

Авторы:

Yu Gao, Anqing Jiang, Yiru Wang, Heng Yuwen, Wang Shuo, Sun Hao, Wang Jijun

Abstract

Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment

Авторы:

Abstract

Ссылки и действия

Связанные статьи

From Generated Human Videos to Physically Plausible Robot Trajectories

Sign Language Recognition using Bidirectional Reservoir Computing

FOM-Nav: Frontier-Object Maps for Object Goal Navigation

Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer

Estimation of Kinematic Motion from Dashcam Footage

Навигация