DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
2510.17148v2
cs.RO, cs.CV
2025-10-22
Authors:
Yu Gao, Anqing Jiang, Yiru Wang, Heng Yuwen, Wang Shuo, Sun Hao, Wang Jijun
Abstract
Conventional end-to-end (E2E) driving models are effective at generating
physically plausible trajectories, but often fail to generalize to long-tail
scenarios due to the lack of essential world knowledge to understand and reason
about surrounding environments. In contrast, Vision-Language-Action (VLA)
models leverage world knowledge to handle challenging cases, but their limited
3D reasoning capability can lead to physically infeasible actions. In this work,
we introduce DiffVLA++, an enhanced autonomous driving framework that
explicitly bridges cognitive reasoning and E2E planning through metric-guided
alignment. First, we build a VLA module that directly generates semantically
grounded driving trajectories. Second, we design an E2E module with a dense
trajectory vocabulary that ensures physical feasibility. Third, and most
critically, we introduce a metric-guided trajectory scorer that guides and
aligns the outputs of the VLA and E2E modules, thereby integrating their
complementary strengths. Evaluation on the ICCV 2025 Autonomous Grand
Challenge leaderboard shows that DiffVLA++ achieves an EPDMS of 49.12.
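
The idea of a scorer that arbitrates between VLA and E2E proposals can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scorer's architecture or inputs, so the `Candidate` type, the hand-written `toy_score` function (forward progress minus a collision penalty), and the `safe_dist` threshold are all hypothetical stand-ins for a learned, metric-guided scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint in metres, ego frame (assumed convention)


@dataclass
class Candidate:
    source: str              # hypothetical label: "vla" or "e2e"
    waypoints: List[Point]


def toy_score(traj: Sequence[Point], obstacles: Sequence[Point],
              safe_dist: float = 2.0) -> float:
    """Hypothetical stand-in for a metric-guided score.

    Rewards forward progress and applies a large penalty when any
    waypoint comes closer than safe_dist to an obstacle.
    """
    progress = traj[-1][0] - traj[0][0]
    if obstacles:
        min_gap = min(
            ((px - ox) ** 2 + (py - oy) ** 2) ** 0.5
            for px, py in traj
            for ox, oy in obstacles
        )
    else:
        min_gap = float("inf")
    penalty = 100.0 if min_gap < safe_dist else 0.0
    return progress - penalty


def select_trajectory(
    candidates: Sequence[Candidate],
    obstacles: Sequence[Point],
    score_fn: Callable[[Sequence[Point], Sequence[Point]], float] = toy_score,
) -> Candidate:
    """Pool proposals from both modules and return the highest-scoring one."""
    return max(candidates, key=lambda c: score_fn(c.waypoints, obstacles))


# Toy scene: one obstacle directly ahead of the ego vehicle.
obstacles = [(10.0, 0.0)]
vla = Candidate("vla", [(0.0, 0.0), (5.0, 1.5), (10.0, 3.0), (15.0, 3.0)])
e2e = Candidate("e2e", [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)])

best = select_trajectory([vla, e2e], obstacles)
print(best.source)  # prints "vla": the straight-line proposal hits the obstacle
```

In this toy scene the straight-line proposal is rejected because it passes through the obstacle, while the swerving proposal keeps a 3 m gap. A real scorer would presumably be trained against the benchmark's driving metrics rather than hand-coded.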