BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles
2510.22370v1
cs.RO, cs.AI, cs.CV, cs.LG, cs.SE
2025-10-29
Authors:
Seyed Ahmad Hosseini Miangoleh, Amin Jalal Aghdasian, Farzaneh Abdollahi
Abstract
In this paper, we propose Bootstrapped Language-Image Pretraining-driven
Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a
novel multimodal reinforcement learning (RL) framework for autonomous
lane-keeping (LK), in which semantic embeddings generated by a vision-language
model (VLM) are directly fused with geometric states, LiDAR observations, and
Proportional-Integral-Derivative-based (PID) control feedback within the agent
observation space. By combining high-level scene understanding from the VLM
with low-level control and spatial signals, the proposed method lets the agent
learn driving policies that are context-aware and interpretable.
Our architecture brings together semantic, geometric, and control-aware
representations to make policy learning more robust. A hybrid reward function
that combines semantic alignment, LK accuracy, obstacle avoidance, and speed
regulation makes learning more efficient and generalizable. Unlike approaches
that use semantic models only to shape rewards, our method embeds semantic
features directly into the state representation. This reduces expensive
runtime inference and ensures that semantic
guidance is always available. Simulation results show that the proposed model
achieves better LK stability and adaptability than the best vision-based and
multimodal RL baselines across a wide range of challenging driving scenarios. We
make our code publicly available.
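
The fused observation and hybrid reward described in the abstract might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the normalization of each reward term, the equal weights, and the `max_lane_error` and `safe_distance` thresholds are all hypothetical, and the VLM embedding is treated as an already-computed vector.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical equal weights; the abstract does not state any values.
    semantic: float = 0.25
    lane: float = 0.25
    obstacle: float = 0.25
    speed: float = 0.25


def fuse_observation(
    vlm_embedding: Sequence[float],
    geometric_state: Sequence[float],
    lidar_ranges: Sequence[float],
    pid_feedback: Sequence[float],
) -> List[float]:
    """Concatenate the four signal groups into one flat observation vector."""
    return [*vlm_embedding, *geometric_state, *lidar_ranges, *pid_feedback]


def hybrid_reward(
    semantic_alignment: float,      # assumed to lie in [0, 1]
    lane_error: float,              # lateral offset from lane center (m)
    min_obstacle_distance: float,   # closest LiDAR return (m)
    speed: float,                   # current speed (m/s)
    target_speed: float,            # desired speed (m/s), must be > 0
    max_lane_error: float = 1.0,    # assumed normalization bound (m)
    safe_distance: float = 5.0,     # assumed distance for full obstacle reward (m)
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of four terms, each scaled to [0, 1]."""
    lane_term = 1.0 - min(abs(lane_error) / max_lane_error, 1.0)
    obstacle_term = min(max(min_obstacle_distance, 0.0) / safe_distance, 1.0)
    speed_term = 1.0 - min(abs(speed - target_speed) / target_speed, 1.0)
    return (
        weights.semantic * semantic_alignment
        + weights.lane * lane_term
        + weights.obstacle * obstacle_term
        + weights.speed * speed_term
    )


if __name__ == "__main__":
    obs = fuse_observation([0.1, 0.2], [0.0], [5.0, 6.0], [0.3])
    print(len(obs))
    print(hybrid_reward(1.0, 0.0, 10.0, 10.0, 10.0))
```

Because the semantic embedding is part of the observation vector, the policy receives it at every step, which is the contrast the abstract draws with approaches that use the semantic model only inside the reward.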