Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
2510.03182v1
cs.RO, cs.AI, cs.CL, cs.SC
2025-10-07
Authors:
Yilun Hao, Yongchao Chen, Chuchu Fan, Yang Zhang
Abstract
Vision Language Models (VLMs) show strong potential for visual planning but
struggle with precise spatial and long-horizon reasoning. In contrast, Planning
Domain Definition Language (PDDL) planners excel at long-horizon formal
planning, but cannot interpret visual inputs. Recent works combine these
complementary advantages by enabling VLMs to turn visual planning problems into
PDDL files for formal planning. However, while VLMs can generate PDDL problem
files satisfactorily, they struggle to accurately generate the PDDL domain
files, which describe all the planning rules. As a result, prior methods rely
on human experts to predefine domain files or on constant environment access
for refinement. We propose VLMFP, a Dual-VLM-guided framework that can
autonomously generate both PDDL problem and domain files for formal visual
planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: a
SimVLM that simulates action consequences based on input rule descriptions, and
a GenVLM that generates and iteratively refines PDDL files by comparing the
PDDL and SimVLM execution results. VLMFP generalizes at multiple levels: the
same generated PDDL domain file works for all the different instances under
the same problem, and the VLMs generalize to different problems with varied
appearances and rules. We evaluate VLMFP on 6 grid-world domains and test its
generalization to unseen instances, appearances, and game
rules. On average, SimVLM accurately describes 95.5% and 82.6% of scenarios,
simulates 85.5% and 87.8% of action sequences, and correctly judges goal
reaching in 82.4% and 85.6% of cases for seen and unseen appearances,
respectively. With the guidance of SimVLM, VLMFP generates PDDL files that
achieve valid-plan rates of 70.0% and 54.1% on unseen instances with seen and
unseen appearances, respectively. Project page:
https://sites.google.com/view/vlmfp.
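
The refinement idea described above, comparing what the simulator predicts with what the candidate PDDL domain produces and revising the domain when they disagree, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`find_mismatches`, `refine_until_consistent`), the dictionary standing in for a domain file, and the one-dimensional corridor are all hypothetical, and plain Python callables take the place of SimVLM, the PDDL executor, and GenVLM.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Actions = Sequence[str]
Domain = Dict[str, bool]  # hypothetical stand-in for a PDDL domain file


def find_mismatches(
    domain: Domain,
    action_sequences: List[Actions],
    simulate: Callable[[Actions], int],
    execute: Callable[[Domain, Actions], int],
) -> List[Actions]:
    """Return the action sequences on which simulator and domain disagree."""
    return [seq for seq in action_sequences if simulate(seq) != execute(domain, seq)]


def refine_until_consistent(
    domain: Domain,
    action_sequences: List[Actions],
    simulate: Callable[[Actions], int],
    execute: Callable[[Domain, Actions], int],
    refine: Callable[[Domain, List[Actions]], Domain],
    max_rounds: int = 5,
) -> Tuple[Domain, int]:
    """Revise the domain until it agrees with the simulator or rounds run out."""
    for round_idx in range(max_rounds):
        mismatches = find_mismatches(domain, action_sequences, simulate, execute)
        if not mismatches:
            return domain, round_idx
        domain = refine(domain, mismatches)
    return domain, max_rounds


# Toy environment (assumption): a corridor with cells 0..2, agent starts at 0.
def run(actions: Actions, clamp: bool) -> int:
    pos = 0
    for act in actions:
        pos += 1 if act == "right" else -1
        if clamp:  # walls block movement past either end of the corridor
            pos = max(0, min(2, pos))
    return pos


def toy_simulate(actions: Actions) -> int:
    # Plays the role of the simulator: it knows that walls block movement.
    return run(actions, clamp=True)


def toy_execute(domain: Domain, actions: Actions) -> int:
    # Plays the role of executing actions under the candidate domain rules.
    return run(actions, clamp=domain["clamp"])


def toy_refine(domain: Domain, mismatches: List[Actions]) -> Domain:
    # Plays the role of the generator: here the only possible fix is the wall rule.
    return {**domain, "clamp": True}


probes = [["right", "right", "right"], ["left"]]
final_domain, rounds = refine_until_consistent(
    {"clamp": False}, probes, toy_simulate, toy_execute, toy_refine
)
```

In this toy run the initial domain omits the wall rule, so both probe sequences end in different cells under the simulator and under the domain; a single refinement step adds the rule and the loop stops once the two agree on every probe.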