From Code to Action: Hierarchical Learning of Diffusion-VLM Policies
2509.24917v1
cs.RO, cs.LG
2025-10-01
Авторы:
Markus Peschl, Pietro Mazzaglia, Daniel Dijkman
Abstract
Imitation learning for robotic manipulation often suffers from limited
generalization and data scarcity, especially in complex, long-horizon tasks. In
this work, we introduce a hierarchical framework that leverages code-generating
vision-language models (VLMs) in combination with low-level diffusion policies
to effectively imitate and generalize robotic behavior. Our key insight is to
treat open-source robotic APIs not only as execution interfaces but also as
sources of structured supervision: the associated subtask functions - when
exposed - can serve as modular, semantically meaningful labels. We train a VLM
to decompose task descriptions into executable subroutines, which are then
grounded through a diffusion policy trained to imitate the corresponding robot
behavior. To handle the non-Markovian nature of both code execution and certain
real-world tasks, such as object swapping, our architecture incorporates a
memory mechanism that maintains subtask context across time. We find that this
design enables interpretable policy decomposition, improves generalization when
compared to flat policies and enables separate evaluation of high-level
planning and low-level control.
Ссылки и действия
Дополнительные ресурсы: