Semantic World Models
2510.19818v1
cs.LG, cs.AI, cs.RO
2025-10-24
Авторы:
Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, Abhishek Gupta
Abstract
Planning with world models offers a powerful paradigm for robotic control.
Conventional approaches train a model to predict future frames conditioned on
current frames and actions, which can then be used for planning. However, the
objective of predicting future pixels is often at odds with the actual planning
objective; strong pixel reconstruction does not always correlate with good
planning decisions. This paper posits that instead of reconstructing future
frames as pixels, world models only need to predict task-relevant semantic
information about the future. For such prediction the paper poses world
modeling as a visual question answering problem about semantic information in
future frames. This perspective allows world modeling to be approached with the
same tools underlying vision language models. Thus vision language models can
be trained as "semantic" world models through a supervised finetuning process
on image-action-text data, enabling planning for decision-making while
inheriting many of the generalization and robustness properties from the
pretrained vision-language models. The paper demonstrates how such a semantic
world model can be used for policy improvement on open-ended robotics tasks,
leading to significant generalization improvements over typical paradigms of
reconstruction-based action-conditional world modeling. Website available at
https://weirdlabuw.github.io/swm.
Ссылки и действия
Дополнительные ресурсы: