SPACeR: Self-Play Anchoring with Centralized Reference Models
2510.18060v1
cs.LG, cs.AI, cs.RO, I.2.9; I.2.6
2025-10-23
Авторы:
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan
Abstract
Developing autonomous vehicles (AVs) requires not only safety and efficiency,
but also realistic, human-like behaviors that are socially aware and
predictable. Achieving this requires sim agent policies that are human-like,
fast, and scalable in multi-agent settings. Recent progress in imitation
learning with large diffusion-based or tokenized models has shown that
behaviors can be captured directly from human driving data, producing realistic
policies. However, these models are computationally expensive, slow during
inference, and struggle to adapt in reactive, closed-loop scenarios. In
contrast, self-play reinforcement learning (RL) scales efficiently and
naturally captures multi-agent interactions, but it often relies on heuristics
and reward shaping, and the resulting policies can diverge from human norms. We
propose SPACeR, a framework that leverages a pretrained tokenized
autoregressive motion model as a centralized reference policy to guide
decentralized self-play. The reference model provides likelihood rewards and KL
divergence, anchoring policies to the human driving distribution while
preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our
method achieves competitive performance with imitation-learned policies while
being up to 10x faster at inference and 50x smaller in parameter size than
large generative models. In addition, we demonstrate in closed-loop ego
planning evaluation tasks that our sim agents can effectively measure planner
quality with fast and scalable traffic simulation, establishing a new paradigm
for testing autonomous driving policies.