Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
2509.26644v1
cs.CV, cs.AI, cs.LG
2025-10-02
Авторы:
Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
Abstract
Text-to-Image (T2I) generation models have advanced rapidly in recent years,
but accurately capturing spatial relationships like "above" or "to the right
of" poses a persistent challenge. Earlier methods improved spatial relationship
following with external position control. However, as architectures evolved to
enhance image quality, these techniques became incompatible with modern models.
We propose Stitch, a training-free method for incorporating external position
control into Multi-Modal Diffusion Transformers (MMDiT) via
automatically-generated bounding boxes. Stitch produces images that are both
spatially accurate and visually appealing by generating individual objects
within designated bounding boxes and seamlessly stitching them together. We
find that targeted attention heads capture the information necessary to isolate
and cut out individual objects mid-generation, without needing to fully
complete the image. We evaluate Stitch on PosEval, our benchmark for
position-based T2I generation. Featuring five new tasks that extend the concept
of Position beyond the basic GenEval task, PosEval demonstrates that even top
models still have significant room for improvement in position-based
generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances
base models, even improving FLUX by 218% on GenEval's Position task and by 206%
on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on
PosEval, improving over previous models by 54%, all accomplished while
integrating position control into leading models training-free. Code is
available at https://github.com/ExplainableML/Stitch.
Ссылки и действия
Дополнительные ресурсы: