Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

2512.01242v1 cs.CV, cs.AI, cs.CL 2025-12-03

Авторы:

Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee

Abstract

We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Traini...

NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Model...

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Stream...

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcem...

From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM f...

Навигация