Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
2510.05593v1
cs.CV, cs.AI, cs.CL
2025-10-09
Авторы:
Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang
Abstract
Autoregressive multimodal large language models have recently gained
popularity for image generation, driven by advances in foundation models. To
enhance alignment and detail, newer approaches employ chain-of-thought (CoT)
reasoning, expanding user inputs into elaborated prompts prior to image
synthesis. However, this strategy can introduce unnecessary redundancy -- a
phenomenon we call visual overthinking -- which increases computational costs
and can introduce details that contradict the original prompt. In this work, we
explore how to generate more concise CoT sequences for more efficient image
generation. We introduce ShortCoTI, a lightweight optimization framework that
encourages more concise CoT while preserving output image quality. ShortCoTI
rewards more concise prompts with an adaptive function that scales according to
an estimated difficulty for each task. Incorporating this reward into a
reinforcement learning paradigm reduces prompt reasoning length by 54% while
maintaining or slightly improving quality metrics across multiple benchmarks
(T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates
verbose explanations and repetitive refinements, producing reasoning prompts
that are both concise and semantically rich. As a result, ShortCoTI improves
computational efficiency without compromising the fidelity or visual appeal of
generated images.
Ссылки и действия
Дополнительные ресурсы: