QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models
2511.03206v1
cs.CV, cs.AI, cs.LG
2025-11-07
Авторы:
Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh
Abstract
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues
in multi-image contexts: (1) a lack of fine-grained perception across disparate
images, and (2) a diminished capability to effectively reason over and
synthesize information from multiple visual inputs. However, while various
prompting methods aim to describe visual content, many existing studies focus
primarily on single-image settings or specific, constrained scenarios. This
leaves a critical gap in understanding and addressing how MLLMs tackle more
general and complex multi-image reasoning tasks. Thus, we first extensively
investigate how current prompting methods perceive fine-grained visual details
and process visual information when dealing with multiple images. Our findings
reveal that existing prompting methods fall short in attending to needed clues
and seamlessly integrating perception and reasoning. Inspired by the findings,
we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions
(QG-CoC), a generalized prompting approach that effectively handles problems
with an arbitrary number of images. We evaluate our method on various
open-source and closed-source MLLMs for multi-image and single-image
benchmarks. Experimental results indicate that QG-CoC demonstrates competitive
performance across tasks and exhibits robust improvements in the challenging
scenarios where existing prompting methods fail.
Ссылки и действия
Дополнительные ресурсы: