RECODE: Reasoning Through Code Generation for Visual Question Answering
2510.13756v1
cs.CV, cs.AI, cs.LG
2025-10-17
Авторы:
Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi
Abstract
Multimodal Large Language Models (MLLMs) struggle with precise reasoning for
structured visuals like charts and diagrams, as pixel-based perception lacks a
mechanism for verification. To address this, we propose to leverage derendering
-- the process of reverse-engineering visuals into executable code -- as a new
modality for verifiable visual reasoning. Specifically, we propose RECODE, an
agentic framework that first generates multiple candidate programs to reproduce
the input image. It then uses a critic to select the most faithful
reconstruction and iteratively refines the code. This process not only
transforms an ambiguous perceptual task into a verifiable, symbolic problem,
but also enables precise calculations and logical inferences later on. On
various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K,
RECODE significantly outperforms methods that do not leverage code or only use
code for drawing auxiliary lines or cropping. Our work demonstrates that
grounding visual perception in executable code provides a new path toward more
accurate and verifiable multimodal reasoning.
Ссылки и действия
Дополнительные ресурсы: