PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
2510.19060v1
cs.CV, cs.AI, cs.CL
2025-10-24
Авторы:
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Abstract
While vision-language models (VLMs) have advanced into detailed image
description, evaluation remains a challenge. Standard metrics (e.g. CIDEr,
SPICE) were designed for short texts and tuned to recognize errors that are now
uncommon, such as object misidentification. In contrast, long texts require
sensitivity to attribute and relation attachments and scores that localize
errors to particular text spans. In this work, we introduce PoSh, a metric for
detailed image description that uses scene graphs as structured rubrics to
guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained
errors (e.g. mistakes in compositional understanding). PoSh is replicable,
interpretable and a better proxy for human raters than existing metrics
(including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new
dataset, DOCENT. This novel benchmark contains artwork, paired with
expert-written references, and model-generated descriptions, augmented with
granular and coarse judgments of their quality from art history students. Thus,
DOCENT enables evaluating both detailed image description metrics and detailed
image description itself in a challenging new domain. We show that PoSh
achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments
in DOCENT than the best open-weight alternatives, is robust to image type
(using CapArena, an existing dataset of web imagery) and is a capable reward
function, outperforming standard supervised fine-tuning. Then, using PoSh, we
characterize the performance of open and closed models in describing the
paintings, sketches and statues in DOCENT and find that foundation models
struggle to achieve full, error-free coverage of images with rich scene
dynamics, establishing a demanding new task to gauge VLM progress. Through both
PoSh and DOCENT, we hope to enable advances in important areas such as
assistive text generation.
Ссылки и действия
Дополнительные ресурсы: