You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
2510.14885v1
cs.CV, cs.CL
2025-10-18
Авторы:
Logan Lawrence, Oindrila Saha, Megan Wei, Chen Sun, Subhransu Maji, Grant Van Horn
Abstract
Despite the renewed interest in zero-shot visual classification due to the
rise of Multimodal Large Language Models (MLLMs), the problem of evaluating
free-form responses of auto-regressive models remains a persistent challenge.
Most existing works focus on language-only tasks or don't consider Multiple
Choice Questions (MCQs) beyond 5-way options, both of which are critical
capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where
choice counts are in the hundreds to thousands and the choices are highly
related. Furthermore, in this highly multi-way MCQ setting it is not clear how
to extend LLM choice extraction to retrieval-based problems, where computing
probabilities over the choice set is computationally costly. In this work we
investigate nlg2choice, a simple two-stage method which first asks the MLLM an
open-ended question for the task with minimal constraints, then uses text-only
constrained decoding to predict the most likely choice. In retrieval settings,
we compute the probability of the constrained response taking that choice with
an early stopping method to significantly improve throughput. Our results show
improvement over a suite of seven fine-grained visual datasets when evaluating
in terms of classification and retrieval, and show that this performance holds
over the various ways that users of LLMs can implement tasks in natural
language.
Ссылки и действия
Дополнительные ресурсы: