VaPR -- Vision-language Preference alignment for Reasoning
2510.01700v1
cs.AI, cs.CV, cs.LG
2025-10-04
Авторы:
Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng
Abstract
Preference finetuning methods like Direct Preference Optimization (DPO) with
AI-generated feedback have shown promise in aligning Large Vision-Language
Models (LVLMs) with human preferences. However, existing techniques overlook
the prevalence of noise in synthetic preference annotations in the form of
stylistic and length biases. To this end, we introduce a hard-negative response
generation framework based on LLM-guided response editing, that produces
rejected responses with targeted errors, maintaining stylistic and length
similarity to the accepted ones. Using this framework, we develop the VaPR
dataset, comprising 30K high-quality samples, to finetune three LVLM families:
LLaVA-V1.5, Qwen2VL & Qwen2.5VL (2B-13B sizes). Our VaPR models deliver
significant performance improvements across ten benchmarks, achieving average
gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable
improvements on reasoning tasks. A scaling analysis shows that performance
consistently improves with data size, with LLaVA models benefiting even at
smaller scales. Moreover, VaPR reduces the tendency to answer "Yes" in binary
questions - addressing a common failure mode in LVLMs like LLaVA. Lastly, we
show that the framework generalizes to open-source LLMs as editors, with models
trained on VaPR-OS achieving ~99% of the performance of models trained on
\name, which is synthesized using GPT-4o. Our data, models, and code can be
found on the project page https://vap-r.github.io
Ссылки и действия
Дополнительные ресурсы: