MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
2510.02790v1
cs.CV, cs.AI, cs.CL, cs.MM
2025-10-07
Авторы:
Jingyuan Deng, Yujiu Yang
Abstract
Large vision-language models (LVLMs) have shown remarkable performance in
visual-language understanding for downstream multimodal tasks. While their
capabilities are improving, problems emerge simultaneously. Among those
problems, the hallucinations have attracted much attention, which stands for
the phenomenon where LVLMs generate contradictory content to their input visual
and text contents. Many approaches have been proposed to deal with this issue,
such as contrastive decoding and attention manipulation. However, contrastive
decoding methods struggle in constructing appropriate contrastive samples, and
attention manipulation methods are highly sensitive, lacking stability. In this
work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach
utilizes the "image heads" in LVLMs, masking them to construct contrastive
samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and
Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The
results demonstrate that MaskCD effectively alleviates the phenomenon of
hallucinations and retains the general capabilities of LVLMs. Corresponding
resources could be found at: https://github.com/Deng-Jingyuan/MaskCD .