What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
2511.03768v1
cs.LG, cs.CV
2025-11-08
Authors:
Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim
Abstract
Multimodal language models possess a remarkable ability to handle an
open-vocabulary's worth of objects. Yet the best models still suffer from
hallucinations when reasoning about scenes in the real world, revealing a gap
between their seemingly strong performance on saturating perception benchmarks
and their reasoning in the real world. To address this gap,
we build a novel benchmark of in-the-wild scenes that we call Common-O. With
more than 10.5k examples using exclusively new images not found in web training
data to avoid contamination, Common-O goes beyond just perception, inspired by
cognitive tests for humans, to probe reasoning across scenes by asking "what's
in common?". We evaluate leading multimodal language models, including models
specifically trained to perform chain-of-thought reasoning. We find that
perceiving objects in single images is tractable for most models, yet reasoning
across scenes is very challenging even for the best models, including reasoning
models. Despite saturating many leaderboards focusing on perception, the best
performing model only achieves 35% on Common-O -- and on Common-O Complex,
consisting of more complex scenes, the best model achieves only 1%. Curiously,
we find models are more prone to hallucinate when similar objects are present
in the scene, suggesting models may be relying on object co-occurrence seen
during training. Among the models we evaluated, we find that scale provides
modest improvements, while models explicitly trained with multi-image inputs
show larger improvements, suggesting that scaled multi-image training may offer
promise. We make our benchmark publicly available to spur research into the
challenge of hallucination when reasoning across scenes.