$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
2511.01340v1
cs.CV, cs.CL
2025-11-06
Авторы:
Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
Abstract
Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters
to represent words or phrases creatively) requires a variety of skills such as
image recognition, cognitive skills, commonsense reasoning, multi-step
reasoning, image-based wordplay, etc., making this a challenging task for even
current Vision-Language Models. In this paper, we present
$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse
benchmark of $1,333$ English Rebus Puzzles containing different artistic styles
and levels of difficulty, spread across 18 categories such as food, idioms,
sports, finance, entertainment, etc. We also propose $RebusDescProgICE$, a
model-agnostic framework which uses a combination of an unstructured
description and code-based, structured reasoning, along with better,
reasoning-based in-context example selection, improving the performance of
Vision-Language Models on
$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1-4.1\%$ and
$20-30\%$ using closed-source and open-source models respectively compared to
Chain-of-Thought Reasoning.
Ссылки и действия
Дополнительные ресурсы: