v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
2509.25773v1
cs.CV, cs.AI, cs.CL
2025-10-02
Authors:
Zhengpeng Shi, Hengli Li, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng
Abstract
AI models capable of comprehending humor hold real-world promise -- for
example, enhancing engagement in human-machine interactions. To gauge and
diagnose the capacity of multimodal large language models (MLLMs) for humor
understanding, we introduce v-HUB, a novel visual-centric video humor
understanding benchmark. v-HUB comprises a curated collection of minimally
verbal short videos, sourced from classic silent films and online resources,
and reflecting real-world scenarios where humor can be appreciated purely
through visual cues. Each video clip is paired with rich annotations, including
captions, descriptions, and explanations, supporting evaluation tasks like
caption matching and humor explanation. To broaden its applicability, we
further construct an open-ended video QA task, making it readily integrable
into existing video understanding benchmarks. We evaluate a diverse set of
MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process
audio, covering both open-source and proprietary domains. The experimental
results expose the difficulties MLLMs face in comprehending humor from visual
cues alone. For example, all models exhibit a marked performance drop on
caption matching when moving from text-based to video-based evaluation (without
audio). Our findings also demonstrate that incorporating audio helps with video
humor understanding, highlighting the informativeness of sound and the promise
of integrating richer modalities for complex video understanding tasks.