Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
2510.27195v1
cs.CV, cs.CL, cs.SI
2025-11-04
Авторы:
Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato
Abstract
As AI systems become increasingly integrated into human lives, endowing them
with robust social intelligence has emerged as a critical frontier. A key
aspect of this intelligence is discerning truth from deception, a ubiquitous
element of human interaction that is conveyed through a complex interplay of
verbal language and non-verbal visual cues. However, automatic deception
detection in dynamic, multi-party conversations remains a significant
challenge. The recent rise of powerful Multimodal Large Language Models
(MLLMs), with their impressive abilities in visual and textual understanding,
makes them natural candidates for this task. Consequently, their capabilities
in this crucial domain are mostly unquantified. To address this gap, we
introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and
present a novel multimodal dataset derived from the social deduction game
Werewolf. This dataset provides synchronized video, text, with verifiable
ground-truth labels for every statement. We establish a comprehensive benchmark
evaluating state-of-the-art MLLMs, revealing a significant performance gap:
even powerful models like GPT-4o struggle to distinguish truth from falsehood
reliably. Our analysis of failure modes indicates that these models fail to
ground language in visual social cues effectively and may be overly
conservative in their alignment, highlighting the urgent need for novel
approaches to building more perceptive and trustworthy AI systems.
Ссылки и действия
Дополнительные ресурсы: