Can Large Vision-Language Models Understand Multimodal Sarcasm?

2508.03654v1 cs.CL, cs.CV 2025-08-09

Authors:

Xinyu Wang, Yue Zhang, Liqiang Jing

Summary

Multimodal sarcasm analysis (MSA) is a difficult task because it requires understanding the disparity between the literal and the intended meaning of a sarcastic statement. Despite progress in multimodal approaches, the application of large vision-language models to this task remains underexplored. This work evaluates such models on MSA, covering both sarcasm detection and sarcasm explanation. It identifies key limitations, such as insufficient understanding of visual information and a lack of conceptual knowledge. To address these problems, the authors propose a training-free approach that combines in-depth object extraction with external conceptual knowledge. Experiments on multiple models show that the proposed approach improves performance on MSA tasks. The code is available at https://github.com/cp-cp/LVLM-MSA.

Abstract

Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model's ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.

Related articles

Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework

Optimizing Multimodal Language Models through Attention-based Interpretability

Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and ...

Do Vision-Language Models Understand Visual Persuasiveness?

Arctic-Extract Technical Report
