VideoNorms: Benchmarking Cultural Awareness of Video Language Models
2510.08543v1
cs.CV, cs.AI, cs.CL, cs.CY
2025-10-11
Authors:
Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, Meng Fan Wang, Smaranda Muresan
Abstract
As Video Large Language Models (VideoLLMs) are deployed globally, they
require understanding of and grounding in the relevant cultural background. To
properly assess these models' cultural awareness, adequate benchmarks are
needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm)
pairs from US and Chinese cultures annotated with socio-cultural norms grounded
in speech act theory, norm adherence and violation labels, and verbal and
non-verbal evidence. To build VideoNorms, we use a human-AI collaboration
framework, where a teacher model using theoretically-grounded prompting
provides candidate annotations and a set of trained human experts validate and
correct the annotations. We benchmark a variety of open-weight VideoLLMs on the
new dataset, which highlights several common trends: 1) models perform worse on
norm violation than on adherence; 2) models perform worse on Chinese culture
than on US culture; 3) models have more difficulty providing non-verbal
evidence than verbal evidence for the norm adherence/violation label and
struggle to identify the exact norm corresponding to a speech act; and 4)
unlike humans, models perform worse in formal, non-humorous contexts. Our
findings emphasize the need for culturally-grounded video language model
training, a gap our benchmark and framework begin to address.