Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
2510.08138v1
cs.CV, cs.AI, cs.MM
2025-10-11
Authors:
Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian
Abstract
Large language models (LLMs) often generate self-contradictory outputs, which
severely impacts their reliability and hinders their adoption in practical
applications. In video-language models (Video-LLMs), this phenomenon has recently
drawn the attention of researchers. Specifically, these models fail to provide
logically consistent responses to rephrased questions based on their grounding
outputs. However, the underlying causes of this phenomenon remain
underexplored. In this work, we adopt an interpretability-driven approach to
analyze, statistically summarize, and intervene on the potential factors of the
phenomenon. We find that one of the primary reasons for the inconsistency in
responses lies in the inability of cross-modal attention heads to effectively
distinguish video tokens across different timestamps. To address this, we
propose an attention enhancement method called Temporally Conditioned Attention
Sharpening (TCAS), which constructs an enhancement objective based on attention
distinctions to enhance the model's temporal resolution capability, thereby
improving its temporal understanding logic consistency. Experimental results
demonstrate that our method significantly enhances the temporal logic
consistency of Video-LLMs. Further interpretability analyses reveal that our
method indeed improves the temporal discriminability of attention heads,
validating our conclusions. Additionally, our method achieves performance
improvements in general video temporal grounding tasks, highlighting that
temporal logic consistency is a bottleneck in temporal understanding. By
enhancing consistency, our method drives significant progress in video temporal
understanding.
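
The abstract does not spell out the TCAS objective, so the following is only a minimal sketch of what an attention-distinction objective could look like, not the authors' formulation. It assumes a single cross-modal attention head with one attention weight per video frame, a boolean mask marking the frames inside the queried time span, and a hinge loss with a hypothetical `margin` parameter. The function names `softmax` and `attention_margin_loss` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Convert raw attention scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_margin_loss(attn, in_span, margin=0.1):
    """Hinge-style penalty on one head's attention over video frames.

    attn:    attention weight per frame (assumed layout: one value per frame).
    in_span: True for frames inside the queried time span, False otherwise.
    margin:  hypothetical hyperparameter; the mean attention inside the span
             should exceed the mean attention outside it by at least this much.
    """
    inside = [a for a, m in zip(attn, in_span) if m]
    outside = [a for a, m in zip(attn, in_span) if not m]
    if not inside or not outside:
        return 0.0  # no contrast can be formed without both groups
    gap = sum(inside) / len(inside) - sum(outside) / len(outside)
    return max(0.0, margin - gap)


in_span = [False, True, True, False]

# A head that spreads attention evenly cannot tell timestamps apart,
# so it is penalized by the full margin.
flat_loss = attention_margin_loss(softmax([0.0, 0.0, 0.0, 0.0]), in_span)

# A head that concentrates on the in-span frames already satisfies the margin.
sharp_loss = attention_margin_loss(softmax([0.0, 3.0, 3.0, 0.0]), in_span)
```

In this toy setup the uniform head incurs a positive loss while the sharpened head incurs none, which matches the intuition in the abstract that heads should distinguish video tokens across timestamps.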