Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
2510.26466v1
cs.CV, cs.LG
2025-11-01
Авторы:
Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
Abstract
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
Ссылки и действия
Дополнительные ресурсы: