Textual Entailment and Token Probability as Bias Evaluation Metrics
2510.07662v1
cs.CL, cs.CY, I.2.7; K.4.2
2025-10-11
Authors:
Virginia K. Felkner, Allison Lim, Jonathan May
Abstract
Social bias in language models is typically measured with token
probability (TP) metrics, which are broadly applicable but have been criticized
for their distance from real-world language model use cases and harms. In this
work, we test natural language inference (NLI) as a more realistic alternative
bias metric. We show that, curiously, NLI and TP bias evaluation behave
substantially differently, with very low correlation among different NLI
metrics and between NLI and TP metrics. We find that NLI metrics are more
likely to detect "underdebiased" cases. However, NLI metrics seem to be more
brittle and more sensitive to the wording of counterstereotypical sentences than TP
approaches. We conclude that neither token probability nor natural language
inference is a "better" bias metric in all cases, and we recommend a
combination of TP, NLI, and downstream bias evaluations to ensure comprehensive
evaluation of language models.
Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.