Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
2510.26038v1
cs.LG, cs.AI, cs.CL, cs.CV
2025-11-01
Authors:
Jiali Cheng, Chirag Agarwal, Hadi Amiri
Abstract
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on a model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of "debiasing"
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pinpoint the internal attention
patterns and circuits that cause the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high-quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
study on the effect of KD on debiasing and its internal mechanism at scale.
Our findings provide insight into how KD works and how to design better
debiasing methods.
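
For readers unfamiliar with KD, the following is a minimal sketch of the widely used temperature-scaled distillation objective, in which a student matches the teacher's softened output distribution while also fitting the gold label. The abstract does not specify the exact loss used in this study, so the function names (`softmax`, `kd_loss`), the `temperature` and `alpha` parameters, and their default values are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Generic distillation loss for one example (assumed form, not the paper's).

    alpha weights the soft-target term; (1 - alpha) weights the hard-label term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Cross-entropy of the student (at temperature 1) against the gold label
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor keeps soft-target gradients on a comparable scale
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    # Hypothetical 3-way NLI logits (entailment, neutral, contradiction)
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print(kd_loss(student, teacher, label=0))
```

In this framing, the question the study raises is whether a student trained on such soft targets from a debiased teacher also inherits the teacher's robustness to spurious correlations.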