Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

2510.26038v1 cs.LG, cs.AI, cs.CL, cs.CV 2025-11-01

Authors:

Jiali Cheng, Chirag Agarwal, Hadi Amiri

Abstract

Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we report several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention pattern and circuit that cause the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study of the effect of KD on debiasing and its internal mechanism at scale. Our findings shed light on how KD works and on how to design better debiasing methods.
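
For readers unfamiliar with KD, the following is a minimal sketch of the commonly used soft-label distillation objective, in which a student matches the teacher's temperature-softened output distribution while also fitting the ground-truth label. It is an illustration only: the abstract does not specify the paper's loss, and the names `softmax`, `kd_loss`, `temperature`, and `alpha` are assumptions introduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Generic soft-label KD loss for one example (hypothetical helper).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence from the softened teacher distribution to the softened
    student distribution, scaled by temperature**2 as is conventional.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Cross-entropy of the student (temperature 1) against the true label
    ce = -math.log(softmax(student_logits)[label])
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Example: a three-class NLI-style prediction (values are made up)
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss = kd_loss(student, teacher, label=0)
```

In this formulation the student inherits whatever the teacher's soft predictions encode, which is why it is natural to ask, as the abstract does, whether a debiased teacher's robustness carries over.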
