Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation
2510.23921v1
cs.CL, cs.LG
2025-10-30
Авторы:
Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi
Abstract
Large Language Models have been shown to demonstrate stereotypical biases in
their representations and behavior due to the discriminative nature of the data
that they have been trained on. Despite significant progress in the development
of methods and models that refrain from using stereotypical information in
their decision-making, recent work has shown that approaches used for bias
alignment are brittle. In this work, we introduce a novel and general
augmentation framework that involves three plug-and-play steps and is
applicable to a number of fairness evaluation benchmarks. Through application
of augmentation to a fairness evaluation dataset (Bias Benchmark for Question
Answering (BBQ)), we find that Large Language Models (LLMs), including
state-of-the-art open and closed weight models, are susceptible to
perturbations to their inputs, showcasing a higher likelihood to behave
stereotypically. Furthermore, we find that such models are more likely to have
biased behavior in cases where the target demographic belongs to a community
less studied by the literature, underlining the need to expand the fairness and
safety research to include more diverse communities.
Ссылки и действия
Дополнительные ресурсы: