Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs
2510.14242v1
cs.CL, cs.LG
2025-10-18
Authors:
Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Morteza Dehghani
Abstract
Large Language Models (LLMs) often produce inconsistent answers when faced
with different phrasings of the same prompt. In this paper, we propose
Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves
robustness to such perturbations. $F^2C$ is composed of two key components. The
first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt
variations to create a hard pseudo-label. The second is a representation
alignment loss that pulls lower-confidence and non-majority predictors toward
the consensus established by high-confidence, majority-voting variations. We
evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt
variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%,
improves mean $F_1$ by 8.94%, and reduces performance variance across formats
by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively,
increasing $\overline{F_1}$ and agreement while decreasing variance across most
source-target pairs. Finally, when trained on only a subset of prompt
perturbations and evaluated on held-out formats, $F^2C$ consistently improves
both performance and agreement while reducing variance. These findings
highlight $F^2C$ as an effective unsupervised method for enhancing LLM
consistency, performance, and generalization under prompt perturbations. Code
is available at
https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
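As a rough illustration of the Consensus Cross-Entropy idea described in the abstract, the sketch below takes a majority vote over the predictions from several prompt variations and uses the winning label as a hard pseudo-label for a cross-entropy loss. This is a minimal sketch and not the authors' implementation: the function names, the toy probabilities, the averaging over variations, and the tie-breaking rule are all assumptions made for illustration, and the representation alignment loss is not shown.

```python
import math
from collections import Counter


def majority_pseudo_label(predictions):
    """Return the most frequent predicted label across prompt variations.

    Assumption: ties go to the label seen first (Counter.most_common order);
    the abstract does not say how ties are handled.
    """
    label, _ = Counter(predictions).most_common(1)[0]
    return label


def consensus_cross_entropy(probs_per_variation, pseudo_label, eps=1e-12):
    """Mean negative log-probability of the pseudo-label over all variations.

    Assumption: the loss is averaged uniformly over variations.
    """
    losses = [-math.log(max(p[pseudo_label], eps)) for p in probs_per_variation]
    return sum(losses) / len(losses)


# Hypothetical class probabilities for one input under three prompt variations.
probs = [
    {"pos": 0.9, "neg": 0.1},
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.3, "neg": 0.7},
]

# Each variation's own prediction is its highest-probability label.
preds = [max(p, key=p.get) for p in probs]

# Two of three variations predict "pos", so "pos" becomes the pseudo-label.
label = majority_pseudo_label(preds)
loss = consensus_cross_entropy(probs, label)
print(label, round(loss, 4))
```

In this toy case the third variation disagrees with the majority, so it contributes the largest term to the loss, which is the pressure that pushes non-majority variations toward the consensus. No ground-truth label is used anywhere, which is what makes the objective unsupervised.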