Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
2511.02197v1
cs.SE, cs.AI
2025-11-06
Авторы:
Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
Abstract
With the widespread application of large language models (LLMs) in the field
of code intelligence, increasing attention has been paid to the reliability and
controllability of their outputs in code reasoning tasks. Confidence estimation
serves as an effective and convenient approach for evaluating these aspects.
This paper proposes a confidence analysis and enhancement framework for LLMs
tailored to code reasoning tasks. We conduct a comprehensive empirical study on
the confidence reliability of mainstream LLMs across different tasks, and
further evaluate the effectiveness of techniques such as prompt strategy
optimisation and mathematical calibration (e.g., Platt Scaling) in improving
confidence reliability. Our results show that DeepSeek-Reasoner achieves the
best performance across various tasks, outperforming other models by up to
$0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance
Score, respectively. The hybrid strategy combining the reassess prompt strategy
and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$
over the original performance in the aforementioned three metrics. These
results indicate that models with reasoning capabilities demonstrate superior
confidence reliability, and that the hybrid strategy is the most effective in
enhancing the confidence reliability of various models. Meanwhile, we elucidate
the impact of different task complexities, model scales, and strategies on
confidence performance, and highlight that the confidence of current LLMs in
complex reasoning tasks still has considerable room for improvement. This study
not only provides a research foundation and technical reference for the
application of confidence in LLM-assisted software engineering, but also points
the way for future optimisation and engineering deployment of confidence
mechanisms.
Ссылки и действия
Дополнительные ресурсы: