Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
2510.26423v1
cs.SE, cs.CL
2025-11-01
Авторы:
Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo, Qianru Zhang, See-Kiong Ng
Abstract
Test oracle generation in non-regression testing is a longstanding challenge
in software engineering, where the goal is to produce oracles that can
accurately determine whether a function under test (FUT) behaves as intended
for a given input. In this paper, we introduce Nexus, a novel multi-agent
framework to address this challenge. Nexus generates test oracles by leveraging
a diverse set of specialized agents that synthesize test oracles through a
structured process of deliberation, validation, and iterative self-refinement.
During the deliberation phase, a panel of four specialist agents, each
embodying a distinct testing philosophy, collaboratively critiques and refines
an initial set of test oracles. Then, in the validation phase, Nexus generates
a plausible candidate implementation of the FUT and executes the proposed
oracles against it in a secure sandbox. For any oracle that fails this
execution-based check, Nexus activates an automated selfrefinement loop, using
the specific runtime error to debug and correct the oracle before
re-validation. Our extensive evaluation on seven diverse benchmarks
demonstrates that Nexus consistently and substantially outperforms
state-of-theart baselines. For instance, Nexus improves the test-level oracle
accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The
improved accuracy also significantly enhances downstream tasks: the bug
detection rate of GPT4.1-Mini generated test oracles on HumanEval increases
from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of
automated program repair improves from 35.23% to 69.32%.
Ссылки и действия
Дополнительные ресурсы: