Automated Validation of LLM-based Evaluators for Software Engineering Artifacts

2508.02827v1 cs.SE, cs.AI 2025-08-09

Authors:

Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Rami Katan, Alice Podolsky, Orna Raz, Avi Ziv

Summary

The authors propose REFINE, an automated framework for assessing the quality of LLM-based evaluators on software engineering tasks. The core problem is that manual evaluation is costly and subjective, while existing automated methods cannot detect subtle differences in artifact quality. REFINE addresses this with two modules: Hierarchy Dataset Builder, which generates artifacts of progressively reduced quality, and Evaluator Tester, which measures how closely an evaluator's rankings match the expected ordering. A distinguishing feature of REFINE is controllability: users can tune the granularity of degradation, from coarse filtering to stress testing on the subtlest defects. The framework was applied at IBM to COBOL tasks, where it helped identify evaluator configurations that raised alignment scores to above 0.9 in some tasks. The resulting evaluators are now used to support model release decisions.

Abstract

Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective, and non-scalable, while existing automated methods fail to discern fine-grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne-grained Nuanced Evaluation), an automated framework for benchmarking LLM-based evaluators across software engineering tasks. REFINE comprises two modules: Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering. A key feature of REFINE is controllability: users can tune the granularity of degradation to progressively refine evaluator configurations, from coarse filtering to stress testing on subtle quality gaps. While the methodology is general, we focus on coding tasks reflecting the practical demands of our production setting. REFINE was integrated into IBM's internal development workflows and applied to code generation, translation, and summarization for COBOL, an enterprise-critical programming language, using industrial data. It was used to identify LLM-as-a-Judge configurations that lifted alignment scores from below $0.7$ to above $0.9$ in some coding tasks. These nuance-sensitive evaluators are now actively used by model training teams to support model release decisions.
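To make the idea of "alignment with the expected ordering" concrete, here is a minimal sketch of one way such a score could be computed. The abstract does not state the exact metric REFINE uses, so the pairwise-concordance measure below, the function names `pairwise_alignment` and `mean_alignment`, and the sample scores are all illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
from typing import Sequence


def pairwise_alignment(scores: Sequence[float]) -> float:
    """Fraction of artifact pairs that the evaluator orders as expected.

    ASSUMPTION: scores[i] is the evaluator's score for the artifact at
    hierarchy level i, where level 0 is the original artifact and each
    later level is a further degraded version. A pair (i, j) with i < j
    is aligned when scores[i] > scores[j]; a tie earns half credit.
    """
    pairs = list(combinations(range(len(scores)), 2))
    if not pairs:
        raise ValueError("need at least two artifacts to compare")
    credit = 0.0
    for i, j in pairs:
        if scores[i] > scores[j]:
            credit += 1.0
        elif scores[i] == scores[j]:
            credit += 0.5
    return credit / len(pairs)


def mean_alignment(hierarchies: Sequence[Sequence[float]]) -> float:
    """Average the pairwise alignment over several quality hierarchies."""
    values = [pairwise_alignment(h) for h in hierarchies]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical scores from two candidate evaluator configurations,
    # each applied to the same two four-level hierarchies.
    config_a = [[9, 7, 5, 2], [8, 8, 6, 3]]
    config_b = [[7, 8, 5, 6], [6, 6, 6, 4]]
    print("config_a:", mean_alignment(config_a))
    print("config_b:", mean_alignment(config_b))
```

Under this sketch, a configuration that ranks every degraded artifact below its less degraded predecessors scores 1.0, and one that inverts every pair scores 0.0. Narrowing the quality gap between adjacent levels makes the same measure a stricter test, which mirrors the controllability described in the abstract.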

Related articles

Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operatio...

Quantitative Analysis of Technical Debt and Pattern Violation in Large Language ...

MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training

Beyond Greenfield: The D3 Framework for AI-Driven Productivity in Brownfield Eng...

LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reli...
