David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation
2510.14115v1
cs.SE, cs.LG
2025-10-19
Authors:
Philipp Bauerfeind, Amir Salarpour, David Fernandez, Pedram MohajerAnsari, Johannes Reschke, Mert D. Pesé
Abstract
Scenario simulation is central to testing autonomous driving systems. Scenic,
a domain-specific language (DSL) for CARLA, enables precise and reproducible
scenarios, but NL-to-Scenic generation with large language models (LLMs)
suffers from scarce data, limited reproducibility, and inconsistent metrics. We
introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a
difficulty-stratified 30-case test split, an Example Retriever, and 14
prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four
proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine
open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using
text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics
(compilation and generation), and compare them with an expert study (n=11).
EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of
EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking
fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B, running on local
hardware, reaches about 88 percent of GPT-4o's expert score. Retrieval-augmented prompting,
Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and
scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder
outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a
standardized, reproducible basis for evaluating Scenic code generation and
indicate that mid-size open-source models are practical, cost-effective options
for autonomous-driving scenario programming.
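
The abstract describes EDIT-COMP only as the F1 of EDIT-SIM and compilation. The following is a minimal sketch of one way to read that definition. It assumes that EDIT-SIM is one minus the length-normalized Levenshtein distance, that "compilation" is the fraction of generated programs that compile, and that F1 is the harmonic mean of the two dataset-level values. The function names (`edit_sim`, `edit_comp`, `dataset_edit_comp`) are hypothetical and not taken from the NL2Scenic code base; the paper's exact normalization may differ.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_sim(pred: str, ref: str) -> float:
    """Assumed EDIT-SIM: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def edit_comp(mean_edit_sim: float, compile_rate: float) -> float:
    """Harmonic mean (F1) of the two scores; 0 when both are 0."""
    total = mean_edit_sim + compile_rate
    if total == 0:
        return 0.0
    return 2 * mean_edit_sim * compile_rate / total


def dataset_edit_comp(preds: Sequence[str],
                      refs: Sequence[str],
                      compiled: Sequence[bool]) -> float:
    """Assumed dataset-level aggregation: average first, then take F1."""
    if not (len(preds) == len(refs) == len(compiled)) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sim = sum(edit_sim(p, r) for p, r in zip(preds, refs)) / len(preds)
    rate = sum(compiled) / len(compiled)
    return edit_comp(mean_sim, rate)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model under this reading scores well only if its output is both textually close to the reference and compilable, which is consistent with the abstract's description of EDIT-COMP as a robust dataset-level proxy.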