David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation

2510.14115v1 cs.SE, cs.LG 2025-10-19

Авторы:

Philipp Bauerfeind, Amir Salarpour, David Fernandez, Pedram MohajerAnsari, Johannes Reschke, Mert D. Pesé

Abstract

Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware. Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Large Language Models for Software Engineering: A Reproducibility Crisis

Neural Variable Name Repair: Learning to Rename Identifiers for Readability

stable-pretraining-v1: Foundation Model Research Made Simple

Agint: Agentic Graph Compilation for Software Engineering Agents

Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated ...

Навигация