📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

2025-10-23

Авторы:

Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-align...

ID: 2510.18471v1 cs.SE, cs.AI, cs.CL

arXiv PDF

📄 WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

2025-10-23

Авторы:

Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Yangqiu Song, Lihui Chen, Han Hu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous in...

ID: 2510.18560v1 cs.SE, cs.AI

arXiv PDF

📄 Causally Perturbed Fairness Testing

2025-10-23

Авторы:

Chengwen Du, Tao Chen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

To mitigate unfair and unethical discrimination over sensitive features (e.g., gender, age, or race), fairness testing plays an integral role in engineering systems that leverage AI models to handle tabular data. A key challenge therein is how to effectively reveal fairness bugs under an intractable sample size using perturbation. Much current work has been focusing on designing the test sample generators, ignoring the valuable knowledge about data characteristics that can help guide the perturb...

ID: 2510.18719v1 cs.SE, cs.AI

arXiv PDF

📄 More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents

2025-10-22

Авторы:

Pengfei Gao, Chao Peng

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

LLM-powered coding agents, which operate in iterative loops (turns) to solve software engineering tasks, are becoming increasingly powerful. However, their practical deployment is hindered by significant and unpredictable costs. This challenge arises from a combination of factors: quadratically growing token counts with each turn, the high price of models, the large number of turns required for real-world tasks, and the tendency of agents to take inefficient or unnecessary actions. While existin...

ID: 2510.16786v1 cs.SE, cs.AI, cs.LG

arXiv PDF

📄 When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation

2025-10-22

Авторы:

Amirkia Rafiei Oskooei, Kaan Baturalp Cosdan, Husamettin Isiktas, Mehmet S. Aktas

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from...

ID: 2510.16809v1 cs.SE, cs.AI, cs.CL, cs.PL, 68T50, 68N30, 68W40, I.2.7; D.2.7; I.2.6

arXiv PDF

📄 TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

2025-10-22

Авторы:

Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reli...

ID: 2510.17163v1 cs.SE, cs.AI

arXiv PDF

📄 An Experimental Study of Real-Life LLM-Proposed Performance Improvements

2025-10-21

Авторы:

Lirong Yi, Gregory Gay, Philipp Leitner

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large Language Models (LLMs) can generate code, but can they generate fast code? In this paper, we study this question using a dataset of 65 real-world tasks mined from open-source Java programs. We specifically select tasks where developers achieved significant speedups, and employ an automated pipeline to generate patches for these issues using two leading LLMs under four prompt variations. By rigorously benchmarking the results against the baseline and human-authored solutions, we demonstrate...

ID: 2510.15494v1 cs.SE, cs.AI, cs.PF

arXiv PDF

📄 Selecting and Combining Large Language Models for Scalable Code Clone Detection

2025-10-21

Авторы:

Muslim Chochlov, Gul Aftab Ahmed, James Vincent Patten, Yuanhua Han, Guoxian Lu, David Gregg, Jim Buckley

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Source code clones pose risks ranging from intellectual property violations to unintended vulnerabilities. Effective and efficient scalable clone detection, especially for diverged clones, remains challenging. Large language models (LLMs) have recently been applied to clone detection tasks. However, the rapid emergence of LLMs raises questions about optimal model selection and potential LLM-ensemble efficacy. This paper addresses the first question by identifying 76 LLMs and filtering them dow...

ID: 2510.15480v1 cs.SE, cs.AI

arXiv PDF

📄 One Bug, Hundreds Behind: LLMs for Large-Scale Bug Discovery

2025-10-18

Авторы:

Qiushi Wu, Yue Xiao, Dhilung Kirat, Kevin Eykholt, Jiyong Jang, Douglas Lee Schales

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Fixing bugs in large programs is a challenging task that demands substantial time and effort. Once a bug is found, it is reported to the project maintainers, who work with the reporter to fix it and eventually close the issue. However, across the program, there are often similar code segments, which may also contain the bug, but were missed during discovery. Finding and fixing each recurring bug instance individually is labor intensive. Even more concerning, bug reports can inadvertently widen t...

ID: 2510.14036v1 cs.SE, cs.AI

arXiv PDF

📄 E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

2025-10-18

Авторы:

Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

E2EDev comprises (i) a fine-grained set of user requirements, (ii) {multiple BDD test scenarios with corresponding Python step implementations for each requirement}, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). {By evaluating various E2ESD frameworks and LLM backbones with E2EDev}, our analysis reveals a persistent...

ID: 2510.14509v1 cs.SE, cs.AI, cs.CL

arXiv PDF

Показано 131 - 140 из 341 записей