📊 Digest statistics

Total digests: 34022

Added today: 82

Last updated: today

📄 SmartMLOps Studio: Design of an LLM-Integrated IDE with Automated MLOps Pipelines for Model Development and Monitoring

2025-11-06

Authors:

Jiawei Jin, Yingxin Su, Xiaotong Zhu

Annotation:

The rapid expansion of artificial intelligence and machine learning (ML) applications has intensified the demand for integrated environments that unify model development, deployment, and monitoring. Traditional Integrated Development Environments (IDEs) focus primarily on code authoring, lacking intelligent support for the full ML lifecycle, while existing MLOps platforms remain detached from the coding workflow. To address this gap, this study proposes the design of an LLM-Integrated IDE with a...

ID: 2511.01850v1 cs.SE, cs.AI

📄 Metamorphic Testing of Large Language Models for Natural Language Processing

2025-11-06

Authors:

Steven Cho, Stefano Ruberto, Valerio Terragni

Annotation:

Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limite...

ID: 2511.02108v1 cs.SE, cs.AI

📄 Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

2025-11-06

Authors:

Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia

Annotation:

With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of...

ID: 2511.02197v1 cs.SE, cs.AI

📄 EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents

2025-11-06

Authors:

Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

Annotation:

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requ...

ID: 2511.02399v1 cs.SE, cs.AI

📄 MARIA: A Framework for Marginal Risk Assessment without Ground Truth in AI Systems

2025-11-04

Authors:

Jieshan Chen, Suyu Ma, Qinghua Lu, Sung Une Lee, Liming Zhu

Annotation:

Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal ...

ID: 2510.27163v1 cs.SE, cs.AI, cs.HC, D.2.8; D.2.9.m; I.2

📄 Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes

2025-11-04

Authors:

Ora Nova Fandina, Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, Orna Raz

Annotation:

Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, pote...

ID: 2510.27244v1 cs.SE, cs.AI

📄 CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

2025-11-04

Authors:

Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

Annotation:

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instructio...

ID: 2510.27565v1 cs.SE, cs.AI, cs.HC

📄 PRISM: Proof-Carrying Artifact Generation through LLM x MDE Synergy and Stratified Constraints

2025-11-01

Authors:

Tong Ma, Hui Lai, Hui Wang, Zhenhu Tian, Jizhou Wang, Haichao Wu, Yongfan Gao, Chaochao Li, Fengjie Xu, Ling Fang

Annotation:

PRISM unifies Large Language Models with Model-Driven Engineering to generate regulator-ready artifacts and machine-checkable evidence for safety- and compliance-critical domains. PRISM integrates three pillars: a Unified Meta-Model (UMM) reconciles heterogeneous schemas and regulatory text into a single semantic space; an Integrated Constraint Model (ICM) compiles structural and semantic requirements into enforcement artifacts including generation-time automata (GBNF, DFA) and post-generation v...

ID: 2510.25890v1 cs.SE, cs.AI, D.2.4; I.2.2

📄 A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows

2025-11-01

Authors:

Antía Dorado, Iván Folgueira, Sofía Martín, Gonzalo Martín, Álvaro Porto, Alejandro Ramos, John Wallace

Annotation:

CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution...

ID: 2510.25935v1 cs.SE, cs.AI

📄 Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

2025-11-01

Authors:

Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

Annotation:

Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented...

ID: 2510.26130v1 cs.SE, cs.AI, cs.LG

Showing 91-100 of 341 entries