📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 GovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

2025-12-06

Авторы:

Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics, failing to capture the unique challenge of data governance:...

ID: 2512.04416v1 cs.AI, cs.SE

arXiv PDF

📄 TaskEval: Synthesised Evaluation for Foundation-Model Tasks

2025-12-06

Авторы:

Dilani Widanapathiranage, Scott Barnett, Stefanus Kurniawan, Wannita Takerngsaksiri

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Hallucinations are a key concern when creating applications that rely on Foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as \textit{evals}. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human in...

ID: 2512.04442v1 cs.AI, cs.SE

arXiv PDF

📄 PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

2025-12-04

Авторы:

Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writi...

ID: 2512.02589v1 cs.AI, cs.SE

arXiv PDF

📄 Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

2025-12-02

Авторы:

Aayush Garg, Zanis Ali Khan, Renzo Degiovanni, Qiang Tang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Automated vulnerability patching is crucial for software security, and recent advancements in Large Language Models (LLMs) present promising capabilities for automating this task. However, existing research has primarily assessed LLMs using publicly disclosed vulnerabilities, leaving their effectiveness on related artificial vulnerabilities largely unexplored. In this study, we empirically evaluate the patching effectiveness and complementarity of several prominent LLMs, such as OpenAI's GPT var...

ID: 2511.23408v1 cs.CR, cs.AI, cs.SE

arXiv PDF

📄 EnergyTwin: A Multi-Agent System for Simulating and Coordinating Energy Microgrids

2025-11-27

Авторы:

Jakub Muszyński, Ignacy Walużenicz, Patryk Zan, Zofia Wrona, Maria Ganzha, Marcin Paprzycki, Costin Bădică

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, typically, power-system simulators capture physical behaviour but assume centralized control, while multi-agent frameworks model decentralized decision-making but represent energy with no physical groundin...

ID: 2511.20590v1 cs.MA, cs.AI, cs.SE

arXiv PDF

📄 Learning to Debug: LLM-Organized Knowledge Trees for Solving RTL Assertion Failures

2025-11-26

Авторы:

Yunsheng Bai, Haoxing Ren

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Debugging is the dominant cost in modern hardware verification, where assertion failures are among the most frequent and expensive to resolve. While Large Language Models (LLMs) show promise, they often fail to capture the precise, reusable expertise that engineers apply, leading to inaccurate responses. We propose GROVE, a hierarchical knowledge management framework that learns and organizes reusable debugging expertise into an LLM-organized knowledge tree for solving assertion failures. GROVE ...

ID: 2511.17833v1 cs.AI, cs.SE

arXiv PDF

📄 Natural Emergent Misalignment from Reward Hacking in Production RL

2025-11-26

Авторы:

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, Evan Hubinger

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning ab...

ID: 2511.18397v1 cs.AI, cs.SE

arXiv PDF

📄 An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

2025-11-22

Авторы:

Alexander Zadorojniy, Segev Wasserkrug, Eitan Farchi

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has became increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling ....

ID: 2511.16383v1 cs.AI, cs.SE

arXiv PDF

📄 Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response

2025-11-21

Авторы:

Philip Drammeh

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large language models (LLMs) promise to accelerate incident response in production systems, yet single-agent approaches generate vague, unusable recommendations. We present MyAntFarm.ai, a reproducible containerized framework demonstrating that multi-agent orchestration fundamentally transforms LLM-based incident response quality. Through 348 controlled trials comparing single-agent copilot versus multi-agent systems on identical incident scenarios, we find that multi-agent orchestration achieve...

ID: 2511.15755v1 cs.AI, cs.SE

arXiv PDF

📄 VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

2025-11-19

Авторы:

Youpeng Li, Fuxun Yu, Xinda Wang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference d...

ID: 2511.11896v1 cs.CR, cs.AI, cs.SE

arXiv PDF

Показано 1 - 10 из 72 записей