📊 Digest statistics
Total digests: 34022 Added today: 82
Last updated: today
Authors:
Xiangen Hu, Jiarui Tong, Sheng Xu
Russian summary not found
Available fields: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Training and education in human-centered fields require authentic practice,
yet realistic simulations of human behavior have remained limited. We present a
multi-agent psychological simulation system that models internal
cognitive-affective processes to generate believable human behaviors. In
contrast to black-box neural models, this system is grounded in established
psychological theories (e.g., self-efficacy, mindset, social constructivism)
and explicitly simulates an "inner parliament" of a...
Authors:
Jieshan Chen, Suyu Ma, Qinghua Lu, Sung Une Lee, Liming Zhu
Annotation:
Before deploying an AI system to replace an existing process, it must be
compared with the incumbent to ensure improvement without added risk.
Traditional evaluation relies on ground truth for both systems, but this is
often unavailable due to delayed or unknowable outcomes, high costs, or
incomplete data, especially for long-standing systems deemed safe by
convention. The more practical solution is not to compute absolute risk but the
difference between systems. We therefore propose a marginal ...
📄 CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
2025-11-04 Authors:
Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi
Annotation:
As large language models become increasingly capable of generating code,
evaluating their performance remains a complex and evolving challenge. Existing
benchmarks primarily focus on functional correctness, overlooking the diversity
of real-world coding tasks and developer expectations. To this end, we
introduce a multi-language benchmark that evaluates LLM instruction-following
capabilities and is extensible to operate on any set of standalone coding
problems. Our benchmark evaluates instructio...
📄 Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue
2025-11-01 Authors:
Vanessa Figueiredo, David Elumeze
Annotation:
Large Language Models (LLMs) promise to transform interactive games by
enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it
remains unclear whether constrained prompts actually improve player experience.
We investigate this question through The Interview, a voice-based detective
game powered by GPT-4o. A within-subjects usability study ($N=10$) compared
high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable
experiential differences beyond sensitivit...
Authors:
Nissan Yaron, Dan Bystritsky, Ben-Etzion Yaron
Annotation:
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS
Grounding public subset within a $\pm 5$ pp equivalence margin.
Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI
69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference
is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's
$d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp).
When purchased as managed APIs, Humans-Junior's base...
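
The equivalence claim in this abstract can be sanity-checked from the reported summary numbers alone. The sketch below is not the authors' procedure: it assumes the bootstrap 95% CI is roughly symmetric and normal, backs out a standard error from it, and runs two one-sided z-tests (TOST). The function name `tost_from_ci` is hypothetical.

```python
from statistics import NormalDist


def tost_from_ci(diff, ci_low, ci_high, margin, alpha=0.05):
    """Approximate TOST from a point estimate and its 95% CI.

    Assumption: the CI is close to diff +/- z * SE under a normal model,
    so SE can be recovered from the CI width. All values are in
    percentage points.
    """
    norm = NormalDist()
    z95 = norm.inv_cdf(0.975)
    se = (ci_high - ci_low) / (2 * z95)

    # One-sided test against H0: diff <= -margin
    p_lower = 1 - norm.cdf((diff + margin) / se)
    # One-sided test against H0: diff >= +margin
    p_upper = norm.cdf((diff - margin) / se)

    # Equivalence requires rejecting both null hypotheses.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Numbers taken from the abstract: paired difference 0.8 pp,
# bootstrap 95% CI from -3.1 to +4.7 pp.
for margin in (5.0, 3.0):
    p_value, equivalent = tost_from_ci(0.8, -3.1, 4.7, margin)
    print(f"margin ±{margin} pp: p = {p_value:.3f}, equivalent = {equivalent}")
```

Under these assumptions the check agrees with the abstract: equivalence holds at the ±5 pp margin and fails at ±3 pp.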
📄 Can AI be Accountable?
2025-11-01 Authors:
Andrew L. Kun
Annotation:
The AI we use is powerful, and its power is increasing rapidly. If this
powerful AI is to serve the needs of consumers, voters, and decision makers,
then it is imperative that the AI is accountable. In general, an agent is
accountable to a forum if the forum can request information from the agent
about its actions, if the forum and the agent can discuss this information, and
if the forum can sanction the agent. Unfortunately, in too many cases today's
AI is not accountable -- we cannot question ...
Authors:
Rishub Jain, Sophie Bridgers, Lili Janzer, Rory Greig, Tian Huey Teh, Vladimir Mikulik
Annotation:
Human feedback is critical for aligning AI systems to human values. As AI
capabilities improve and AI is used to tackle more challenging tasks, verifying
quality and safety becomes increasingly challenging. This paper explores how we
can leverage AI to improve the quality of human oversight. We focus on an
important safety problem that is already challenging for humans:
fact-verification of AI outputs. We find that combining AI ratings and human
ratings based on AI rater confidence is better tha...
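
The abstract is truncated and does not say how the two ratings are combined. One simple possibility, shown here purely as an illustration, is threshold deferral: keep the AI rating when the AI rater is confident and fall back to the human rating otherwise. The function name `combine_ratings` and the threshold value are hypothetical and not taken from the paper.

```python
def combine_ratings(ai_label, ai_confidence, human_label, threshold=0.8):
    """Pick a final fact-verification label from an AI and a human rating.

    Assumption (not from the paper): the AI rater reports a confidence in
    [0, 1], and ratings below `threshold` are deferred to the human.
    """
    if not 0.0 <= ai_confidence <= 1.0:
        raise ValueError("ai_confidence must be in [0, 1]")
    return ai_label if ai_confidence >= threshold else human_label


# Illustrative items: (AI label, AI confidence, human label).
items = [
    (True, 0.95, False),  # confident AI rating is kept
    (True, 0.40, False),  # low confidence, human rating is used
    (False, 0.80, True),  # exactly at the threshold, AI rating is kept
]
final = [combine_ratings(ai, conf, human) for ai, conf, human in items]
print(final)
```

A hard threshold is only one option; a weighted combination of the two ratings would be another way to use the same confidence signal.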
Authors:
Stefano Natangelo
Annotation:
Artificial intelligence systems based on large language models (LLMs) can now
generate coherent text, music, and images, yet they operate without a
persistent state: each inference reconstructs context from scratch. This paper
introduces the Narrative Continuity Test (NCT) -- a conceptual framework for
evaluating identity persistence and diachronic coherence in AI systems. Unlike
capability benchmarks that assess task performance, the NCT examines whether an
LLM remains the same interlocutor acr...
📄 Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study
2025-10-31 Authors:
Mateen Ahmed Abbasi, Petri Ihantola, Tommi Mikkonen, Niko Mäkitalo
Annotation:
The future of Requirements Engineering (RE) is increasingly driven by
artificial intelligence (AI), reshaping how we elicit, analyze, and validate
requirements. Traditional RE is based on labor-intensive manual processes prone
to errors and complexity. AI-powered approaches, specifically large language
models (LLMs), natural language processing (NLP), and generative AI, offer
transformative solutions and reduce inefficiencies. However, the use of AI in
RE also brings challenges like algorithmic ...
Authors:
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Annotation:
Virtual Reality (VR) games require players to translate high-level semantic
actions into precise device manipulations using controllers and head-mounted
displays (HMDs). While humans intuitively perform this translation based on
common sense and embodied understanding, whether Large Language Models (LLMs)
can effectively replicate this ability remains underexplored. This paper
introduces a benchmark, ComboBench, evaluating LLMs' capability to translate
semantic actions into VR device manipulatio...
Showing 71-80 of 238 entries