📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

2025-12-03

Авторы:

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, ret...

ID: 2512.02425v1 cs.CV, cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

📄 When Sufficient is not Enough: Utilizing the Rashomon Effect for Complete Evidence Extraction

2025-11-15

Авторы:

Katharina Beckh, Stefan Rüping

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications this is inadequate. For compliance and cataloging, the full set of contributing features must be identified - complete evidence. We perform a case study on a medical dataset which contains human-annotated complete evidence. We show that individual models typically recover only subsets of complete evidence and that aggregating evidence from several models improves e...

ID: 2511.07055v1 cs.CL, cs.IR, cs.LG

arXiv PDF

📄 AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

2025-11-15

Авторы:

Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward m...

ID: 2511.08325v1 cs.CL, cs.IR, cs.LG

arXiv PDF

📄 Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

2025-11-04

Авторы:

Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 dataset...

ID: 2510.27571v1 cs.CV, cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

📄 DeepAgent: A General Reasoning Agent with Scalable Toolsets

2025-10-28

Авторы:

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-hor...

ID: 2510.21618v1 cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

📄 PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

2025-10-18

Авторы:

Mykolas Sveistrys, Richard Kunert

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions...

ID: 2510.14377v1 cs.CL, cs.IR, cs.LG

arXiv PDF

📄 Agent Learning via Early Experience

2025-10-11

Авторы:

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to sc...

ID: 2510.08558v1 cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

📄 LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science

2025-10-04

Авторы:

Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Hamid Palangi, Tomas Pfister

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by large, heterogeneous files in the large data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation t...

ID: 2510.01285v1 cs.MA, cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

📄 Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

2025-10-01

Авторы:

Kai Hua, Zhiyuan Feng, Chongyang Tao, Rui Yan, Lu Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recently, knowledge-grounded conversations in the open domain gain great attention from researchers. Existing works on retrieval-based dialogue systems have paid tremendous efforts to utilize neural networks to build a matching model, where all of the context and knowledge contents are used to match the response candidate with various representation methods. Actually, different parts of the context and knowledge are differentially important for recognizing the proper response candidate, as many ...

ID: 2509.22845v1 cs.CL, cs.IR, cs.LG, H.3.3; I.2.7; I.2.6

arXiv PDF

📄 PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

2025-10-01

Авторы:

Wei Zhou, Guoliang Li, Haoyu Wang, Yuxing Han, Xufei Wu, Fan Wu, Xuanhe Zhou

## Контекст В последние годы large language models (LLM) продемонстрировали вполне убедительные результаты в задачах Text-to-SQL, где требуется преобразовать синтаксически корректный текстовый запрос в SQL-запрос, соответствующий логике БД. Однако существует еще одна важная задача, называемая Cross-System SQL Translation (чаще всего — SQL-to-SQL), которая заключается в переводе SQL-запроса, написанного для одной базы данных (например, MySQL), на соответствующий запрос, корректный для другой базы данных (например, ClickHouse). Эта задача является достаточно сложной, так как каждая база данных имеет свои особенности в синтаксисе, функциях и системных ограничениях. Несмотря на ее практическую важность, существующие бенчмарки для SQL-задач не очень подходят для эффективной оценки моделей в Cross-System SQL Translation, в основном из-за ограниченного набора систем, с которыми они работают, и неэффективности в отражении реальных системных различий. ## Метод PARROT (Practical And Realistic BenchmaRk for CrOss-System SQL Translation) — это новый бенчмарк для оценки моделей LLM в Cross-System SQL Translation. Он включает 598 пар запросов, полученных из 38 открытых баз данных и реальных бизнес-систем. Авторы специально подготовили эти пары, чтобы оценить то, насколько хорошо модели LLM понимают системно-зависимые различия в SQL. Для расширенного тестирования представлены два дополнительных варианта: PARROT-Diverse (28,003 пар для тестирования многообразия синтаксиса) и PARROT-Simple (5,306 пар для тестирования под конкретные ситуации). Все пары работают с 22 production-grade database systems, что делает PARROT одной из самых мощных и обширных баз для этих задач. Для поддержки будущих исследований авторы также выпустили открытый leaderboard и исходный код на сайте: https://code4db.github.io/parrot-bench/. ## Результаты Авторы провели эксперименты с несколькими популярными LLM, включая GPT-4, LLaMA и др., и оценивали их на PARROT, PARROT-Diverse и PARROT-Simple. Результаты показали, что даже самые продвинутые модели достигают низкую точность (менее 38.53% в среднем) при выполнении задач Cross-System SQL Translation. Это свидетельствует о том, что эта задача значительно сложнее Text-to-SQL и требует более специализированных подходов. Также были проведены тестирования на PARROT-Diverse и PARROT-Simple, которые показали, что LLM способны получать высокую точность на простых задачах, но сильно страдают при работе с системно-зависимыми различиями. ## Значимость PARROT является первым реальностью для эффективной оценки LLM в Cross-System SQL Translation. Его особенность заключается в том, что он хорошо отражает реальные различия систем, что не дает LLM просто "обмануть" бенчмарком, при этом оставаясь полезным для реальных бизнес-систем. Это открывает пути для развития моделей, кото

Annotation:

Large language models (LLMS) have shown increasing effectiveness in Text-to-SQL tasks. However, another closely related problem, Cross-System SQL Translation (a.k.a., SQL-to-SQL), which adapts a query written for one database system (e.g., MySQL) into its equivalent one for another system (e.g., ClickHouse), is of great practical importance but remains underexplored. Existing SQL benchmarks are not well-suited for SQL-to-SQL evaluation, which (1) focus on a limited set of database systems (often...

ID: 2509.23338v1 cs.DB, cs.AI, cs.CL, cs.IR, cs.LG

arXiv PDF

Показано 1 - 10 из 17 записей