📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?

2025-10-15

Авторы:

Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have correspondingly grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify sever...

ID: 2510.10457v1 cs.CL, cs.LG

arXiv PDF

📄 Merlin's Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting

2025-10-15

Авторы:

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investiga...

ID: 2510.10528v1 cs.CL, cs.LG

arXiv PDF

📄 RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

2025-10-15

Авторы:

Zichun Yu, Chenyan Xiong

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintai...

ID: 2510.10681v1 cs.CL, cs.LG

arXiv PDF

📄 End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study

2025-10-15

Авторы:

Anirudh Ganesh, Jayavardhan Reddy

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016)\cite{ma2016end}. The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features whi...

ID: 2510.10936v1 cs.CL, cs.LG

arXiv PDF

📄 Discursive Circuits: How Do Language Models Understand Discourse Relations?

2025-10-15

Авторы:

Yisong Miao, Min-Yen Kan

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, ...

ID: 2510.11210v1 cs.CL, cs.LG

arXiv PDF

📄 Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

2025-10-15

Авторы:

Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the...

ID: 2510.11602v1 cs.CL, cs.LG

arXiv PDF

📄 Are Large Reasoning Models Interruptible?

2025-10-15

Авторы:

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from ...

ID: 2510.11713v2 cs.CL, cs.LG

arXiv PDF

📄 Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

2025-10-15

Авторы:

KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP ...

ID: 2510.11330v1 cs.SD, cs.AI, cs.CL, cs.LG, eess.AS

arXiv PDF

📄 How Reliable is Language Model Micro-Benchmarking?

2025-10-14

Авторы:

Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a mic...

ID: 2510.08730v1 cs.CL, cs.LG

arXiv PDF

📄 A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large?

2025-10-14

Авторы:

Graham Tierney, Srikar Katta, Christopher Bail, Sunshine Hillygus, Alexander Volfovsky

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text...

ID: 2510.08758v1 stat.ME, cs.CL, cs.LG, stat.AP

arXiv PDF

Показано 221 - 230 из 573 записей