📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 82

Последнее обновление: сегодня

📄 Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

2025-12-05

Авторы:

Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single and double-pre...

ID: 2512.04355v1 cs.DC, cs.AI, cs.PF

arXiv PDF

📄 Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

2025-12-04

Авторы:

Wenbin Zhu, Zhaoyan Shen, Zili Shao, Hongjun Dai, Feng Chen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through eff...

ID: 2512.01357v1 cs.DC, cs.AI, cs.AR

arXiv PDF

📄 Delta Sum Learning: an approach for fast and global convergence in Gossip Learning

2025-12-04

Авторы:

Tom Goethals, Merlijn Sebrechts, Stijn De Schrijver, Filip De Turck, Bruno Volckaert

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Federated Learning is a popular approach for distributed learning due to its security and computational benefits. With the advent of powerful devices in the network edge, Gossip Learning further decentralizes Federated Learning by removing centralized integration and relying fully on peer to peer updates. However, the averaging methods generally used in both Federated and Gossip Learning are not ideal for model accuracy and global convergence. Additionally, there are few options to deploy Learni...

ID: 2512.01549v1 cs.DC, cs.AI

arXiv PDF

📄 Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems

2025-12-02

Авторы:

Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, Josep Torrellas

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs), enabling rapid adaptation to diverse domains. In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. However, state-of-the-art serving systems co-batch heterogeneous adapters without accounting for rank (size) variability, leading to severe performance skew, which ultimately requires adding...

ID: 2511.22880v1 cs.DC, cs.AI, cs.LG

arXiv PDF

📄 IslandRun: Privacy-Aware Multi-Objective Orchestration for Distributed AI Inference

2025-12-02

Авторы:

Bala Siva Sai Akhil Malepati

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Modern AI inference faces an irreducible tension: no single computational resource simultaneously maximizes performance, preserves privacy, minimizes cost, and maintains trust. Existing orchestration frameworks optimize single dimensions (Kubernetes prioritizes latency, federated learning preserves privacy, edge computing reduces network distance), creating solutions that struggle under real-world heterogeneity. We present IslandRun, a multi-objective orchestration system that treats computation...

ID: 2512.00595v1 cs.DC, cs.AI, cs.CR

arXiv PDF

📄 Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

2025-11-27

Авторы:

Tim Trappen, Robert Keßler, Roland Pabel, Viktor Achter, Stefan Wesner

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Due to rising demands for Artificial Inteligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves LLMs by integr...

ID: 2511.21413v1 cs.DC, cs.AI, cs.DB, cs.PF

arXiv PDF

📄 SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference

2025-11-26

Авторы:

Ziyang Zhang, Jie Liu, Luca Mottola

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The resource demands of deep neural network (DNN) models introduce significant performance challenges, especially when deployed on resource-constrained edge devices. Existing solutions like model compression often sacrifice accuracy, while specialized hardware remains costly and inflexible. Hybrid inference methods, however, typically overlook how operator characteristics impact performance. In this work, we present SparOA, a CPU-GPU hybrid inference framework, which leverages both sparsity and ...

ID: 2511.19457v1 cs.DC, cs.AI

arXiv PDF

📄 Systemic approach for modeling a generic smart grid

2025-11-26

Авторы:

Sofiane Ben Amor, Guillaume Guerard, Loup-Noé Levy

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Smart grid technological advances present a recent class of complex interdisciplinary modeling and increasingly difficult simulation problems to solve using traditional computational methods. To simulate a smart grid requires a systemic approach to integrated modeling of power systems, energy markets, demand-side management, and much other resources and assets that are becoming part of the current paradigm of the power grid. This paper presents a backbone model of a smart grid to test alternativ...

ID: 2511.19460v1 cs.DC, cs.AI, eess.SY

arXiv PDF

📄 Temperature in SLMs: Impact on Incident Categorization in On-Premises Environments

2025-11-26

Авторы:

Marcio Pohlmann, Alex Severo, Gefté Almeida, Diego Kreutz, Tiago Heinrich, Lourenço Pereira

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

SOCs and CSIRTs face increasing pressure to automate incident categorization, yet the use of cloud-based LLMs introduces costs, latency, and confidentiality risks. We investigate whether locally executed SLMs can meet this challenge. We evaluated 21 models ranging from 1B to 20B parameters, varying the temperature hyperparameter and measuring execution time and precision across two distinct architectures. The results indicate that temperature has little influence on performance, whereas the numb...

ID: 2511.19464v1 cs.DC, cs.AI, cs.CR, cs.LG, cs.PF

arXiv PDF

📄 Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

2025-11-26

Авторы:

Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yuqi Zhou, Yicong Zhu, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, Qiang Liu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems...

ID: 2511.20172v1 cs.DC, cs.AI

arXiv PDF

Показано 1 - 10 из 86 записей