📊 Digest statistics

Total digests: 34022 Added today: 82

Last updated: today

📄 Stochastic Difference-of-Convex Optimization with Momentum

2025-10-22

Authors:

El Mahdi Chayti, Martin Jaggi


Annotation:

Stochastic difference-of-convex (DC) optimization is prevalent in numerous machine learning applications, yet its convergence properties under small batch sizes remain poorly understood. Existing methods typically require large batches or strong noise assumptions, which limit their practical use. In this work, we show that momentum enables convergence under standard smoothness and bounded variance assumptions (of the concave part) for any batch size. We prove that without momentum, convergence m...

ID: 2510.17503v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Cautious Weight Decay

2025-10-16

Authors:

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu


Annotation:

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of th...
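The rule described in this annotation (decay only the coordinates whose sign matches the optimizer update) can be illustrated with a minimal sketch. It assumes the convention that the update `u` is subtracted from the parameters, so a coordinate is "aligned" when `p * u > 0`; the function name, argument names, and the plain-list representation are hypothetical and not taken from the paper.

```python
def cautious_weight_decay_step(params, update, lr, weight_decay):
    """One parameter step with sign-masked ("cautious") weight decay.

    Assumed convention: new_p = p - lr * (u + mask * weight_decay * p),
    where mask is 1 only if p and u share the same (nonzero) sign.
    """
    new_params = []
    for p, u in zip(params, update):
        aligned = p * u > 0  # same sign: the step and the decay both shrink |p|
        decay = weight_decay * p if aligned else 0.0
        new_params.append(p - lr * (u + decay))
    return new_params


if __name__ == "__main__":
    # Only the first coordinate has matching signs, so only it is decayed.
    print(cautious_weight_decay_step([1.0, -2.0, 3.0], [0.5, 0.5, -1.0],
                                     lr=0.1, weight_decay=0.1))
```

Setting the mask to 1 everywhere would give ordinary decoupled weight decay under the same convention, which is why the change amounts to a single line.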

ID: 2510.12402v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Oracle-based Uniform Sampling from Convex Bodies

2025-10-07

Authors:

Thanh Dang, Jiaming Liang


Annotation:

We propose new Markov chain Monte Carlo algorithms to sample a uniform distribution on a convex body $K$. Our algorithms are based on the Alternating Sampling Framework/proximal sampler, which uses Gibbs sampling on an augmented distribution and assumes access to the so-called restricted Gaussian oracle (RGO). The key contribution of this work is the efficient implementation of RGO for uniform sampling on $K$ via rejection sampling and access to either a projection oracle or a separation oracle ...
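To make the alternating structure concrete, here is a rough sketch of the proximal sampler for a uniform target on a convex body, with the restricted Gaussian oracle implemented by naive rejection against a membership test. This is not the paper's implementation, which relies on a projection or separation oracle; the choice of the unit ball as $K$, the step size `eta`, and all function names are assumptions made for illustration.

```python
import math
import random


def in_unit_ball(x):
    """Membership test for the Euclidean unit ball (a hypothetical choice of K)."""
    return sum(c * c for c in x) <= 1.0


def restricted_gaussian_oracle(y, eta, member, rng, max_tries=100_000):
    """Draw x ~ N(y, eta * I) conditioned on x in K, by naive rejection."""
    sigma = math.sqrt(eta)
    for _ in range(max_tries):
        x = [rng.gauss(c, sigma) for c in y]
        if member(x):
            return x
    raise RuntimeError("rejection sampling did not accept; try a smaller eta")


def proximal_sampler(x0, eta, member, n_steps, seed=0):
    """Alternate y | x ~ N(x, eta * I) and x | y ~ N(y, eta * I) restricted to K."""
    rng = random.Random(seed)
    sigma = math.sqrt(eta)
    x = list(x0)
    for _ in range(n_steps):
        y = [rng.gauss(c, sigma) for c in x]                  # Gaussian step
        x = restricted_gaussian_oracle(y, eta, member, rng)   # RGO step
    return x


if __name__ == "__main__":
    print(proximal_sampler([0.0, 0.0], eta=0.1, member=in_unit_ball, n_steps=50))
```

Naive rejection becomes slow when `y` lands far outside $K$ or the dimension is large, which is one reason an efficient RGO implementation matters.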

ID: 2510.02983v1 cs.DS, cs.LG, math.OC, stat.ML

arXiv PDF

📄 Why Do We Need Warm-up? A Theoretical Perspective

2025-10-07

Authors:

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi


Annotation:

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theo...
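As a concrete instance of the heuristic, the sketch below implements a linear warm-up schedule, one common form of warm-up. The linear shape, the function name, and the default values are assumptions for illustration; the annotation does not say which schedule the paper analyzes.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=100):
    """Linearly ramp the learning rate up to base_lr over warmup_steps steps.

    `step` is zero-based; after the warm-up phase the rate stays at base_lr.
    """
    if warmup_steps <= 0:
        return base_lr
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale


if __name__ == "__main__":
    for s in (0, 1, 2, 3, 10):
        print(s, warmup_lr(s, base_lr=1.0, warmup_steps=4))
```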

ID: 2510.03164v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Error Feedback for Muon and Friends

2025-10-04

Authors:

Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, Peter Richtárik


Annotation:

Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet, no principled distributed framework exists for these methods, and communication bottlenecks remain unaddressed. The very few distributed variants are heuristic, with no convergence guarantees in sight. We introduce EF21-Muon, the ...

ID: 2510.00643v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions

2025-10-04

Authors:

Camilo Andrés García Trillos, Nicolás García Trillos


Annotation:

We consider adversarially robust classification in a multiclass setting under arbitrary loss functions and derive dual and barycentric reformulations of the corresponding learner-agnostic robust risk minimization problem. We provide explicit characterizations for important cases such as the cross-entropy loss, loss functions with a power form, and the quadratic loss, extending in this way available results for the 0-1 loss. These reformulations enable efficient computation of sharp lower bounds ...

ID: 2510.01969v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Reinforcement Learning with Action-Triggered Observations

2025-10-04

Authors:

Alexander Ryabchenko, Wenlong Mou


Annotation:

We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm in which agents commit to exe...

ID: 2510.02149v1 cs.LG, math.OC, stat.ML, 68T05 (Primary), 62L05, 68W27 (Secondary)

arXiv PDF

📄 Drop-Muon: Update Less, Converge Faster

2025-10-04

Authors:

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik


Annotation:

Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method, Drop-Muon, a simple yet powerful framework that updates only a subset of layers per step according to a randomized sched...

ID: 2510.02239v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 A Recovery Guarantee for Sparse Neural Networks

2025-09-26

Authors:

Sara Fridovich-Keil, Mert Pilanci

## Context

Modern machine learning relies heavily on neural networks, which are known for their expressive power but also for their high computational and memory demands. This makes them hard to deploy in resource-constrained environments such as mobile devices and embedded systems. Sparse neural networks, which reduce the number of nonzero weights, offer a promising solution. However, sparse recovery, that is, accurately recovering the sparse weight configuration of a neural network, remains a significant theoretical and practical problem. Existing approaches, such as iterative magnitude pruning, often struggle with efficiency and accuracy. This study addresses these limitations by providing the first theoretical guarantees for sparse recovery in ReLU neural networks, focusing on two-layer, scalar-output networks.

## Method

The methodology centers on analyzing structural properties of sparse neural networks and developing an efficient recovery algorithm for two-layer ReLU networks with scalar outputs. It uses an iterative hard thresholding (IHT) algorithm, which repeatedly prunes small weights while updating the remaining ones. The algorithm's memory requirements scale linearly with the number of nonzero weights. Structural assumptions, such as sparsity patterns and activation properties, are analyzed to obtain recovery guarantees. The theory is then checked in experiments on planted network recovery, MNIST classification, and implicit neural representation learning.

## Results

Theoretical analysis shows that the IHT algorithm can exactly recover the sparse weight configurations of two-layer ReLU networks under specific structural conditions. Experiments support these findings. On planted MLP recovery tasks, the algorithm achieves perfect recovery with high probability while using significantly less memory than baseline methods. In MNIST classification, sparse networks recovered by IHT reach competitive accuracy with a fraction of the parameters. The method also shows promise for implicit neural representations, where it outperforms iterative magnitude pruning in certain scenarios.

## Significance

The study provides a theoretical foundation for sparse recovery in ReLU neural networks, addressing a gap in the literature. In practice, the method offers a memory-efficient alternative to traditional pruning techniques, which helps when deploying sparse networks on devices with limited computational resources. Potential applications include edge computing, mobile AI, and real-time processing. The findings also contribute to the broader understanding of sparse optimization in neural networks, with relevance to model compression, interpretability, and energy efficiency.

## Conclusions

This work establishes the first recovery guarantees for sparse neural networks and shows that the IHT algorithm can recover the sparse weight configurations of two-layer ReLU networks. Experimental results demonstrate competitive performance compared to state-of-the-art methods, with significant memory savings. Future research will focus on extending these results to deeper networks, exploring the role of initialization in recovery guarantees, and developing adaptive pruning strategies for more complex architectures.
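To illustrate the idea of iterative hard thresholding, here is a minimal sketch of the classic IHT iteration for sparse linear least squares. It is a stand-in for intuition only, not the paper's algorithm for two-layer ReLU networks. The function names, the dense list-of-lists matrix, and the step size are assumptions, and this toy version stores dense vectors, so it does not reproduce the linear-memory property described above.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def iht(A, y, k, step=1.0, n_iters=50):
    """Classic IHT for min ||y - A x||^2 subject to x having at most k nonzeros."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        # Residual r = y - A x
        residual = [y[r] - sum(A[r][c] * x[c] for c in range(n_cols))
                    for r in range(n_rows)]
        # Gradient step x + step * A^T r, followed by pruning to k entries
        moved = [x[c] + step * sum(A[r][c] * residual[r] for r in range(n_rows))
                 for c in range(n_cols)]
        x = hard_threshold(moved, k)
    return x


if __name__ == "__main__":
    identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    print(iht(identity, [3.0, 0.1, -2.0, 0.05], k=2))
```

With an identity design matrix the iteration simply keeps the two largest entries of `y`; with a general matrix the step size must be small enough for the gradient step to be stable.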

Annotation:

We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse ...

ID: 2509.20323v1 cs.LG, math.OC, stat.ML

arXiv PDF

📄 Diagonal Linear Networks and the Lasso Regularization Path

2025-09-25

Authors:

Raphaël Berthier

## Context

The research concerns the theory of training neural networks, specifically diagonal linear networks: neural networks with linear activations and diagonal weight matrices. Their theoretical analysis is fairly well developed. In particular, it is known that from a small initialization their optimization converges to the linear predictor with minimal 1-norm among the minimizers of the loss function. This study deepens the analysis of that behavior, aiming to establish a connection between the training trajectory of diagonal linear networks and the regularization path of the lasso (Least Absolute Shrinkage and Selection Operator). This connection may help in understanding and modeling optimization processes in neural networks.

## Method

The methodology rests on analyzing homotopy connections between the training trajectory of diagonal linear networks and the lasso regularization path. It uses theorems on the geometry of optimal solutions and an analysis of the local behavior of training. Particular attention is paid to how the training trajectory depends on the initialization parameters and on the regularization parameters (in particular, training time). The architecture of the networks studied is defined by a given number of layers, diagonal weight matrices, and linear activation. Simulations are also included to check the theoretical conclusions.

## Results

Experiments showed that the training trajectory of diagonal linear networks can be equivalent to the lasso regularization path, provided the latter is monotone. In the non-monotone case, approximate results were obtained that confirm the trajectories are close. During training, the weights of the diagonal networks change so as to minimize the deviation from the linear predictor with minimal 1-norm. These experiments confirmed the theoretical predictions, showing a clear dependence between training time and the lasso regularization parameter.

## Significance

The results matter for the theoretical understanding of neural networks and their regularization. The connection makes it possible to transfer knowledge and methods of lasso regularization to other models, such as neural networks with diagonal matrices. Future research could extend this model to more complex networks, such as nonlinear ones, and study other regularizers.

## Conclusions

In the course of the study ...

Annotation:

Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis showing that the full training trajectory of diagonal linear networks is closely related to the lasso regul...
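For reference, the lasso regularization path is commonly defined as the family of solutions indexed by the regularization parameter. The symbols $X$, $y$, $\beta$, $\lambda$ and the $\tfrac{1}{2}$ scaling below are an assumed, standard convention and are not taken from this listing; normalizations vary between references.

```latex
\hat{\beta}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^d}
  \; \frac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda\,\lVert \beta \rVert_1,
  \qquad \lambda > 0 .
```

As $\lambda \to 0^+$, lasso solutions approach a minimizer of the squared loss with minimal 1-norm, which is the same kind of predictor that the annotation describes as the limit of training from a small initialization.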

ID: 2509.18766v1 cs.LG, math.OC, stat.ML, 62J07, 68T07, G.3

arXiv PDF

Showing 11 - 20 of 34 entries