📊 Digest statistics
Total digests: 34022
Added today: 82
Last updated: today
Authors:
El Mahdi Chayti, Martin Jaggi
Russian summary not found
Available fields: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Stochastic difference-of-convex (DC) optimization is prevalent in numerous
machine learning applications, yet its convergence properties under small batch
sizes remain poorly understood. Existing methods typically require large
batches or strong noise assumptions, which limit their practical use. In this
work, we show that momentum enables convergence under standard smoothness and
bounded variance assumptions (of the concave part) for any batch size. We prove
that without momentum, convergence m...
📄 Cautious Weight Decay
2025-10-16
Authors:
Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
Russian summary not found
Annotation:
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic
modification that applies weight decay only to parameter coordinates whose
signs align with the optimizer update. Unlike standard decoupled decay, which
implicitly optimizes a regularized or constrained objective, CWD preserves the
original loss and admits a bilevel interpretation: it induces sliding-mode
behavior upon reaching the stationary manifold, allowing it to search for
locally Pareto-optimal stationary points of th...
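
The masking rule in the annotation can be illustrated with a short sketch. It assumes that the "optimizer update" is the direction `update` that gets subtracted from the parameters, and that a coordinate counts as aligned when `param * update > 0`; both conventions, along with the function name and hyperparameter values, are illustrative assumptions rather than the authors' code.

```python
def cautious_weight_decay_step(params, update, lr=0.1, weight_decay=0.01):
    """One step of: params <- params - lr * (update + wd * mask * params).

    mask is 1 on coordinates where the parameter and the update have the
    same sign (assumed convention), and 0 elsewhere.
    """
    new_params = []
    for p, u in zip(params, update):
        aligned = p * u > 0  # signs agree: decay is applied on this coordinate
        decay = weight_decay * p if aligned else 0.0
        new_params.append(p - lr * (u + decay))
    return new_params


params = [1.0, -2.0, 3.0]
update = [0.5, 0.5, -1.0]
stepped = cautious_weight_decay_step(params, update, lr=0.1, weight_decay=0.1)
```

Only the first coordinate is decayed here, because it is the only one whose sign matches its update; the other two receive the plain optimizer step.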
Authors:
Thanh Dang, Jiaming Liang
Russian summary not found
Annotation:
We propose new Markov chain Monte Carlo algorithms to sample a uniform
distribution on a convex body $K$. Our algorithms are based on the Alternating
Sampling Framework/proximal sampler, which uses Gibbs sampling on an augmented
distribution and assumes access to the so-called restricted Gaussian oracle
(RGO). The key contribution of this work is the efficient implementation of RGO
for uniform sampling on $K$ via rejection sampling and access to either a
projection oracle or a separation oracle ...
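
To make the alternating structure concrete, here is a minimal sketch of a proximal sampler for the uniform distribution on the unit ball. It swaps in naive rejection sampling (propose from the Gaussian, accept if the point lies in the body) for the restricted Gaussian oracle, which is a simplification and not the projection- or separation-oracle implementation the annotation refers to. The membership test `in_unit_ball`, the step size `eta`, and all function names are hypothetical.

```python
import math
import random


def in_unit_ball(x):
    """Membership test for the convex body K (here: the Euclidean unit ball)."""
    return sum(v * v for v in x) <= 1.0


def rgo_by_rejection(y, eta, in_body, rng, max_tries=100_000):
    """Sample from N(y, eta * I) restricted to K by naive rejection."""
    sd = math.sqrt(eta)
    for _ in range(max_tries):
        x = [rng.gauss(yi, sd) for yi in y]
        if in_body(x):
            return x
    raise RuntimeError("rejection sampling did not accept a proposal")


def proximal_sampler(x0, eta, steps, in_body, seed=0):
    """Gibbs sampling on the augmented distribution over (x, y)."""
    rng = random.Random(seed)
    sd = math.sqrt(eta)
    x = list(x0)
    samples = []
    for _ in range(steps):
        # Forward step: y given x is Gaussian centered at x.
        y = [rng.gauss(xi, sd) for xi in x]
        # Backward step: x given y is that Gaussian restricted to K (the RGO).
        x = rgo_by_rejection(y, eta, in_body, rng)
        samples.append(x)
    return samples


samples = proximal_sampler([0.0, 0.0], eta=0.01, steps=200, in_body=in_unit_ball)
```

Naive rejection becomes expensive when `y` lands far outside the body, which is exactly the cost an efficient RGO implementation has to control.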
Authors:
Foivos Alimisis, Rustem Islamov, Aurelien Lucchi
Russian summary not found
Annotation:
Learning rate warm-up (increasing the learning rate at the beginning of
training) has become a ubiquitous heuristic in modern deep learning, yet its
theoretical foundations remain poorly understood. In this work, we provide a
principled explanation for why warm-up improves training. We rely on a
generalization of the $(L_0, L_1)$-smoothness condition, which bounds local
curvature as a linear function of the loss sub-optimality and exhibits
desirable closure properties. We demonstrate both theo...
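
For readers unfamiliar with the heuristic, a generic linear warm-up schedule can be sketched as follows. This is a common textbook form, not a schedule taken from the paper; `base_lr` and `warmup_steps` are illustrative values.

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=100):
    """Linear warm-up: ramp from base_lr / warmup_steps up to base_lr, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


schedule = [warmup_lr(t) for t in range(200)]
```

The learning rate grows linearly over the first `warmup_steps` steps and stays constant afterwards; in practice the constant phase is often replaced by a decay schedule.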
📄 Error Feedback for Muon and Friends
2025-10-04
Authors:
Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, Peter Richtárik
Russian summary not found
Annotation:
Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of
large-scale deep learning by exploiting layer-wise linear minimization oracles
(LMOs) over non-Euclidean norm balls, capturing neural network structure in
ways traditional algorithms cannot. Yet, no principled distributed framework
exists for these methods, and communication bottlenecks remain unaddressed. The
very few distributed variants are heuristic, with no convergence guarantees in
sight. We introduce EF21-Muon, the ...
📄 Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions
2025-10-04
Authors:
Camilo Andrés García Trillos, Nicolás García Trillos
Russian summary not found
Annotation:
We consider adversarially robust classification in a multiclass setting under
arbitrary loss functions and derive dual and barycentric reformulations of the
corresponding learner-agnostic robust risk minimization problem. We provide
explicit characterizations for important cases such as the cross-entropy loss,
loss functions with a power form, and the quadratic loss, extending in this way
available results for the 0-1 loss. These reformulations enable efficient
computation of sharp lower bounds ...
Authors:
Alexander Ryabchenko, Wenlong Mou
Russian summary not found
Annotation:
We study reinforcement learning problems where state observations are
stochastically triggered by actions, a constraint common in many real-world
applications. This framework is formulated as Action-Triggered Sporadically
Traceable Markov Decision Processes (ATST-MDPs), where each action has a
specified probability of triggering a state observation. We derive tailored
Bellman optimality equations for this framework and introduce the
action-sequence learning paradigm in which agents commit to exe...
📄 Drop-Muon: Update Less, Converge Faster
2025-10-04
Authors:
Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik
Russian summary not found
Annotation:
Conventional wisdom in deep learning optimization dictates updating all
layers at every step, a principle followed by all recent state-of-the-art
optimizers such as Muon. In this work, we challenge this assumption, showing
that full-network updates can be fundamentally suboptimal, both in theory and
in practice. We introduce a non-Euclidean Randomized Progressive Training
method, Drop-Muon, a simple yet powerful framework that updates only a subset of
layers per step according to a randomized sched...
Authors:
Sara Fridovich-Keil, Mert Pilanci
## Context
Modern machine learning relies heavily on neural networks, which are known for their expressive power but also for their high computational and memory demands. This poses significant challenges for deploying these models in resource-constrained environments, such as mobile devices and embedded systems. Sparse neural networks, which reduce the number of nonzero weights, offer a promising solution to these challenges. However, achieving sparse recovery—accurately recovering the sparse weight configuration of a neural network—remains a significant theoretical and practical problem. Existing approaches, such as iterative magnitude pruning, often struggle with efficiency and accuracy. This study addresses these limitations by providing the first theoretical guarantees for sparse recovery in ReLU neural networks, focusing on two-layer, scalar-output networks.
## Method
The proposed methodology centers on analyzing structural properties of sparse neural networks and developing an efficient recovery algorithm. Specifically, the study focuses on two-layer ReLU neural networks with scalar outputs. It introduces an iterative hard thresholding (IHT) algorithm, which systematically prunes small weights while updating remaining ones to optimize network performance. The algorithm operates with memory requirements that scale linearly with the number of nonzero weights, making it highly efficient. Structural assumptions, such as sparsity patterns and activation properties, are analyzed to ensure recovery guarantees. These theoretical insights are then validated through practical experiments on diverse tasks, including planted network recovery, MNIST classification, and implicit neural representation learning.
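
The prune-and-update loop described above can be sketched in its classical form. The example below applies iterative hard thresholding to a sparse linear model, which is a deliberate simplification: it is not the two-layer ReLU setting analyzed in the paper, and the function names, step size, and toy data are assumptions made for illustration.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out


def iht(A, y, k, step=0.5, iters=100):
    """Classical IHT for y ~ A x with a k-sparse x: gradient step, then prune."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual of the current sparse estimate.
        residual = [yi - sum(a * xj for a, xj in zip(row, x))
                    for row, yi in zip(A, y)]
        # Gradient step on the least-squares loss.
        moved = [x[j] + step * sum(A[i][j] * residual[i] for i in range(len(A)))
                 for j in range(n)]
        # Pruning step: keep only the k largest weights.
        x = hard_threshold(moved, k)
    return x


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
recovered = iht(identity, [0.1, 3.0, -0.05, -2.0], k=2)
```

In this toy case the two large entries are retained and the two small ones are pruned. Only the current k-sparse iterate has to be kept between iterations, which is the kind of memory footprint the summary refers to.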
## Results
Theoretical analysis demonstrates that the IHT algorithm can exactly recover sparse weight configurations of two-layer ReLU networks under specific structural conditions. Empirical experiments validate these findings. For instance, on planted MLP recovery tasks, the algorithm achieves perfect recovery with high probability while significantly reducing memory usage compared to baseline methods. In MNIST classification, sparse networks recovered by the IHT algorithm demonstrate competitive accuracy with a fraction of the parameters. Additionally, the method shows promise in implicit neural representations, where it outperforms iterative magnitude pruning in certain scenarios. These results highlight the robustness and efficiency of the proposed approach.
## Significance
The study provides a theoretical foundation for sparse recovery in ReLU neural networks, addressing a critical gap in the literature. Its practical implications are substantial: the proposed method offers a memory-efficient alternative to traditional pruning techniques, enabling the deployment of sparse neural networks on devices with limited computational resources. Potential applications include edge computing, mobile AI, and real-time processing. Furthermore, the findings contribute to the broader understanding of sparse optimization in neural networks, paving the way for advancements in model compression, interpretability, and energy efficiency.
## Conclusions
This work establishes the first recovery guarantees for sparse neural networks, showcasing the effectiveness of the IHT algorithm in recovering sparse weight configurations of two-layer ReLU networks. Experimental results demonstrate competitive performance compared to state-of-the-art methods, with significant memory savings. Future research will focus on extending these results to deeper networks, exploring the role of initialization in recovery guarantees, and developing adaptive pruning strategies for more complex architectures. These directions hold promise for advancing the scalability and efficiency of neural network deployment.
Annotation:
We prove the first guarantees of sparse recovery for ReLU neural networks,
where the sparse network weights constitute the signal to be recovered.
Specifically, we study structural properties of the sparse network weights for
two-layer, scalar-output networks under which a simple iterative hard
thresholding algorithm recovers these weights exactly, using memory that grows
linearly in the number of nonzero weights. We validate this theoretical result
with simple experiments on recovery of sparse ...
Authors:
Raphaël Berthier
## Context
The research area is the theory of neural network training, in particular diagonal linear networks. These are neural networks with linear activations and diagonal weight matrices. Their theoretical analysis is fairly well developed: it is known that, from a small initialization, their training converges to the linear predictor with minimal 1-norm among minimizers of the loss function. This study aims at a deeper analysis of that behavior. The goal is to establish a connection between the training trajectory of diagonal linear networks and the regularization path of the LASSO (Least Absolute Shrinkage and Selection Operator). This connection can be useful for understanding and modeling optimization processes in neural networks.
## Method
The methodology rests on analyzing homotopy connections between the training trajectory of diagonal linear networks and the LASSO regularization path. To this end, it uses theorems on the geometry of optimal solutions and an analysis of the local behavior of the training process. Particular attention is paid to how the training trajectory depends on the initialization parameters and the regularization parameters (in particular, the training time). The architecture of the networks studied is defined by a given number of layers, diagonal weight matrices, and linear activation. Simulations are also included to check the theoretical conclusions.
## Results
Experiments showed that the training trajectory of diagonal linear networks can be equivalent to the LASSO regularization path, provided the latter is monotone. In the non-monotone case, approximate results are obtained that confirm the closeness of the two trajectories. During training, the weights of diagonal networks change so as to minimize the deviation from the linear predictor with minimal 1-norm. These experiments confirmed the theoretical predictions, showing a clear dependence between training time and the LASSO regularization parameter.
## Significance
The results matter for the theoretical understanding of neural networks and their regularization processes. This connection makes it possible to transfer knowledge and LASSO regularization methods to other models, such as neural networks with diagonal matrices. Future research may aim to extend this model to more complex networks, such as nonlinear ones, and to study other regularizers.
## Conclusions
In the course of the study ...
Annotation:
Diagonal linear networks are neural networks with linear activation and
diagonal weight matrices. Their theoretical interest is that their implicit
regularization can be rigorously analyzed: from a small initialization, the
training of diagonal linear networks converges to the linear predictor with
minimal 1-norm among minimizers of the training loss. In this paper, we deepen
this analysis showing that the full training trajectory of diagonal linear
networks is closely related to the lasso regul...
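
The implicit bias toward a minimal 1-norm solution can be observed numerically in a small sketch. It assumes the common parametrization `w = u*u - v*v` (elementwise) with both factors initialized at a small value `alpha`, trained by plain gradient descent on a squared loss; this parametrization, the hyperparameters, and the toy problem are assumptions chosen for illustration and may differ from the paper's setup.

```python
def train_diagonal_linear_net(X, y, alpha=0.01, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * sum((X w - y)^2) with w = u*u - v*v."""
    d = len(X[0])
    u = [alpha] * d
    v = [alpha] * d
    for _ in range(steps):
        w = [ui * ui - vi * vi for ui, vi in zip(u, v)]
        residual = [sum(xj * wj for xj, wj in zip(row, w)) - yi
                    for row, yi in zip(X, y)]
        # Gradient of the loss with respect to the effective weights w.
        g = [sum(residual[i] * X[i][j] for i in range(len(X))) for j in range(d)]
        # Chain rule: dw/du = 2u and dw/dv = -2v.
        u = [ui - lr * 2.0 * gi * ui for ui, gi in zip(u, g)]
        v = [vi + lr * 2.0 * gi * vi for vi, gi in zip(v, g)]
    return [ui * ui - vi * vi for ui, vi in zip(u, v)]


# One equation, two unknowns: every w with w[0] + 2 * w[1] = 2 fits the data.
w = train_diagonal_linear_net([[1.0, 2.0]], [2.0])
```

Among all interpolating solutions of this toy problem, the one with the smallest 1-norm is `(0, 1)`, while the smallest 2-norm solution is `(0.4, 0.8)`. With a small `alpha`, the trained weights should land close to the former.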
Showing 11-20 of 34 entries