📊 Статистика дайджестов
Всего дайджестов: 34022 Добавлено сегодня: 82
Последнее обновление: сегодня
📄 Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
2025-12-05Авторы:
Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical t...
Авторы:
Damek Davis, Dmitriy Drusvyatskiy
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobeni...
Авторы:
Kaiyi Ji
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $Ω(κ^{3/2}ε^{-2})$ oracle c...
📄 Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization
2025-11-27Авторы:
Peng Zhao, Yu-Hu Yan, Hang Yu, Zhi-Hua Zhou
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Universal online learning aims to achieve optimal regret guarantees without requiring prior knowledge of the curvature of online functions. Existing methods have established minimax-optimal regret bounds for universal online learning, where a single algorithm can simultaneously attain $\mathcal{O}(\sqrt{T})$ regret for convex functions, $\mathcal{O}(d \log T)$ for exp-concave functions, and $\mathcal{O}(\log T)$ for strongly convex functions, where $T$ is the number of rounds and $d$ is the dime...
Авторы:
Wei-Cheng Lee, Francesco Orabona
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
In this short note, we present a simple derivation of the best-of-both-world guarantee for the Tsallis-INF multi-armed bandit algorithm from J. Zimmert and Y. Seldin. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1-49, 2021. URL https://jmlr.csail.mit.edu/papers/volume22/19-753/19-753.pdf. In particular, the proof uses modern tools from online convex optimization and avoid the use of conjugate functions. Also, we do not opt...
Авторы:
Semih Cayci
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
An important question in deep learning is how higher-order optimization
methods affect generalization. In this work, we analyze a stochastic
Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch
sampling for training overparameterized deep neural networks with smooth
activations in a regression setting. Our theoretical contributions are twofold.
First, we establish finite-time convergence bounds via a variable-metric
analysis in parameter space, with explicit dependencies on ...
Авторы:
Xuheng Li, Quanquan Gu
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Variance-dependent regret bounds have received increasing attention in recent
studies on contextual bandits. However, most of these studies are focused on
upper confidence bound (UCB)-based bandit algorithms, while sampling based
bandit algorithms such as Thompson sampling are still understudied. The only
exception is the LinVDTS algorithm (Xu et al., 2023), which is limited to
linear reward function and its regret bound is not optimal with respect to the
model dimension. In this paper, we prese...
📄 ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
2025-11-06Авторы:
Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy
for finetuning large language models (LLMs) because it eliminates the memory
overhead of backpropagation. However, it converges slowly due to the inherent
curse of dimensionality when searching for descent directions in the
high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a
novel zeroth-order optimizer that accelerates convergence by adaptive
directional sampling. Instead of drawing the direc...
📄 Rethinking PCA Through Duality
2025-10-23Авторы:
Jan Quan, Johan Suykens, Panagiotis Patrinos
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Motivated by the recently shown connection between self-attention and
(kernel) principal component analysis (PCA), we revisit the fundamentals of
PCA. Using the difference-of-convex (DC) framework, we present several novel
formulations and provide new theoretical insights. In particular, we show the
kernelizability and out-of-sample applicability for a PCA-like family of
problems. Moreover, we uncover that simultaneous iteration, which is connected
to the classical QR algorithm, is an instance o...
Авторы:
Jan Sobotka, Petr Šimánek, Pavel Kordík
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Fractional Gradient Descent (FGD) offers a novel and promising way to
accelerate optimization by incorporating fractional calculus into machine
learning. Although FGD has shown encouraging initial results across various
optimization tasks, it faces significant challenges with convergence behavior
and hyperparameter selection. Moreover, the impact of its hyperparameters is
not fully understood, and scheduling them is particularly difficult in
non-convex settings such as neural network training. T...
Показано 1 -
10
из 34 записей