Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
2510.13713v1
cs.LG, math.OC
2025-10-17
Authors:
Christophe Roux, Max Zimmer, Alexandre d'Aspremont, Sebastian Pokutta
Abstract
Pruning is a common technique to reduce the compute and storage requirements
of neural networks. While conventional approaches typically retrain the model
to recover pruning-induced performance degradation, state-of-the-art Large
Language Model (LLM) pruning methods operate layer-wise, minimizing the
per-layer pruning error on a small calibration dataset to avoid full
retraining, which is considered computationally prohibitive for LLMs. However,
finding the optimal pruning mask is a hard combinatorial problem and solving it
to optimality is intractable. Existing methods hence rely on greedy heuristics
that ignore the weight interactions in the pruning objective. In this work, we
instead consider the convex relaxation of these combinatorial constraints and
solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method
drastically reduces the per-layer pruning error, outperforms strong baselines
on state-of-the-art GPT architectures, and remains memory-efficient. We provide
theoretical justification by showing that, combined with the convergence
guarantees of the FW algorithm, we obtain an approximate solution to the
original combinatorial problem upon rounding the relaxed solution to
integrality.
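
The relax-then-round idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the objective is written for a single weight row `w` with a calibration Gram matrix `G = X X^T`, the budget `k` counts kept weights, the step size is the standard `2 / (t + 2)` rule, and the function names (`lmo_topk`, `fw_prune_row`, `pruning_error`) are hypothetical. The relaxed feasible set is `{m in [0,1]^d : sum(m) <= k}`, over which the Frank-Wolfe linear minimization step reduces to a top-k selection.

```python
import numpy as np

def lmo_topk(grad, k):
    """Linear minimization oracle over {s in [0,1]^d : sum(s) <= k}.

    Minimizing <grad, s> sets s to 1 on the (at most k) coordinates with the
    most negative gradient and to 0 everywhere else.
    """
    s = np.zeros_like(grad)
    idx = np.argsort(grad)[:k]       # k smallest gradient entries
    s[idx[grad[idx] < 0]] = 1.0      # keep only those that decrease the objective
    return s

def pruning_error(w, G, mask):
    """Per-row pruning error (w - mask*w)^T G (w - mask*w)."""
    r = w - mask * w
    return float(r @ G @ r)

def fw_prune_row(w, G, k, steps=200):
    """Frank-Wolfe on the relaxed mask, then round to exactly k kept weights.

    Illustrative sketch only; the paper's exact objective, constraint set,
    and step-size rule may differ.
    """
    m = np.zeros_like(w)             # feasible starting point (everything pruned)
    for t in range(steps):
        r = w - m * w                # residual weights under the relaxed mask
        grad = -2.0 * w * (G @ r)    # gradient of r^T G r with respect to m
        s = lmo_topk(grad, k)
        gamma = 2.0 / (t + 2.0)      # standard open-loop step size
        m = (1.0 - gamma) * m + gamma * s
    mask = np.zeros_like(w)
    mask[np.argsort(-m)[:k]] = 1.0   # rounding: keep the k largest relaxed entries
    return mask, m
```

A call such as `mask, m = fw_prune_row(w, G, k=4)` returns both the binary mask and the relaxed solution, so `pruning_error(w, G, mask)` and `pruning_error(w, G, m)` can be compared to see how much the rounding step costs on a given row.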