CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
2509.25996v1
cs.LG, cs.CL
2025-10-02
Authors:
Weiyu Huang, Yuezhou Hu, Jun Zhu, Jianfei Chen
Abstract
Sparsity-aware training is an effective approach for transforming large
language models (LLMs) into hardware-friendly sparse patterns, thereby reducing
latency and memory consumption during inference. In this paper, we propose
Continuous Adaptive Sparse Trainer (CAST), a fully continuous and
differentiable sparsity-aware training framework for semi-structured (or "N:M")
sparse models. Unlike previous approaches that optimize sparsity patterns and
weights separately, CAST enables seamless joint optimization during training,
while progressively transforming the model into the desired sparsity format.
Specifically, CAST introduces three key components: 1) AdamS, a sparsity-aware
optimizer that leverages adaptive L1 decay to promote uniform sparsification
across all parameters; 2) Weight Scaling, a module designed to mitigate the
magnitude reduction caused by decay while preserving desired sparsity patterns;
3) Knowledge Distillation, which employs the dense model as a self-teacher to
enhance training efficiency. We evaluate CAST under 2:4 sparsity patterns
across multiple model families, ranging from 125M to 13B parameters. Our
results demonstrate significant improvements over previous state-of-the-art
methods in both perplexity and zero-shot accuracy with minimal training
resources. Notably, on LLaMA2-7B, our 2:4 sparse model achieves a negligible
perplexity increase of 0.09 and a 0.36% gain in zero-shot accuracy compared to
the dense model using only 2% of the original pretraining tokens. Additionally,
we establish an accurate and robust empirical scaling law to predict sparse
model performance given adequate training resources. Finally, we demonstrate
the practical applicability of our sparse models by evaluating them under
quantization and fine-tuning scenarios.
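
To make the "N:M" format mentioned in the abstract concrete, the sketch below shows what a 2:4 pattern looks like: in every group of four consecutive weights, at most two are nonzero. It uses plain one-shot magnitude pruning as a stand-in, which is not the CAST procedure (CAST reaches the pattern gradually during training); the function name `nm_prune` and the example values are hypothetical and chosen only for illustration.

```python
from typing import List


def nm_prune(weights: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude entries in each group of m weights; zero the rest.

    Illustrative one-shot magnitude pruning, not the CAST training method.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries within this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse_row = nm_prune(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four keeps exactly its two largest-magnitude entries, which is the layout that hardware with 2:4 sparse support expects.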