TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training
2511.03983v1
cs.LG, math.OC
2025-11-08
Авторы:
Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis
Abstract
We introduce TwIST, a distributed training framework for efficient large
language model (LLM) sparsification. TwIST trains multiple subnetworks in
parallel, periodically aggregates their parameters, and resamples new
subnetworks during training. This process identifies high-quality subnetworks
("golden tickets") without requiring post-training procedures such as
calibration or Hessian-based recovery. As a result, TwIST enables zero-cost
pruning at deployment time while achieving perplexity competitive with
state-of-the-art post-training sparsification methods. The benefits are most
pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly
outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64
for the closest prior approach. Unlike unstructured pruning, TwIST produces
structured, dense matrices that offer practical inference speedups and memory
reductions on commodity hardware (e.g., CPUs) that do not support efficient
sparse computation. TwIST provides an efficient training-time path to
deployable sparse LLMs without additional fine-tuning or recovery overhead.
Ссылки и действия
Дополнительные ресурсы: