DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
2511.04063v1
cs.LG, cs.CL
2025-11-08
Авторы:
Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng
Abstract
Quantization plays a crucial role in accelerating the inference of
large-scale models, and rotational matrices have been shown to effectively
improve quantization performance by smoothing outliers. However, end-to-end
fine-tuning of rotational optimization algorithms incurs high computational
costs and is prone to overfitting. To address this challenge, we propose an
efficient distribution-aware rotational calibration method, DartQuant, which
reduces the complexity of rotational optimization by constraining the
distribution of the activations after rotation. This approach also effectively
reduces reliance on task-specific losses, thereby mitigating the risk of
overfitting. Additionally, we introduce the QR-Orth optimization scheme, which
replaces expensive alternating optimization with a more efficient solution. In
a variety of model quantization experiments, DartQuant demonstrates superior
performance. Compared to existing methods, it achieves 47$\times$ acceleration
and 10$\times$ memory savings for rotational optimization on a 70B model.
Furthermore, it is the first to successfully complete rotational calibration
for a 70B model on a single 3090 GPU, making quantization of large language
models feasible in resource-constrained environments. Code is available at
https://github.com/CAS-CLab/DartQuant.git.
Ссылки и действия
Дополнительные ресурсы: