Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
2510.16088v1
cs.CV, cs.LG, stat.ML
2025-10-22
Авторы:
Zia Badar
Abstract
Quantization of neural networks provides benefits of inference in less
compute and memory requirements. Previous work in quantization lack two
important aspects which this work provides. First almost all previous work in
quantization used a non-differentiable approach and for learning; the
derivative is usually set manually in backpropogation which make the learning
ability of algorithm questionable, our approach is not just differentiable, we
also provide proof of convergence of our approach to the optimal neural
network. Second previous work in shift/logrithmic quantization either have
avoided activation quantization along with weight quantization or achieved less
accuracy. Learning logrithmic quantize values of form $2^n$ requires the
quantization function can scale to more than 1 bit quantization which is
another benifit of our quantization that it provides $n$ bits quantization as
well. Our approach when tested with image classification task using imagenet
dataset, resnet18 and weight quantization only achieves less than 1 percent
accuracy compared to full precision accuracy while taking only 15 epochs to
train using shift bit quantization and achieves comparable to SOTA approaches
accuracy in both weight and activation quantization using shift bit
quantization in 15 training epochs with slightly higher(only higher cpu
instructions) inference cost compared to 1 bit quantization(without logrithmic
quantization) and not requiring any higher precision multiplication.
Ссылки и действия
Дополнительные ресурсы: