Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware

2510.11484v1 cs.LG, cs.AR 2025-10-15

Authors:

Lion Mueller, Alberto Garcia-Ortiz, Ardalan Najafi, Adam Fuks, Lennart Bamberg

Abstract

Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate the accuracy degradation associated with post-training quantization but still overlooks the impact of integer rescaling during inference, a hardware-costly operation in integer-only AI inference. This work shows that rescaling cost can be dramatically reduced post-training by applying a stronger quantization to the rescale multiplicands at no model-quality loss. Furthermore, we introduce Rescale-Aware Training, a fine-tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x reduced rescaler widths, full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.
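
To make the rescaling step in the abstract concrete, here is a minimal Python sketch of the common fixed-point scheme in which a real-valued rescale factor is approximated by an integer multiplicand and a right shift. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize_scale` and `rescale`, the `bits` parameter, the round-half-up rule, and the int8 output range are all hypothetical choices made for this example.

```python
import math


def quantize_scale(scale: float, bits: int) -> tuple[int, int]:
    """Approximate scale (assumed 0 < scale < 1) as m * 2**-shift, with m < 2**bits.

    `bits` is the width of the rescale multiplicand (a hypothetical knob here).
    """
    if not 0.0 < scale < 1.0:
        raise ValueError("this sketch assumes 0 < scale < 1")
    mant, exp = math.frexp(scale)      # scale == mant * 2**exp, 0.5 <= mant < 1
    m = round(mant * (1 << bits))      # integer multiplicand
    shift = bits - exp                 # scale ~= m * 2**-shift
    if m == (1 << bits):               # rounding overflowed the bit width
        m >>= 1
        shift -= 1
    return m, shift


def rescale(acc: int, m: int, shift: int) -> int:
    """Rescale an integer accumulator and clamp to int8 (assumed output type)."""
    rounding = (1 << (shift - 1)) if shift > 0 else 0   # round half up
    out = (acc * m + rounding) >> shift                 # integer-only arithmetic
    return max(-128, min(127, out))


if __name__ == "__main__":
    scale = 0.0123                     # illustrative value, not from the paper
    for bits in (16, 8, 4):
        m, shift = quantize_scale(scale, bits)
        approx = m * 2.0 ** -shift
        print(bits, m, shift, approx, rescale(1000, m, shift))
```

A narrower multiplicand means a smaller multiplier in hardware, at the price of a coarser approximation of the scale; with this rounding the relative error of the approximated scale is bounded by `2**-bits`.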

Related articles

ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via L...

Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems

ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings

SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservice...
