Block Rotation is All You Need for MXFP4 Quantization
2511.04214v1
cs.LG, cs.CL
2025-11-08
Авторы:
Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng
Abstract
Large language models (LLMs) have achieved remarkable success, but their
rapidly growing scale imposes prohibitive costs in memory, computation, and
energy. Post-training quantization (PTQ) is a promising solution for efficient
deployment, yet achieving accurate W4A4 quantization remains an open challenge.
While most existing methods are designed for INT4 formats, the emergence of
MXFP4 -- a new FP4 format with various hardware support (NVIDIA, AMD, Intel)--
raises questions about the applicability of current techniques. In this work,
we establish a comprehensive benchmark of PTQ methods under the MXFP4 format.
Through systematic evaluation, we find that methods like GPTQ consistently
deliver strong performance, whereas rotation-based approaches, which are almost
used by all state-of-the-art approaches, suffer from severe incompatibility
with MXFP4. We further provide the first in-depth analysis of this conflict,
tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two)
block scaling and the redistribution of outlier energy via global rotation.
Building on this insight, we propose a simple yet effective block rotation
strategy that adapts rotation-based methods to MXFP4, leading to substantial
accuracy improvements across diverse LLMs. Our findings not only offer clear
guidance for practitioners but also set a foundation for advancing PTQ research
under emerging low-precision formats.
Ссылки и действия
Дополнительные ресурсы: