MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

2508.02343v1 cs.LG, cs.AI 2025-08-09

Authors:

Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma

Summary (translated from Russian)

The negative impact of undesirable niches in price-regulated markets strongly affects the efficiency of the market mechanism. One key approach to identifying such niches is to assess the degree of anomaly in price dynamics. The paper proposes a machine-learning-based method that detects anomalies in price data and isolates niches that can lead to market inefficiency. The main indicators used are price spread, rate of change, and variance. Applying the method to real data showed that it effectively isolates niches and determines their effect on market efficiency. The approach can be applied to monitoring market conditions, identifying suboptimal niches, and regulating prices.

Abstract

Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at https://github.com/lwy2020/MicroMix.
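To make the MXFP4 format named in the abstract more concrete, here is a minimal pure-Python sketch of block-wise "fake" quantization. It assumes the usual Microscaling conventions: 32-element blocks, one shared power-of-two scale per block, and FP4 E2M1 elements. It is not the MicroMix kernel and does not reproduce the paper's quantization thresholds, which the abstract does not define. The function names are hypothetical and chosen only for this illustration.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2     # exponent of the largest E2M1 binade (4.0 = 2**2)
BLOCK_SIZE = 32   # assumed number of elements sharing one scale


def quantize_block_mxfp4(block):
    """Quantize one block to MXFP4 and return the dequantized floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale: places amax in the top E2M1 binade.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties go to the smaller value
        # here for simplicity (hardware typically rounds ties to even).
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def quantize_mxfp4(values):
    """Apply block-wise MXFP4 fake quantization to a flat list."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block_mxfp4(values[i:i + BLOCK_SIZE]))
    return out


def mean_abs_error(values):
    """Mean absolute quantization error of a flat list under MXFP4."""
    quantized = quantize_mxfp4(values)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)


# A single large value in a block forces a coarse shared scale, so the
# small values in that block lose precision.
smooth = [0.1] * 32
with_outlier = [0.1] * 31 + [6.0]
print(mean_abs_error(smooth), mean_abs_error(with_outlier))
```

In this sketch the block containing an outlier has a much larger mean error than the smooth block, because the shared scale is set by the largest element. This is the kind of effect that motivates assigning some channels to a higher-precision format such as MXFP6 or MXFP8.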

