MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4× speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Peking University, 4ByteDance Seed, 5The Chinese University of Hong Kong, 6CPII under InnoHK. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulas, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/cat538/MxMoE. |
| Open Datasets | Yes | For all experiments, we use 128 sequences, each of length 4096, drawn from the Wikitext2 training set (Merity et al., 2016). This calibration process typically takes from several minutes to a few hours depending on the model size. |
| Dataset Splits | No | The paper mentions using 128 sequences from the WikiText-2 training set for calibration and randomly sampling sequences from WikiText-2 for performance analysis, but it does not provide specific training/validation/test splits for the models or benchmark tasks evaluated. |
| Hardware Specification | Yes | Experiments were conducted on an NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions several software components like CUTLASS, HQQ, VLLM-Marlin-MoE, Marlin, GPTQ, and CUDA, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | For weight-only quantization, the paper tests 3-bit and 2-bit quantization, comparing with GPTQ configured with group size 128 and asymmetric min-max quantization, where the scale and zero-point are stored in 16-bit format, resulting in average bitwidths of 3.25 and 2.25, respectively. ... MxMoE uses r = 1, as extremely low bitwidth implies a resource-constrained environment... MxMoE uses r = 0.75. |
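The average-bitwidth figures in the setup row (3.25 and 2.25 bits) follow directly from the quantization configuration: with group size 128 and a 16-bit scale plus 16-bit zero-point per group, the metadata adds 2 × 16 / 128 = 0.25 bits per weight. The sketch below illustrates this accounting together with a minimal group-wise asymmetric min-max quantizer; it is an illustrative reconstruction of the described setup, not the paper's actual implementation, and the function names are hypothetical.

```python
import numpy as np

def avg_bitwidth(bits, group_size, meta_bits=16):
    """Effective bits per weight: payload bits plus per-group
    scale and zero-point metadata (meta_bits each), amortized
    over group_size weights."""
    return bits + 2 * meta_bits / group_size

def minmax_quantize(w, bits=3, group_size=128):
    """Group-wise asymmetric min-max quantization (sketch).
    Each group of `group_size` weights gets its own scale and
    zero-point derived from the group's min/max; returns the
    dequantized (reconstructed) weights."""
    w = np.asarray(w, dtype=np.float64).reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    qmax = 2**bits - 1
    # Guard against all-constant groups (zero range).
    scale = np.maximum((w_max - w_min) / qmax, 1e-12)
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return (q * scale + w_min).reshape(-1)

# Average bitwidths quoted in the table for 3-bit and 2-bit weights:
print(avg_bitwidth(3, 128))  # 3.25
print(avg_bitwidth(2, 128))  # 2.25
```

Under this accounting, halving the group size to 64 would raise the metadata overhead to 0.5 bits per weight, which is why the reported averages are tied to the group-size-128 configuration.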