Mixture Compressor for Mixture-of-Experts LLMs Gains More

Authors: Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss on eight commonsense benchmarks. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%. Remarkably, MC even surpasses floating-point 13B dense LLMs with significantly smaller parameter sizes, suggesting that mixture compression in MoE-LLMs has the potential to outperform both comparable and larger dense LLMs.
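As a rough sanity check on the reported ratio: a 2.54-bit average against 16-bit baseline weights implies about 84% compression of the expert weights alone. The paper's 76.6% whole-model figure is plausibly lower because it also covers unquantized non-expert weights and quantization metadata, which the toy arithmetic below deliberately ignores (a sketch, not the authors' accounting):

```python
# Rough arithmetic behind the reported 2.54-bit setting: average expert
# bit-width vs. a 16-bit baseline. The 76.6% figure in the report is for
# the whole model (non-expert weights, scales/zero-points included),
# which this toy calculation intentionally leaves out.
baseline_bits = 16.0
avg_expert_bits = 2.54
expert_only_compression = 1.0 - avg_expert_bits / baseline_bits
print(f"expert-weight compression: {expert_only_compression:.1%}")
```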
Researcher Affiliation | Academia | 1 The University of Hong Kong; 2 The Chinese University of Hong Kong; 3 Beihang University; 4 Centre for Perceptual and Interactive Intelligence, Hong Kong
Pseudocode | No | The paper describes methods like Pre-Loading Mixed-Precision Quantization (PMQ) and Online Dynamic Pruning (ODP) but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/Aaronhuang-778/MC-MoE.
Open Datasets | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. In the performance experiments for the proposed MC, perplexity (PPL) was chosen as the metric to evaluate token prediction capabilities, primarily deploying the general text dataset WikiText2. To comprehensively assess the language capabilities of the compressed LLMs, we evaluated the models' overall abilities on eight zero-shot benchmarks tested by the EleutherAI LM Harness (Gao et al., 2013).
Dataset Splits | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. To comprehensively assess the language capabilities of the compressed LLMs, we evaluated the models' overall abilities on eight zero-shot benchmarks tested by the EleutherAI LM Harness (Gao et al., 2013).
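For context, the perplexity metric mentioned above reduces to the exponential of the mean token-level negative log-likelihood. A minimal illustration with made-up next-token probabilities (stand-ins, not model outputs):

```python
import math

# Perplexity (PPL) = exp(mean negative log-likelihood over tokens).
# The probabilities below are invented stand-ins for a model's
# next-token predictions on an evaluation sequence.
token_probs = [0.25, 0.10, 0.50, 0.05]
nll = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(nll) / len(nll))  # lower is better
```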
Hardware Specification | Yes | Mixtral 8×7B can be compressed on two NVIDIA A100-80GB GPUs, while Mixtral 8×22B is completed on four NVIDIA A100-80GB GPUs. The 16-bit Mixtral 8×7B uses two A100-80GB GPUs and Mixtral 8×22B uses four; quantized models are tested on one A100-80GB GPU. The 2.54-bit MC-compressed Mixtral 8×7B runs on a single 3090.
Software Dependencies | No | After determining the bit-width configuration, the final quantization process follows the GPTQ (Frantar et al., 2022) procedure. We utilize the HQQ (Badri & Shaji, 2024) tool to save quantized weights and handle dequantization. While specific tools are mentioned, version numbers are not provided.
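To make the store-quantized/dequantize-on-load pattern concrete, here is a minimal round-to-nearest asymmetric quantizer. This sketch only illustrates the general idea of storing low-bit integers plus a scale and zero-point; it is not the GPTQ or HQQ algorithm, and all names are illustrative:

```python
# Minimal asymmetric round-to-nearest quantization sketch (illustrative
# only; GPTQ/HQQ use more sophisticated error-correcting procedures).
def quantize(weights, bits):
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo  # stored: small ints + per-group scale/zero-point

def dequantize(q, scale, zero):
    # Reconstruct approximate weights at load time.
    return [v * scale + zero for v in q]

w = [-0.12, 0.07, 0.31, -0.05]
q, s, z = quantize(w, bits=3)     # 3-bit codes in [0, 7]
w_hat = dequantize(q, s, z)
```

The reconstruction error of each weight is bounded by half the quantization step `s`, which is the trade-off that mixed-precision bit allocation tries to manage per expert.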
Experiment Setup | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. Our goal is to ensure that the extremely-low average bit-width across all experts in a MoE block equals a targeted value k, with bit-width options restricted to {1, 2, 3}-bit. The pruning process follows: {ŵ0 = 1, ŵ1 = 0 | w0, w1 ∈ Top-2{G(t)}, w1/w0 < µ} (5), where w0 and w1 denote the gating weights of the top-2 experts, respectively, with µ serving as a hyperparameter threshold for each MoE layer. This threshold is set at the median value of w1/w0 derived from calibration data (Lu et al., 2024).
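The Online Dynamic Pruning rule in Eq. (5) can be sketched as follows. This is a hedged reconstruction from the quoted description, not the authors' implementation: take the top-2 gating weights w0 ≥ w1, and when the runner-up is weak relative to the leader (w1/w0 below the per-layer threshold µ), drop it and route the token to the top expert alone; otherwise keep both, renormalized:

```python
# Sketch of an Eq.-(5)-style dynamic pruning rule (illustrative names).
# mu would be set per MoE layer to the median of w1/w0 on calibration data.
def dynamic_prune(gate_scores, mu):
    order = sorted(range(len(gate_scores)),
                   key=gate_scores.__getitem__, reverse=True)
    i0, i1 = order[0], order[1]          # top-1 and top-2 expert indices
    w0, w1 = gate_scores[i0], gate_scores[i1]
    if w1 / w0 < mu:                     # runner-up negligible: prune it
        return [(i0, 1.0)]
    s = w0 + w1                          # keep both, renormalized weights
    return [(i0, w0 / s), (i1, w1 / s)]
```

For example, with gating scores [0.05, 0.6, 0.1, 0.25] and µ = 0.5, the ratio 0.25/0.6 ≈ 0.42 falls below the threshold, so only expert 1 is activated; this is the mechanism behind the reported ~15% reduction in activated parameters.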