Mixture Compressor for Mixture-of-Experts LLMs Gains More

Authors: Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss on eight commonsense benchmarks. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%. Remarkably, MC even surpasses floating-point 13B dense LLMs with significantly smaller parameter sizes, suggesting that mixture compression in MoE-LLMs has the potential to outperform both comparable and larger dense LLMs.
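As a rough sanity check on the reported ratio: a 2.54-bit average against 16-bit baseline weights implies about 84% compression of the expert weights alone. The paper's 76.6% whole-model figure is plausibly lower because it also covers unquantized non-expert weights and quantization metadata, which the toy arithmetic below deliberately ignores (a sketch, not the authors' accounting):

```python
# Rough arithmetic behind the reported 2.54-bit setting: average expert
# bit-width vs. a 16-bit baseline. The 76.6% figure in the report is for
# the whole model (non-expert weights, scales/zero-points included),
# which this toy calculation intentionally leaves out.
baseline_bits = 16.0
avg_expert_bits = 2.54
expert_only_compression = 1.0 - avg_expert_bits / baseline_bits
print(f"expert-weight compression: {expert_only_compression:.1%}")
```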
Researcher Affiliation | Academia | 1 The University of Hong Kong; 2 The Chinese University of Hong Kong; 3 Beihang University; 4 Centre for Perceptual and Interactive Intelligence, Hong Kong
Pseudocode | No | The paper describes methods like Pre-Loading Mixed-Precision Quantization (PMQ) and Online Dynamic Pruning (ODP) but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/Aaronhuang-778/MC-MoE.
Open Datasets | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. In the performance experiments for the proposed MC, perplexity (PPL) was chosen as the metric to evaluate token prediction capabilities, primarily deploying the general text dataset WikiText2. To comprehensively assess the language capabilities of the compressed LLMs, we evaluated the models' overall abilities on eight zero-shot benchmarks tested by the EleutherAI LM Harness (Gao et al., 2013).
Dataset Splits | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. To comprehensively assess the language capabilities of the compressed LLMs, we evaluated the models' overall abilities on eight zero-shot benchmarks tested by the EleutherAI LM Harness (Gao et al., 2013).
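For context, the perplexity metric mentioned above reduces to the exponential of the mean token-level negative log-likelihood. A minimal illustration with made-up next-token probabilities (stand-ins, not model outputs):

```python
import math

# Perplexity (PPL) = exp(mean negative log-likelihood over tokens).
# The probabilities below are invented stand-ins for a model's
# next-token predictions on an evaluation sequence.
token_probs = [0.25, 0.10, 0.50, 0.05]
nll = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(nll) / len(nll))  # lower is better
```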
Hardware Specification | Yes | Mixtral 8×7B can be compressed on two NVIDIA A100-80GB GPUs, while Mixtral 8×22B is completed on four NVIDIA A100-80GB GPUs. The 16-bit Mixtral 8×7B uses two A100-80GB GPUs and Mixtral 8×22B uses four; quantized models are tested on one A100-80GB GPU. The 2.54-bit MC-compressed Mixtral 8×7B runs on a single 3090.
Software Dependencies | No | After determining the bit-width configuration, the final quantization process follows the GPTQ (Frantar et al., 2022) procedure. We utilize the HQQ (Badri & Shaji, 2024) tool to save quantized weights and handle dequantization. While specific tools are mentioned, version numbers are not provided.
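To make the store-quantized/dequantize-on-load pattern concrete, here is a minimal round-to-nearest asymmetric quantizer. This sketch only illustrates the general idea of storing low-bit integers plus a scale and zero-point; it is not the GPTQ or HQQ algorithm, and all names are illustrative:

```python
# Minimal asymmetric round-to-nearest quantization sketch (illustrative
# only; GPTQ/HQQ use more sophisticated error-correcting procedures).
def quantize(weights, bits):
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo  # stored: small ints + per-group scale/zero-point

def dequantize(q, scale, zero):
    # Reconstruct approximate weights at load time.
    return [v * scale + zero for v in q]

w = [-0.12, 0.07, 0.31, -0.05]
q, s, z = quantize(w, bits=3)     # 3-bit codes in [0, 7]
w_hat = dequantize(q, s, z)
```

The reconstruction error of each weight is bounded by half the quantization step `s`, which is the trade-off that mixed-precision bit allocation tries to manage per expert.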
Experiment Setup | Yes | The mixed-precision factors of experts are calibrated from the C4 (Raffel et al., 2020) dataset, with 128 sets of random sequences, each 2048 tokens long. Our goal is to ensure that the extremely-low average bit-width across all experts in a MoE block equals a targeted value k, with bit-width options restricted to {1, 2, 3}-bit. The pruning process follows: {ŵ0 = 1, ŵ1 = 0 | w0, w1 ∈ Top-2{G(t)}, w1/w0 < µ} (5), where w0 and w1 denote the gating weights of the top-2 experts, respectively, with µ serving as a hyperparameter threshold for each MoE layer. This threshold is set at the median value of w1/w0 derived from calibration data (Lu et al., 2024).
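The Online Dynamic Pruning rule in Eq. (5) can be sketched as follows. This is a hedged reconstruction from the quoted description, not the authors' implementation: take the top-2 gating weights w0 ≥ w1, and when the runner-up is weak relative to the leader (w1/w0 below the per-layer threshold µ), drop it and route the token to the top expert alone; otherwise keep both, renormalized:

```python
# Sketch of an Eq.-(5)-style dynamic pruning rule (illustrative names).
# mu would be set per MoE layer to the median of w1/w0 on calibration data.
def dynamic_prune(gate_scores, mu):
    order = sorted(range(len(gate_scores)),
                   key=gate_scores.__getitem__, reverse=True)
    i0, i1 = order[0], order[1]          # top-1 and top-2 expert indices
    w0, w1 = gate_scores[i0], gate_scores[i1]
    if w1 / w0 < mu:                     # runner-up negligible: prune it
        return [(i0, 1.0)]
    s = w0 + w1                          # keep both, renormalized weights
    return [(i0, w0 / s), (i1, w1 / s)]
```

For example, with gating scores [0.05, 0.6, 0.1, 0.25] and µ = 0.5, the ratio 0.25/0.6 ≈ 0.42 falls below the threshold, so only expert 1 is activated; this is the mechanism behind the reported ~15% reduction in activated parameters.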