Retraining-free Merging of Sparse MoE via Hierarchical Clustering

Authors: I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
Researcher Affiliation | Collaboration | 1 Department of Computer Science, National Tsing Hua University, Taiwan; 2 NVIDIA AI Technology Center (NVAITC); 3 Department of Computer Science, University of Toronto, Canada; 4 Samsung Research America; 5 Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. Correspondence to: Chun-Yi Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1 HC-SMoE: Hierarchical Clustering for Sparse Mixture-of-Experts
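As a rough illustration of the hierarchical-clustering step named in Algorithm 1, the sketch below greedily merges experts by average-linkage distance over per-expert feature vectors and then averages the weights within each cluster. The feature choice and the plain (unweighted) averaging merge rule are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np


def cluster_experts(expert_feats, num_clusters):
    """Greedy average-linkage agglomerative clustering over expert features."""
    clusters = [[i] for i in range(len(expert_feats))]
    while len(clusters) > num_clusters:
        best_pair, best_dist = None, float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                d = np.mean([np.linalg.norm(expert_feats[i] - expert_feats[j])
                             for i in clusters[a] for j in clusters[b]])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters


def merge_experts(expert_weights, clusters):
    """Replace each cluster of experts by the mean of its members' weights."""
    return np.stack([expert_weights[c].mean(axis=0) for c in clusters])
```

In the actual method the per-expert features come from the model itself (e.g., expert outputs on calibration data); here any fixed-size vector per expert will do.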
Open Source Code | Yes | Our implementation is available at https://github.com/wazenmai/HC-SMoE.
Open Datasets | Yes | For a fair comparison with previous work, we follow standard evaluation protocols with clustering and merging on the C4 dataset (Raffel et al., 2020) and evaluate accuracy across eight zero-shot language tasks (Lu et al., 2024). Our extensive evaluations in the supplementary material further demonstrate HC-SMoE's effectiveness across diverse dataset domains and tasks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. To further validate the independence of HC-SMoE from the calibration dataset, we construct two additional datasets from MATH (Hendrycks et al., 2021b) and CodeQA (Liu & Wan, 2021).
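The calibration-set construction described above (32 sequences of 2,048 tokens sampled from C4) amounts to a slice-and-reshape over a concatenated token stream. The sketch below assumes an already-tokenized stream; in the paper the sampled C4 text would be tokenized with the model's own tokenizer first.

```python
import numpy as np


def build_calibration_set(token_stream, num_seqs=32, seq_len=2048):
    """Slice a concatenated token stream into num_seqs fixed-length sequences."""
    needed = num_seqs * seq_len
    tokens = np.asarray(token_stream)
    if tokens.size < needed:
        raise ValueError("not enough tokens for the requested calibration set")
    # Drop any trailing remainder and reshape into (num_seqs, seq_len).
    return tokens[:needed].reshape(num_seqs, seq_len)
```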
Dataset Splits | Yes | We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. ... Our experiments utilize a two-shot prompt format in Table 14. The evaluation protocol assesses whether the model outputs {A, B, C, D} in the subsequent three tokens. The experimental methodology employs the MedMCQA validation set for evaluation and its training set for calibration.
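The answer-checking rule quoted above (is one of {A, B, C, D} among the model's next three tokens?) can be sketched as a small helper. The token-level string interface is a simplifying assumption; a real harness would operate on decoded token ids.

```python
def extract_choice(next_tokens, choices=("A", "B", "C", "D")):
    """Return the first answer letter found among the next three tokens, else None."""
    for tok in next_tokens[:3]:
        letter = tok.strip()
        if letter in choices:
            return letter
    return None
```

A prediction is then counted correct when the extracted letter matches the gold choice.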
Hardware Specification | Yes | Experiments on Mixtral 8x7B and Qwen are conducted on eight NVIDIA A100 GPUs and four NVIDIA V100 GPUs, respectively.
Software Dependencies | No | The paper only mentions software by name (e.g., the EleutherAI Language Model Evaluation Harness) without specifying version numbers for any libraries or frameworks.
Experiment Setup | Yes | We conduct experiments on two SMoE models: Qwen1.5-MoE-A2.7B (henceforth Qwen) (Team, 2024) and Mixtral 8x7B (Jiang et al., 2024). For Qwen, we explore two levels of reduction: merging the number of experts from 60 to 45 and further to 30 per layer. This corresponds to a reduction in parameters from 14.3B to 11.2B (denoted as Qwen 45x2.7B), and subsequently to 8.1B (denoted as Qwen 30x2.7B). Similarly, Mixtral 8x7B undergoes reduction from eight to six experts and then to four experts per layer... To evaluate our method in a task-agnostic setting, we utilize eight tasks using the EleutherAI Language Model Evaluation Harness (Gao et al., 2024)... We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each.
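A quick arithmetic check on the reported Qwen sizes: both reductions (60 to 45 and 45 to 30 experts per layer) remove 15 experts per layer and, per the figures above, shave roughly the same 3.1B parameters, consistent with a roughly uniform per-expert size. The numbers below are only those quoted in the setup.

```python
# Experts per layer -> total parameter count in billions, as reported above.
sizes_b = {60: 14.3, 45: 11.2, 30: 8.1}

drop_60_to_45 = sizes_b[60] - sizes_b[45]  # params removed by first merge
drop_45_to_30 = sizes_b[45] - sizes_b[30]  # params removed by second merge

# Parameters freed per expert removed per layer (15 experts each step).
per_expert_step_b = drop_60_to_45 / 15
```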