Retraining-free Merging of Sparse MoE via Hierarchical Clustering
Authors: I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, National Tsing Hua University, Taiwan 2NVIDIA AI Technology Center (NVAITC) 3Department of Computer Science, University of Toronto, Canada 4Samsung Research America 5Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. Correspondence to: Chun-Yi Lee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 HC-SMoE: Hierarchical Clustering for Sparse Mixture-of-Experts |
| Open Source Code | Yes | Our implementation is available at https://github.com/wazenmai/HC-SMoE. |
| Open Datasets | Yes | For a fair comparison with previous work, we follow standard evaluation protocols with clustering and merging on the C4 dataset (Raffel et al., 2020) and evaluate accuracy across eight zero-shot language tasks (Lu et al., 2024). Our extensive evaluations in the supplementary material further demonstrate HC-SMoE's effectiveness across diverse dataset domains and tasks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. To further validate the independence of HC-SMoE from the calibration dataset, we construct two additional datasets from MATH (Hendrycks et al., 2021b) and CodeQA (Liu & Wan, 2021). |
| Dataset Splits | Yes | We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. ... Our experiments utilize a two-shot prompt format in Table 14. The evaluation protocol assesses whether the model outputs {A, B, C, D} in the subsequent three tokens. The experimental methodology employs the MedMCQA validation set for evaluation and its training set for calibration. |
| Hardware Specification | Yes | Experiments on Mixtral 8x7B and Qwen are conducted on eight NVIDIA A100 GPUs and four NVIDIA V100 GPUs, respectively. |
| Software Dependencies | No | The paper only mentions software by name (e.g., EleutherAI Language Model Evaluation Harness) without specifying version numbers for any libraries or frameworks. |
| Experiment Setup | Yes | We conduct experiments on two SMoE models: Qwen1.5-MoE-A2.7B (henceforth Qwen) (Team, 2024) and Mixtral 8x7B (Jiang et al., 2024). For Qwen, we explore two levels of reduction: merging the number of experts from 60 to 45 and further to 30 per layer. This corresponds to a reduction in parameters from 14.3B to 11.2B (denoted as Qwen 45x2.7B), and subsequently to 8.1B (denoted as Qwen 30x2.7B). Similarly, Mixtral 8x7B undergoes reduction from eight to six experts and then to four experts per layer... To evaluate our method in a task-agnostic setting, we utilize eight tasks using the EleutherAI Language Model Evaluation Harness (Gao et al., 2024)... We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. |
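The table above describes HC-SMoE merging experts per layer via hierarchical clustering (e.g., reducing Qwen from 60 to 30 experts). As a rough illustration only, not the authors' implementation, the sketch below agglomeratively clusters toy expert weight vectors (centroid distance is a simplifying assumption; the paper's similarity measure and merge rule may differ) and merges each cluster by uniform averaging:

```python
import numpy as np

def hierarchical_merge(experts: np.ndarray, target: int) -> np.ndarray:
    """Greedy agglomerative clustering of expert weight vectors, then
    merge each cluster by averaging. Illustrative only: the distance
    metric (centroid Euclidean) and uniform-average merge are assumptions."""
    clusters = [[i] for i in range(len(experts))]
    while len(clusters) > target:
        # Centroid of every current cluster.
        cents = np.stack([experts[c].mean(axis=0) for c in clusters])
        # Find and fuse the closest pair of clusters.
        best, pair = np.inf, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(cents[i] - cents[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Retraining-free merge: each cluster becomes one averaged expert.
    return np.stack([experts[c].mean(axis=0) for c in clusters])

rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 16))          # 8 toy experts, 16-dim flattened weights
merged = hierarchical_merge(experts, target=4)
print(merged.shape)                         # (4, 16)
```

In the real setting each "expert" would be the flattened FFN weights of one MoE expert, and the router would be remapped so tokens routed to a merged-away expert hit its cluster representative instead.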