Retraining-free Merging of Sparse MoE via Hierarchical Clustering

Authors: I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
Researcher Affiliation | Collaboration | 1 Department of Computer Science, National Tsing Hua University, Taiwan; 2 NVIDIA AI Technology Center (NVAITC); 3 Department of Computer Science, University of Toronto, Canada; 4 Samsung Research America; 5 Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. Correspondence to: Chun-Yi Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1 HC-SMoE: Hierarchical Clustering for Sparse Mixture-of-Experts
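As a rough illustration of the hierarchical-clustering step named in Algorithm 1, the sketch below greedily merges experts by average-linkage distance over per-expert feature vectors and then averages the weights within each cluster. The feature choice and the plain (unweighted) averaging merge rule are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np


def cluster_experts(expert_feats, num_clusters):
    """Greedy average-linkage agglomerative clustering over expert features."""
    clusters = [[i] for i in range(len(expert_feats))]
    while len(clusters) > num_clusters:
        best_pair, best_dist = None, float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                d = np.mean([np.linalg.norm(expert_feats[i] - expert_feats[j])
                             for i in clusters[a] for j in clusters[b]])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters


def merge_experts(expert_weights, clusters):
    """Replace each cluster of experts by the mean of its members' weights."""
    return np.stack([expert_weights[c].mean(axis=0) for c in clusters])
```

In the actual method the per-expert features come from the model itself (e.g., expert outputs on calibration data); here any fixed-size vector per expert will do.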
Open Source Code | Yes | Our implementation is available at https://github.com/wazenmai/HC-SMoE.
Open Datasets | Yes | For a fair comparison with previous work, we follow standard evaluation protocols with clustering and merging on the C4 dataset (Raffel et al., 2020) and evaluate accuracy across eight zero-shot language tasks (Lu et al., 2024). Our extensive evaluations in the supplementary material further demonstrate HC-SMoE's effectiveness across diverse dataset domains and tasks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. To further validate the independence of HC-SMoE from the calibration dataset, we construct two additional datasets from MATH (Hendrycks et al., 2021b) and CodeQA (Liu & Wan, 2021).
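The calibration-set construction described above (32 sequences of 2,048 tokens sampled from C4) amounts to a slice-and-reshape over a concatenated token stream. The sketch below assumes an already-tokenized stream; in the paper the sampled C4 text would be tokenized with the model's own tokenizer first.

```python
import numpy as np


def build_calibration_set(token_stream, num_seqs=32, seq_len=2048):
    """Slice a concatenated token stream into num_seqs fixed-length sequences."""
    needed = num_seqs * seq_len
    tokens = np.asarray(token_stream)
    if tokens.size < needed:
        raise ValueError("not enough tokens for the requested calibration set")
    # Drop any trailing remainder and reshape into (num_seqs, seq_len).
    return tokens[:needed].reshape(num_seqs, seq_len)
```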
Dataset Splits | Yes | We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each. ... Our experiments utilize a two-shot prompt format in Table 14. The evaluation protocol assesses whether the model outputs {A, B, C, D} in the subsequent three tokens. The experimental methodology employs the MedMCQA validation set for evaluation and its training set for calibration.
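The answer-checking rule quoted above (is one of {A, B, C, D} among the model's next three tokens?) can be sketched as a small helper. The token-level string interface is a simplifying assumption; a real harness would operate on decoded token ids.

```python
def extract_choice(next_tokens, choices=("A", "B", "C", "D")):
    """Return the first answer letter found among the next three tokens, else None."""
    for tok in next_tokens[:3]:
        letter = tok.strip()
        if letter in choices:
            return letter
    return None
```

A prediction is then counted correct when the extracted letter matches the gold choice.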
Hardware Specification | Yes | Experiments on Mixtral 8x7B and Qwen are conducted on eight NVIDIA A100 GPUs and four NVIDIA V100 GPUs, respectively.
Software Dependencies | No | The paper only mentions software by name (e.g., the EleutherAI Language Model Evaluation Harness) without specifying version numbers for any libraries or frameworks.
Experiment Setup | Yes | We conduct experiments on two SMoE models: Qwen1.5-MoE-A2.7B (henceforth Qwen) (Team, 2024) and Mixtral 8x7B (Jiang et al., 2024). For Qwen, we explore two levels of reduction: merging the number of experts from 60 to 45 and further to 30 per layer. This corresponds to a reduction in parameters from 14.3B to 11.2B (denoted as Qwen 45x2.7B), and subsequently to 8.1B (denoted as Qwen 30x2.7B). Similarly, Mixtral 8x7B undergoes reduction from eight to six experts and then to four experts per layer... To evaluate our method in a task-agnostic setting, we utilize eight tasks using the EleutherAI Language Model Evaluation Harness (Gao et al., 2024)... We report zero-shot accuracy on those benchmarks. ... All baselines and HC-SMoE require a calibration dataset to estimate input statistics. This dataset is constructed by sampling from the C4 corpus (Raffel et al., 2020), concatenating extracted text into 32 sequences of 2,048 tokens each.
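A quick arithmetic check on the reported Qwen sizes: both reductions (60 to 45 and 45 to 30 experts per layer) remove 15 experts per layer and, per the figures above, shave roughly the same 3.1B parameters, consistent with a roughly uniform per-expert size. The numbers below are only those quoted in the setup.

```python
# Experts per layer -> total parameter count in billions, as reported above.
sizes_b = {60: 14.3, 45: 11.2, 30: 8.1}

drop_60_to_45 = sizes_b[60] - sizes_b[45]  # params removed by first merge
drop_45_to_30 = sizes_b[45] - sizes_b[30]  # params removed by second merge

# Parameters freed per expert removed per layer (15 experts each step).
per_expert_step_b = drop_60_to_45 / 15
```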