Channel Merging: Preserving Specialization for Merged Experts
Authors: Mingyang Zhang, Jing Liu, Ganggui Ding, Linlin Ou, Xinyi Yu, Bohan Zhuang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that Channel Merging consistently delivers high performance, matching unmerged models in tasks like English and Chinese reasoning, mathematical reasoning, and code generation. Moreover, it obtains results comparable to a model ensemble with just 53% of the parameters when used with a task-specific router. |
| Researcher Affiliation | Academia | Mingyang Zhang¹, Jing Liu², Ganggui Ding¹, Linlin Ou³, Xinyi Yu³, Bohan Zhuang¹* — ¹Zhejiang University, ²Monash University, ³Zhejiang University of Technology |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations, such as in the 'Merging with Channel Similarity' section, but it does not present any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper references third-party tools like "Merge Kit (Goddard et al. 2024)" and "Open Compass toolbox (Contributors 2023)" used for evaluation and merging algorithms, but it does not explicitly state that the authors' own implementation code for Channel Merging is released or provide a link to a repository for their specific methodology. |
| Open Datasets | Yes | To evaluate the performance of merging, we report accuracy on several benchmarks across different domains: Common Sense QA (Talmor et al. 2019) and Trivia QA (Joshi et al. 2017) for the instruction, GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021b) for the mathematics, Human Eval (Chen et al. 2021) and MBPP (Austin et al. 2021) for the code, and CEval (Huang et al. 2024) and CMMLU (Li et al. 2023) for the Chinese. Besides, we evaluate the merged model with the task-specific router on several general task benchmarks: MMLU (Hendrycks et al. 2021a), CMMLU, and AGIEval (Zhong et al. 2024). |
| Dataset Splits | Yes | To train this router efficiently, we operate under the assumption that an expert will perform optimally on queries that originate from its fine-tuning dataset. To implement this, we sample a set of queries Q from the datasets of various tasks, using the originating task classes (e.g., code, math, instruction, Chinese) as the label Y. The optimization process for training the router is then defined as follows: $\mathcal{Z}^{*} = \arg\min_{\mathcal{Z}} \sum_{(q,y)\in(Q,Y)} -y\,\log(\mathcal{Z}(q, m))$ (6). We use the Open Compass toolbox (Contributors 2023) to evaluate all datasets. |
| Hardware Specification | Yes | The merging experiments can be done on only a single A100 GPU. |
| Software Dependencies | No | The paper mentions "Merge Kit (Goddard et al. 2024)" and "Open Compass toolbox (Contributors 2023)" as tools used, but it does not specify version numbers for these or other crucial software components like programming languages or libraries (e.g., Python, PyTorch, CUDA versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | For model merging, we cluster the expert weights into several groups. Subsequently, we use the commonly used model merging algorithms from Merge Kit (Goddard et al. 2024) to merge the parameters in the same group: (1) DARE-CM, we randomly prune 30% of the delta parameters for each expert before merging. (2) TIES-CM, we prune 30% of the delta parameters based on their magnitude and sign for each expert before merging. λ in Eq. (3) is set to 0.5. Unless otherwise specified, we define the number of groups as two. |
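The router objective in Eq. (6) is a standard cross-entropy minimization over task labels. As a minimal sketch, the following NumPy snippet trains a linear softmax router on synthetic query features; the feature construction, dimensions, and learning rate are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each query is a feature vector, labeled by its
# originating task class (0=code, 1=math, 2=instruction, 3=Chinese).
num_tasks, dim, n = 4, 16, 200
Y = rng.integers(0, num_tasks, size=n)
# Toy features: queries from a task cluster around a task-specific mean.
means = rng.normal(size=(num_tasks, dim))
Q = means[Y] + 0.1 * rng.normal(size=(n, dim))

W = np.zeros((dim, num_tasks))  # linear router parameters

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Minimize -sum y * log(Z(q)) (Eq. 6) by full-batch gradient descent.
onehot = np.eye(num_tasks)[Y]
for _ in range(500):
    P = softmax(Q @ W)
    grad = Q.T @ (P - onehot) / n
    W -= 0.5 * grad

pred = (Q @ W).argmax(axis=1)
accuracy = (pred == Y).mean()
```

On this easily separable toy data the router recovers the task labels almost perfectly; the paper's actual router operates on real queries sampled from the task datasets.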
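The DARE-CM setting above (randomly prune 30% of each expert's delta parameters before merging, with λ = 0.5) can be sketched as follows. This is an assumption-laden illustration of DARE-style merging within one group, not the authors' implementation: the function name `dare_merge`, the toy weight shapes, and the exact way λ scales the averaged delta are my own choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dare_merge(base, experts, drop_rate=0.3, lam=0.5):
    """DARE-style merge of experts assigned to the same group:
    randomly drop `drop_rate` of each expert's delta parameters,
    rescale the survivors by 1/(1 - drop_rate), average the deltas,
    then add the result back to the base weights scaled by lam."""
    deltas = []
    for w in experts:
        delta = w - base  # fine-tuned weights minus base weights
        keep = rng.random(delta.shape) >= drop_rate  # keep ~70% of entries
        deltas.append(keep * delta / (1.0 - drop_rate))  # unbiased rescale
    merged_delta = np.mean(deltas, axis=0)
    return base + lam * merged_delta

# Toy example: one 8x8 weight matrix, two experts in the same group.
base = rng.normal(size=(8, 8))
experts = [base + 0.01 * rng.normal(size=(8, 8)) for _ in range(2)]
merged = dare_merge(base, experts)
```

TIES-CM differs only in the pruning rule: deltas are kept by magnitude and merged after resolving sign conflicts, rather than dropped at random.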