Channel Merging: Preserving Specialization for Merged Experts
Authors: Mingyang Zhang, Jing Liu, Ganggui Ding, Linlin Ou, Xinyi Yu, Bohan Zhuang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that Channel Merging consistently delivers high performance, matching unmerged models in tasks like English and Chinese reasoning, mathematical reasoning, and code generation. Moreover, it obtains results comparable to a model ensemble with just 53% of the parameters when used with a task-specific router. |
| Researcher Affiliation | Academia | Mingyang Zhang¹, Jing Liu², Ganggui Ding¹, Linlin Ou³, Xinyi Yu³, Bohan Zhuang¹* — ¹Zhejiang University, ²Monash University, ³Zhejiang University of Technology |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations, such as in the 'Merging with Channel Similarity' section, but it does not present any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper references third-party tools like "Merge Kit (Goddard et al. 2024)" and "Open Compass toolbox (Contributors 2023)" used for evaluation and merging algorithms, but it does not explicitly state that the authors' own implementation code for Channel Merging is released or provide a link to a repository for their specific methodology. |
| Open Datasets | Yes | To evaluate the performance of merging, we report accuracy on several benchmarks across different domains: Common Sense QA (Talmor et al. 2019) and Trivia QA (Joshi et al. 2017) for the instruction, GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021b) for the mathematics, Human Eval (Chen et al. 2021) and MBPP (Austin et al. 2021) for the code, and CEval (Huang et al. 2024) and CMMLU (Li et al. 2023) for the Chinese. Besides, we evaluate the merged model with the task-specific router on several general task benchmarks: MMLU (Hendrycks et al. 2021a), CMMLU, and AGIEval (Zhong et al. 2024). |
| Dataset Splits | Yes | To train this router efficiently, we operate under the assumption that an expert will perform optimally on queries that originate from its fine-tuning dataset. To implement this, we sample a set of queries Q from the datasets of various tasks, using the originating task classes (e.g., code, math, instruction, Chinese) as the label Y. The optimization process for training the router is then defined as follows: $\mathcal{Z}^{*} = \arg\min_{\mathcal{Z}} \sum_{(q,y)\in(Q,Y)} -y\,\log(\mathcal{Z}(q, m))$ (6). We use the Open Compass toolbox (Contributors 2023) to evaluate all datasets. |
| Hardware Specification | Yes | The merging experiments can be done on only a single A100 GPU. |
| Software Dependencies | No | The paper mentions "Merge Kit (Goddard et al. 2024)" and "Open Compass toolbox (Contributors 2023)" as tools used, but it does not specify version numbers for these or other crucial software components like programming languages or libraries (e.g., Python, PyTorch, CUDA versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | For model merging, we cluster the expert weights into several groups. Subsequently, we use the commonly used model merging algorithms from Merge Kit (Goddard et al. 2024) to merge the parameters in the same group: (1) DARE-CM, we randomly prune 30% of the delta parameters for each expert before merging. (2) TIES-CM, we prune 30% of the delta parameters based on their magnitude and sign for each expert before merging. λ in Eq. (3) is set to 0.5. Unless otherwise specified, we define the number of groups as two. |
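The router objective in Eq. (6) is a standard cross-entropy minimization over task labels. As a minimal sketch, the following NumPy snippet trains a linear softmax router on synthetic query features; the feature construction, dimensions, and learning rate are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each query is a feature vector, labeled by its
# originating task class (0=code, 1=math, 2=instruction, 3=Chinese).
num_tasks, dim, n = 4, 16, 200
Y = rng.integers(0, num_tasks, size=n)
# Toy features: queries from a task cluster around a task-specific mean.
means = rng.normal(size=(num_tasks, dim))
Q = means[Y] + 0.1 * rng.normal(size=(n, dim))

W = np.zeros((dim, num_tasks))  # linear router parameters

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Minimize -sum y * log(Z(q)) (Eq. 6) by full-batch gradient descent.
onehot = np.eye(num_tasks)[Y]
for _ in range(500):
    P = softmax(Q @ W)
    grad = Q.T @ (P - onehot) / n
    W -= 0.5 * grad

pred = (Q @ W).argmax(axis=1)
accuracy = (pred == Y).mean()
```

On this easily separable toy data the router recovers the task labels almost perfectly; the paper's actual router operates on real queries sampled from the task datasets.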
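The DARE-CM setting above (randomly prune 30% of each expert's delta parameters before merging, with λ = 0.5) can be sketched as follows. This is an assumption-laden illustration of DARE-style merging within one group, not the authors' implementation: the function name `dare_merge`, the toy weight shapes, and the exact way λ scales the averaged delta are my own choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dare_merge(base, experts, drop_rate=0.3, lam=0.5):
    """DARE-style merge of experts assigned to the same group:
    randomly drop `drop_rate` of each expert's delta parameters,
    rescale the survivors by 1/(1 - drop_rate), average the deltas,
    then add the result back to the base weights scaled by lam."""
    deltas = []
    for w in experts:
        delta = w - base  # fine-tuned weights minus base weights
        keep = rng.random(delta.shape) >= drop_rate  # keep ~70% of entries
        deltas.append(keep * delta / (1.0 - drop_rate))  # unbiased rescale
    merged_delta = np.mean(deltas, axis=0)
    return base + lam * merged_delta

# Toy example: one 8x8 weight matrix, two experts in the same group.
base = rng.normal(size=(8, 8))
experts = [base + 0.01 * rng.normal(size=(8, 8)) for _ in range(2)]
merged = dare_merge(base, experts)
```

TIES-CM differs only in the pruning rule: deltas are kept by magnitude and merged after resolving sign conflicts, rather than dropped at random.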