CAMEx: Curvature-aware Merging of Experts

Authors: Dung Viet Nguyen, Minh Nguyen, Luc Nguyen, Rachel Teo, Tan Nguyen, Duy Linh Tran

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method. ... We empirically demonstrate that 1) our proposed merging method can accelerate convergence during pre-training and 2) when combined with other merging protocols, it can boost the model's performance on a variety of practical tasks, including language modeling, text classification, question answering, and image classification.
Researcher Affiliation Collaboration 1Faculty of Mathematics and Informatics, Hanoi University of Science and Technology 2Viettel AI, Viettel Group 3Department of Mathematics, National University of Singapore
Pseudocode Yes Algorithm 1 The Overall Procedures of CAMEx.
Open Source Code Yes The code is publicly available at: https://github.com/kpup1710/CAMEx.
Open Datasets Yes For language modeling, we use the WikiText-2 and WikiText-103 (Merity et al., 2016) benchmarks. For text classification, we employ a subset of the GLUE (Wang et al., 2019) benchmark... For question answering, we employ two famous benchmarks: SQuAD (Rajpurkar et al., 2016) and WikiQA (Yang et al., 2015). Finally, the ImageNet-1k (Deng et al., 2009) dataset is chosen for image classification evaluation.
Dataset Splits Yes The WikiText-103 dataset consists of Wikipedia articles designed to capture long-range contextual dependencies. The training set includes approximately 28,000 articles, totaling around 103 million words. The validation and test sets contain approximately 218,000 and 246,000 words, respectively, spread across 60 articles per set.
Hardware Specification Yes We choose AdamW (Loshchilov & Hutter, 2019) as the default optimizer and conduct all experiments on NVIDIA A100 GPUs.
Software Dependencies No The paper mentions using AdamW as the optimizer but does not specify versions for other key software components like Python, PyTorch, or CUDA.
Experiment Setup Yes This encompasses batch sizes from {8, 16, 32, 64} and learning rates from {3e-4, 1e-4, 3e-5, 1e-5} to pinpoint the optimal fine-tuned models. Regarding image classification tasks, a batch size of 96 was chosen for all models. ... Table 8: Fine-tuning hyper-parameters of all models in Section 3 — Optimizer: AdamW; Adam ϵ: 1e-6; Adam β: (0.9, 0.98); Warm-up steps: 16; Weight decay: 0.01; LR scheduler: linear decay; Scaling factor α: 1; Kronecker rank r: 1.
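The Table 8 hyper-parameters and the reported search grids can be collected into a small config sketch. This is a minimal illustration only: the variable names, and the exact warm-up/decay shape of the "linear decay" scheduler, are assumptions rather than details taken from the released CAMEx code.

```python
# Fine-tuning hyper-parameters from Table 8 of the paper.
# Names are illustrative; they do not come from the official CAMEx repo.
CONFIG = {
    "optimizer": "AdamW",
    "adam_eps": 1e-6,
    "adam_betas": (0.9, 0.98),
    "warmup_steps": 16,
    "weight_decay": 0.01,
    "lr_scheduler": "linear_decay",
    "scaling_factor_alpha": 1,
    "kronecker_rank_r": 1,
    # Search grids reported for fine-tuning:
    "batch_sizes": [8, 16, 32, 64],
    "learning_rates": [3e-4, 1e-4, 3e-5, 1e-5],
}


def linear_decay_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up followed by linear decay to zero.

    One common reading of a "LINEAR DECAY" scheduler with warm-up;
    the paper does not spell out the exact schedule, so this shape
    is an assumption.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)
```

For example, with `base_lr=3e-4`, `warmup_steps=16`, and `total_steps=100`, the rate ramps from 0 up to 3e-4 at step 16, then decays linearly back to 0 at step 100.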