CAMEx: Curvature-aware Merging of Experts

Authors: Dung Viet Nguyen, Minh Nguyen, Luc Nguyen, Rachel Teo, Tan Nguyen, Duy Linh Tran

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method. ... We empirically demonstrate that 1) our proposed merging method can accelerate convergence during pre-training and 2) when combined with other merging protocols, it can boost the model's performance on a variety of practical tasks, including language modeling, text classification, question answering, and image classification.
Researcher Affiliation Collaboration 1Faculty of Mathematics and Informatics, Hanoi University of Science and Technology 2Viettel AI, Viettel Group 3Department of Mathematics, National University of Singapore
Pseudocode Yes Algorithm 1 The Overall Procedures of CAMEx.
Open Source Code Yes The code is publicly available at: https://github.com/kpup1710/CAMEx.
Open Datasets Yes For language modeling, we use the WikiText-2 and WikiText-103 (Merity et al., 2016) benchmarks. For text classification, we employ a subset of the GLUE (Wang et al., 2019) benchmark... For question answering, we employ two famous benchmarks: SQuAD (Rajpurkar et al., 2016) and WikiQA (Yang et al., 2015). Finally, the ImageNet-1k (Deng et al., 2009) dataset is chosen for image classification evaluation.
Dataset Splits Yes The WikiText-103 dataset consists of Wikipedia articles designed to capture long-range contextual dependencies. The training set includes approximately 28,000 articles, totaling around 103 million words. The validation and test sets contain approximately 218,000 and 246,000 words, respectively, spread across 60 articles per set.
Hardware Specification Yes We choose AdamW (Loshchilov & Hutter, 2019) as the default optimizer and conduct all experiments on NVIDIA A100 GPUs.
Software Dependencies No The paper mentions using AdamW as the optimizer but does not specify versions for other key software components like Python, PyTorch, or CUDA.
Experiment Setup Yes This encompasses batch sizes from {8, 16, 32, 64} and learning rates from {3e-4, 1e-4, 3e-5, 1e-5} to pinpoint the optimal fine-tuned models. Regarding image classification tasks, a batch size of 96 was chosen for all models. ... Table 8: Fine-tuning hyper-parameters of all models in Section 3 — Optimizer: AdamW; Adam ϵ: 1e-6; Adam β: (0.9, 0.98); Warm-up steps: 16; Weight decay: 0.01; LR scheduler: linear decay; Scaling factor α: 1; Kronecker rank r: 1.
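The Table 8 hyper-parameters and the reported search grids can be collected into a small config sketch. This is a minimal illustration only: the variable names, and the exact warm-up/decay shape of the "linear decay" scheduler, are assumptions rather than details taken from the released CAMEx code.

```python
# Fine-tuning hyper-parameters from Table 8 of the paper.
# Names are illustrative; they do not come from the official CAMEx repo.
CONFIG = {
    "optimizer": "AdamW",
    "adam_eps": 1e-6,
    "adam_betas": (0.9, 0.98),
    "warmup_steps": 16,
    "weight_decay": 0.01,
    "lr_scheduler": "linear_decay",
    "scaling_factor_alpha": 1,
    "kronecker_rank_r": 1,
    # Search grids reported for fine-tuning:
    "batch_sizes": [8, 16, 32, 64],
    "learning_rates": [3e-4, 1e-4, 3e-5, 1e-5],
}


def linear_decay_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up followed by linear decay to zero.

    One common reading of a "LINEAR DECAY" scheduler with warm-up;
    the paper does not spell out the exact schedule, so this shape
    is an assumption.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)
```

For example, with `base_lr=3e-4`, `warmup_steps=16`, and `total_steps=100`, the rate ramps from 0 up to 3e-4 at step 16, then decays linearly back to 0 at step 100.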