Ensembles of Low-Rank Expert Adapters

Authors: Yinghao Li, Vianne Gao, Chao Zhang, MohamadAli Torkamani

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.
Researcher Affiliation | Industry | 1Amazon Web Services 2Amazon.com
Pseudocode | No | The pipeline, shown in Figure 1, consists of three main steps: 1) full-data adapter tuning, 2) gradient calculation, and 3) clustering and per-cluster fine-tuning. During inference, we estimate the similarity between the gradient of test instructions and the cluster instances to determine the influence of each cluster on the final prediction.
Open Source Code | No | Available at https://github.com/hendrycks/math. (This refers to third-party code for parsing results and assessing accuracy, not the source code for the ELREA methodology presented in the paper.)
Open Datasets | Yes | For the first category, following Xia et al. (2024), we employ Flan V2 (Longpre et al., 2023), CoT (Wei et al., 2022), Dolly-15k (Conover et al., 2023), and Open Assistant Conversations (Köpf et al., 2023) for fine-tuning, and MMLU (Hendrycks et al., 2021a) and BIG-bench Hard (BBH; bench authors, 2023; Suzgun et al., 2023) to test model performance. In practice, we directly use the pre-processed dataset provided by Xia et al. (2024), which consolidates these datasets into a unified format suitable for fine-tuning. (Footnote: Available at https://huggingface.co/datasets/princeton-nlp/less_data.)
Dataset Splits | Yes | Training and test instances for the mathematical reasoning datasets: MATH (Hendrycks et al., 2021b): 7,500 / 1,000; GSM8k (Cobbe et al., 2021): 7,441 / 1,000; SVAMP (Patel et al., 2021): 677 / 280; MathQA (Amini et al., 2019): 26,287 / 998. The text also explicitly distinguishes between fine-tuning and test datasets.
Hardware Specification | Yes | Most fine-tuning sessions are conducted on a computing instance equipped with 8 NVIDIA A100 40GB GPUs... Additional training sessions utilize instances with 8 NVIDIA V100 32GB GPUs... To evaluate the efficiency of ELREA, we compared its computation time with that of the baseline model M + Q_base using the same set of hyper-parameters and device configuration on a single NVIDIA A100 80GB GPU...
Software Dependencies | No | Most fine-tuning sessions are conducted on a computing instance equipped with 8 NVIDIA A100 40GB GPUs, employing 4-bit quantization for the backbone model M and bf16 precision for adapters Q. This setup essentially uses QLoRA (Dettmers et al., 2023) rather than LoRA... (The paper mentions QLoRA and Sentence Transformers, but does not provide specific version numbers for any software, libraries, or frameworks used beyond the references to the original papers.)
Experiment Setup | Yes | Our primary experiments involve fine-tuning the Gemma-2b model (Gemma Team, 2024b), specifically gemma-1.1-2b-it, by applying rank-8 LoRA adapters to all linear layers... fine-tune the base adapter Q_base for 2 epochs using the Adam optimizer, with an initial learning rate of 5×10⁻⁵ that linearly decays to zero. Cluster-wise adapters Q_c are initialized from Q_base and fine-tuned for the same duration with a slightly reduced learning rate of 2×10⁻⁵... The maximum token sequence length during training is 2,048, with a batch size of 16 sequences distributed across the GPUs. Following Xia et al. (2024), we set the gradient projection dimensionality for clustering, d_proj, to 8,192...
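The Pseudocode row describes inference as weighting each cluster's expert by the similarity between a test instruction's gradient and the cluster's instances. The paper gives no formula here, so the following is a minimal pure-Python sketch of one plausible reading: cosine similarity against cluster centroids, softmax-normalized into mixture weights over expert outputs. The function names and the softmax choice are illustrative assumptions, not the paper's exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_weights(test_grad, centroids):
    """Weight each cluster expert by the similarity of the test
    instruction's projected gradient to that cluster's centroid."""
    sims = [cosine(test_grad, c) for c in centroids]
    return softmax(sims)


def ensemble_predict(test_grad, centroids, expert_outputs):
    """Combine per-expert output vectors with the cluster weights."""
    w = ensemble_weights(test_grad, centroids)
    dim = len(expert_outputs[0])
    return [sum(w[i] * expert_outputs[i][j] for i in range(len(w)))
            for j in range(dim)]
```

For example, with centroids [[1, 0], [0, 1]] and a test gradient [1, 0], the first expert receives the larger weight, and the weights sum to one.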
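The Software Dependencies and Experiment Setup rows together describe a QLoRA-style configuration: a 4-bit quantized backbone with bf16 rank-8 adapters on all linear layers. A hedged configuration sketch of how that might look with Hugging Face transformers and peft (the Hub model id and library APIs are assumptions on our part; the paper specifies neither libraries nor versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization for the backbone M, bf16 compute (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-2b-it",  # assumed Hub id for gemma-1.1-2b-it
    quantization_config=bnb_config,
)

# Rank-8 LoRA adapters applied to all linear layers, as reported.
lora_config = LoraConfig(r=8, target_modules="all-linear",
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```

This is a configuration fragment, not the authors' code; details such as dropout, alpha, and the quantization data type are left at library defaults because the assessment does not report them.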
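The Experiment Setup row specifies an initial learning rate of 5×10⁻⁵ that linearly decays to zero (2×10⁻⁵ for cluster-wise adapters). The schedule is simple enough to state directly; the helper name below is ours, not from the paper:

```python
def linear_decay_lr(step: int, total_steps: int, init_lr: float) -> float:
    """Learning rate after `step` optimizer steps under linear decay
    from `init_lr` to zero over `total_steps`, clamped at zero."""
    return init_lr * max(0.0, 1.0 - step / total_steps)


# Base adapter Q_base: init_lr = 5e-5; cluster-wise adapters Q_c: 2e-5.
```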