Ensembles of Low-Rank Expert Adapters

Authors: Yinghao Li, Vianne Gao, Chao Zhang, MohamadAli Torkamani

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.
Researcher Affiliation | Industry | 1Amazon Web Services 2Amazon.com
Pseudocode | No | The pipeline, shown in Figure 1, consists of three main steps: 1) full-data adapter tuning, 2) gradient calculation, and 3) clustering and per-cluster fine-tuning. During inference, we estimate the similarity between the gradient of test instructions and the cluster instances to determine the influence of each cluster on the final prediction.
Open Source Code | No | Available at https://github.com/hendrycks/math. (This refers to third-party code for parsing results and assessing accuracy, not the source code for the ELREA methodology presented in the paper.)
Open Datasets | Yes | For the first category, following Xia et al. (2024), we employ Flan V2 (Longpre et al., 2023), CoT (Wei et al., 2022), Dolly-15k (Conover et al., 2023), and Open Assistant Conversations (Köpf et al., 2023) for fine-tuning, and MMLU (Hendrycks et al., 2021a) and BIG-bench Hard (BBH; bench authors, 2023; Suzgun et al., 2023) to test model performance. In practice, we directly use the pre-processed dataset provided by Xia et al. (2024), which consolidates these datasets into a unified format suitable for fine-tuning. (Footnote: Available at https://huggingface.co/datasets/princeton-nlp/less_data.)
Dataset Splits | Yes | Training and test instances for the mathematical reasoning datasets: MATH (Hendrycks et al., 2021b): 7,500 / 1,000; GSM8k (Cobbe et al., 2021): 7,441 / 1,000; SVAMP (Patel et al., 2021): 677 / 280; MathQA (Amini et al., 2019): 26,287 / 998. The text also explicitly distinguishes between fine-tuning and test datasets.
Hardware Specification | Yes | Most fine-tuning sessions are conducted on a computing instance equipped with 8 NVIDIA A100 40GB GPUs... Additional training sessions utilize instances with 8 NVIDIA V100 32GB GPUs... To evaluate the efficiency of ELREA, we compared its computation time with that of the baseline model M + Q_base using the same set of hyper-parameters and device configuration on a single NVIDIA A100 80GB GPU...
Software Dependencies | No | Most fine-tuning sessions are conducted on a computing instance equipped with 8 NVIDIA A100 40GB GPUs, employing 4-bit quantization for the backbone model M and bf16 precision for adapters Q. This setup essentially uses QLoRA (Dettmers et al., 2023) rather than LoRA... (The paper mentions QLoRA and Sentence Transformers, but does not provide specific version numbers for any software, libraries, or frameworks used beyond the references to the original papers.)
Experiment Setup | Yes | Our primary experiments involve fine-tuning the Gemma-2b model (Gemma Team, 2024b), specifically gemma-1.1-2b-it, by applying rank-8 LoRA adapters to all linear layers... fine-tune the base adapter Q_base for 2 epochs using the Adam optimizer, with an initial learning rate of 5×10⁻⁵ that linearly decays to zero. Cluster-wise adapters Q_c are initialized from Q_base and fine-tuned for the same duration with a slightly reduced learning rate of 2×10⁻⁵... The maximum token sequence length during training is 2,048, with a batch size of 16 sequences distributed across the GPUs. Following Xia et al. (2024), we set the gradient projection dimensionality for clustering, d_proj, to 8,192...
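The Pseudocode row describes inference as weighting each cluster's expert by the similarity between a test instruction's gradient and the cluster's instances. The paper gives no formula here, so the following is a minimal pure-Python sketch of one plausible reading: cosine similarity against cluster centroids, softmax-normalized into mixture weights over expert outputs. The function names and the softmax choice are illustrative assumptions, not the paper's exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_weights(test_grad, centroids):
    """Weight each cluster expert by the similarity of the test
    instruction's projected gradient to that cluster's centroid."""
    sims = [cosine(test_grad, c) for c in centroids]
    return softmax(sims)


def ensemble_predict(test_grad, centroids, expert_outputs):
    """Combine per-expert output vectors with the cluster weights."""
    w = ensemble_weights(test_grad, centroids)
    dim = len(expert_outputs[0])
    return [sum(w[i] * expert_outputs[i][j] for i in range(len(w)))
            for j in range(dim)]
```

For example, with centroids [[1, 0], [0, 1]] and a test gradient [1, 0], the first expert receives the larger weight, and the weights sum to one.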
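The Software Dependencies and Experiment Setup rows together describe a QLoRA-style configuration: a 4-bit quantized backbone with bf16 rank-8 adapters on all linear layers. A hedged configuration sketch of how that might look with Hugging Face transformers and peft (the Hub model id and library APIs are assumptions on our part; the paper specifies neither libraries nor versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization for the backbone M, bf16 compute (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-2b-it",  # assumed Hub id for gemma-1.1-2b-it
    quantization_config=bnb_config,
)

# Rank-8 LoRA adapters applied to all linear layers, as reported.
lora_config = LoraConfig(r=8, target_modules="all-linear",
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```

This is a configuration fragment, not the authors' code; details such as dropout, alpha, and the quantization data type are left at library defaults because the assessment does not report them.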
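The Experiment Setup row specifies an initial learning rate of 5×10⁻⁵ that linearly decays to zero (2×10⁻⁵ for cluster-wise adapters). The schedule is simple enough to state directly; the helper name below is ours, not from the paper:

```python
def linear_decay_lr(step: int, total_steps: int, init_lr: float) -> float:
    """Learning rate after `step` optimizer steps under linear decay
    from `init_lr` to zero over `total_steps`, clamped at zero."""
    return init_lr * max(0.0, 1.0 - step / total_steps)


# Base adapter Q_base: init_lr = 5e-5; cluster-wise adapters Q_c: 2e-5.
```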