MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

Authors: Sheng Wang, Liheng Chen, Pengan Chen, Jingwei Dong, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical experiments demonstrate approximately 8× parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and the MoS method may illuminate future developments of more parameter-efficient finetuning methods.
Researcher Affiliation | Academia | Sheng Wang, Liheng Chen, Pengan Chen (School of Computing and Data Science, The University of Hong Kong); Jingwei Dong (School of Electrical Engineering, Xi'an Jiaotong University); Boyang Xue, Jiyue Jiang (Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong); Lingpeng Kong, Chuan Wu (School of Computing and Data Science, The University of Hong Kong)
Pseudocode | No | The paper describes its methods and formulations using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is officially available at https://github.com/Forence1999/MoS.
Open Datasets | Yes | We finetune base models on the Super-Natural Instructions (SuperNI; Wang et al., 2022) dataset, and evaluate them on both the Massive Multitask Language Understanding (MMLU; Hendrycks et al., 2021) and TyDiQA (Clark et al., 2020) datasets for factual knowledge and multilingual capabilities, respectively. For general and mathematical reasoning abilities, we finetune models on Flan V2 and its CoT split (Longpre et al., 2023), and conduct evaluation on Big-Bench-Hard (BBH; Suzgun et al., 2022) and the test set of Grade School Math (GSM8K; Cobbe et al., 2021), respectively. Moreover, coding skills are evaluated on the HumanEval (Chen et al., 2021) dataset after models are finetuned on the CodeAlpaca (Chaudhary, 2023) dataset.
Dataset Splits | Yes | We finetune base models on the Super-Natural Instructions (SuperNI; Wang et al., 2022) dataset, and evaluate them on both the Massive Multitask Language Understanding (MMLU; Hendrycks et al., 2021) and TyDiQA (Clark et al., 2020) datasets for factual knowledge and multilingual capabilities, respectively. For general and mathematical reasoning abilities, we finetune models on Flan V2 and its CoT split (Longpre et al., 2023), and conduct evaluation on Big-Bench-Hard (BBH; Suzgun et al., 2022) and the test set of Grade School Math (GSM8K; Cobbe et al., 2021), respectively. Moreover, coding skills are evaluated on the HumanEval (Chen et al., 2021) dataset after models are finetuned on the CodeAlpaca (Chaudhary, 2023) dataset. In our evaluation, we report the exact match (EM) score under a zero-shot setting. We assess models using 8-shot examples and chain-of-thought (CoT) prompting. Our evaluation employs 3 official few-shot examples without chain-of-thought (Direct). We adopt the gold passage (GP) setting, where the correct answer is provided in a reference passage, and utilize one-shot prompting. We report the pass@1 metric using zero-shot prompting with a sampling temperature of 0.1.
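The exact match (EM) score mentioned above can be sketched as a simple corpus-level metric. This is a minimal illustration, not the paper's official evaluation script; the normalization (lowercasing and whitespace stripping) is an assumption, as benchmarks often apply their own normalization rules.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Check whether a prediction matches the gold answer after light
    normalization (lowercasing and whitespace stripping) -- an assumed
    normalization scheme for illustration."""
    return prediction.strip().lower() == reference.strip().lower()

def em_score(predictions, references):
    """Corpus-level exact-match score as a percentage."""
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)

# Toy example: 2 of 3 predictions match their references.
preds = ["Paris", "42", "blue"]
golds = ["paris", "41", "blue"]
print(em_score(preds, golds))  # → 66.66666666666667
```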
Hardware Specification | Yes | Our experiments are conducted on a single NVIDIA A100-40G GPU.
Software Dependencies | No | The paper mentions using QLoRA (Dettmers et al., 2023), the Paged AdamW optimizer, and vLLM (Kwon et al., 2023) for inference, but it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | We utilize 4-bit quantized versions of LLaMA2-7B, 13B, and LLaMA3.2-3B models in our experiments. Next, we apply LoRA to all linear layers within the Transformer blocks, including the query, key, value, output, up, gate, and down projection weights, setting the scaling factor α to 16 and the dropout rate to 0.1. For more efficient finetuning, we also follow the configuration from Wang et al. (2024b) to set a batch size of 16 and the maximum sequence length to 512, truncating samples during preprocessing if needed. We also cap the maximum gradient norm at 0.3 to enhance training stability. LLaMA2-7B and 13B undergo 10,000 steps of finetuning using a linear learning rate scheduler with a warmup ratio of 3%, while LLaMA3.2-3B is finetuned for one epoch in each task. Additionally, we search for the optimal learning rate for LoRA, and apply this value to both LoRA and MoS. Specifically, with a LoRA rank of 8, we search for the best learning rate from {2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3}. Our preliminary experiments demonstrate that 2e-4 performs the best.
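The setup above can be collected into a single configuration sketch. This is a hedged illustration, not the official MoS training script: the dictionary keys and the `target_modules` names (which follow common LLaMA projection-layer naming) are assumptions, while the hyperparameter values are taken directly from the text.

```python
# Hyperparameters reported in the experiment setup above; key names and
# target-module names are illustrative assumptions, not the official config.
config = {
    "lora_rank": 8,
    "lora_alpha": 16,          # scaling factor α
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "up_proj", "gate_proj", "down_proj"],
    "batch_size": 16,
    "max_seq_length": 512,
    "max_grad_norm": 0.3,
    "warmup_ratio": 0.03,      # 3% warmup under a linear scheduler
    "max_steps": 10_000,       # LLaMA2-7B/13B; LLaMA3.2-3B runs one epoch
}

# Learning-rate search space from the paper; 2e-4 was reported best.
lr_grid = [2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3]

def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio over total steps."""
    return int(total_steps * warmup_ratio)

print(warmup_steps(config["max_steps"], config["warmup_ratio"]))  # → 300
```

A 3% warmup over 10,000 steps thus corresponds to 300 linear warmup steps before the learning rate begins to decay.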