RaSA: Rank-Sharing Low-Rank Adaptation
Authors: Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Finally, we conducted experiments on mathematical reasoning and code generation, demonstrating that the lower reconstruction error translates to improved downstream task performance. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University 2Tencent AI Lab |
| Pseudocode | No | The paper describes update rules and derivations using mathematical equations (Equations 19, 24-26) in Appendix A, but does not present them in a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Code, data and scripts are available at: https://github.com/zwhe99/RaSA. |
| Open Datasets | Yes | Code Generation: We used Magicoder-Evol-Instruct-110k (Wei et al., 2024) as the training data, ... We used Humaneval+ (Liu et al., 2023) as the test set, an extension of the Humaneval benchmark... Mathematical Reasoning: We used MetaMathQA (Yu et al., 2024) as the training data, ... We used MATH (Hendrycks et al., 2021) as the test set... Specifically, we calculate prediction accuracies on the following three benchmarks: (1) HellaSwag (Zellers et al., 2019): ...; (2) WinoGrande (Sakaguchi et al., 2019): ...; (3) ARC-Challenge (Clark et al., 2018): ... |
| Dataset Splits | No | The paper designates specific datasets for training (e.g., Magicoder-Evol-Instruct-110k, MetaMathQA) and separate datasets for testing (e.g., Humaneval+, MATH), but does not specify how these datasets were split into training, validation, or test sets by the authors within their own experimental setup. For scaling analysis, it mentions 'random sampling of 25% and 50% instances from the SFT data for the mathematics reasoning task', but this is not a general dataset split for all experiments. |
| Hardware Specification | Yes | All experiments for the 7-8B models were conducted on 1 node with 8 A100-40G GPUs. For the 70B and MoE models, we used 8 nodes. |
| Software Dependencies | No | The paper mentions software tools like 'BigCode Evaluation Harness' and 'LLMs Evaluation Harness' and libraries such as 'sympy' and the 'LionW optimizer', but does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | Following common practice (Kopiczko et al., 2024; Jiang et al., 2024), we used pre-trained models rather than instruction-tuned ones. We applied PEFTs on all linear modules from attention (Wq, Wk, Wv, Wo) and feed-forward networks (Wup, Wdown, Wgate). We set the model hyper-parameters based on the optimal configurations from Biderman et al. (2024), employing the decoupled LionW optimizer with a batch size of 192, and training for 8 epochs with a learning rate of 5e-4 by default. For RaSA, we set k = max(r/8, 1) based on the analysis in Section 3.2. More details are provided in appendix C. |
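The setup row above references RaSA's rank-sharing hyper-parameter k = max(r/8, 1). As a rough intuition aid only, the sketch below illustrates one simplified reading of rank sharing: each layer's low-rank update combines k rank components drawn from a pool shared across layers with r − k layer-specific components. All names and the exact factorization here are hypothetical simplifications, not the authors' implementation (see their repository for the real code).

```python
import numpy as np

# Hypothetical sketch of rank sharing in low-rank adaptation (NOT the
# authors' RaSA implementation). Each layer's update DeltaW = B @ A uses
# r total ranks, of which k = max(r // 8, 1) come from a shared pool.

rng = np.random.default_rng(0)
d, r, n_layers = 64, 8, 4          # model dim, total rank, number of layers
k = max(r // 8, 1)                 # shared ranks per layer (paper's default rule)

# Shared low-rank factors, reused by every layer.
B_shared = rng.normal(size=(d, k))
A_shared = rng.normal(size=(k, d))

# Layer-specific factors cover the remaining r - k ranks.
layers = [(rng.normal(size=(d, r - k)), rng.normal(size=(r - k, d)))
          for _ in range(n_layers)]

def delta_w(B_local, A_local):
    """Combine shared and layer-specific ranks into one rank-<=r update."""
    B = np.concatenate([B_shared, B_local], axis=1)   # (d, r)
    A = np.concatenate([A_shared, A_local], axis=0)   # (r, d)
    return B @ A                                      # (d, d), rank <= r

updates = [delta_w(Bl, Al) for Bl, Al in layers]
```

Each `updates[i]` has the same parameter budget as a rank-r LoRA update, but the k shared components are trained on gradients from every layer, which is the intuition behind sharing ranks across layers.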