LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments across natural language understanding, dialogue generation, mathematical reasoning, code generation, and image classification tasks, demonstrating that LoRA-Pro substantially improves LoRA's performance, effectively narrowing the gap with full fine-tuning. Our code is publicly available at https://github.com/mrflogs/LoRA-Pro."
Researcher Affiliation | Academia | Zhengbo Wang (1,2), Jian Liang (2,3), Ran He (2,3), Zilei Wang (1), Tieniu Tan (2,4); (1) University of Science and Technology of China; (2) NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; (3) School of Artificial Intelligence, University of Chinese Academy of Sciences; (4) Nanjing University; EMAIL, EMAIL
Pseudocode | Yes | "Appendix C, Optimization Algorithms: In this section, we present the pseudo-code for implementing our LoRA-Pro method using the SGD (Sutskever et al., 2013) and AdamW (Loshchilov & Hutter, 2019) optimizers. The details are provided in Algorithm 1 and Algorithm 2, respectively."
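For orientation, the pseudo-code quoted above operates on the standard LoRA reparameterization W_eff = W0 + (α/r)·B·A, where only the low-rank factors A and B are trained. A minimal pure-Python sketch of that reparameterization follows; the toy dimensions are illustrative, and this is the generic LoRA form, not the paper's LoRA-Pro gradient adjustment itself:

```python
def matmul(X, Y):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W0, A, B, r, alpha):
    """Standard LoRA merge: W0 + (alpha / r) * B @ A.
    W0 is the frozen pretrained weight; B (d_out x r) and A (r x d_in)
    are the trained low-rank adapters."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]

# Toy example: d_out = d_in = 2, rank r = 1, alpha = 2 (so scale = 2,
# the same alpha/r ratio as the paper's default r = 8, alpha = 16).
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[0.0, 0.5]]     # r x d_in
W = lora_effective_weight(W0, A, B, r=1, alpha=2)
```

With these toy values the merged weight is W0 plus a rank-1 perturbation in the first row only, which is exactly the low-rank structure the SGD/AdamW pseudo-code updates.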
Open Source Code | Yes | "Our code is publicly available at https://github.com/mrflogs/LoRA-Pro."
Open Datasets | Yes | "First, we assess natural language understanding capabilities using the GLUE benchmark by fine-tuning the T5-base (Raffel et al., 2020) model in Section 3.1. Next, we evaluate its capabilities in dialogue generation, mathematical reasoning, and code generation using the Llama-2-7B model (Touvron et al., 2023), covered in Section 3.2. We then examine LoRA-Pro's effectiveness on image classification tasks using the CLIP-ViT-B/16 (Radford et al., 2021) model in Section 3.3."
Dataset Splits | Yes | "For the dialogue generation task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 52k subset of the WizardLM dataset (Xu et al., 2024) and evaluate it using the MT-Bench dataset (Zheng et al., 2024a). For the math task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 100k sample from the MetaMathQA dataset (Yu et al., 2024). The model is then evaluated on the GSM8K test set (Cobbe et al., 2021), and we report accuracy as the metric. For the coding task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 100k subset of the Code-Feedback dataset (Zheng et al., 2024b) and test it on the HumanEval dataset (Chen et al., 2021), reporting the PASS@1 metric."
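The PASS@1 metric cited above is the k = 1 case of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): pass@k = E[1 - C(n-c, k)/C(n, k)] over problems, where n completions are sampled per problem and c of them pass the unit tests. A minimal sketch, with a hypothetical list of per-problem pass counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), where n completions are sampled per
    problem and c of them pass. If fewer than k samples fail, the
    estimate is exactly 1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n = 1), pass@1 reduces to the
# plain pass rate. `results` is a hypothetical pass/fail indicator list.
results = [1, 0, 1, 1]
score = sum(pass_at_k(1, c, 1) for c in results) / len(results)
```

With more samples per problem (n > 1), the combinatorial form corrects for the variance of small sample counts, which is why it is preferred over naively averaging empirical pass rates.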
Hardware Specification | Yes | "All experiments are conducted on NVIDIA RTX A6000 GPUs. Memory cost is measured using a single A6000 GPU with a batch size of 1. Training time is recorded on the MetaMathQA dataset using 8 A100 GPUs with DeepSpeed ZeRO-2 stage optimization."
Software Dependencies | No | "To ensure a fair comparison, we align our experimental setup with that of LoRA-GA (Wang et al., 2024a). By default, we fine-tune the model using the AdamW optimizer (Loshchilov & Hutter, 2019) with hyper-parameters β1 = 0.9, β2 = 0.999, and weight decay set to 0. We implement a cosine learning rate schedule with a warmup ratio of 0.03. LoRA is applied to all linear modules, excluding the embedding layer, normalization layer, and classification head."
Experiment Setup | Yes | "Training details. To ensure a fair comparison, we align our experimental setup with that of LoRA-GA (Wang et al., 2024a). By default, we fine-tune the model using the AdamW optimizer (Loshchilov & Hutter, 2019) with hyper-parameters β1 = 0.9, β2 = 0.999, and weight decay set to 0. We implement a cosine learning rate schedule with a warmup ratio of 0.03. LoRA is applied to all linear modules, excluding the embedding layer, normalization layer, and classification head. By default, we set the rank r = 8 and α = 16. For natural language understanding tasks, we fine-tune the T5-base (Raffel et al., 2020) model with a learning rate of 1e-4. The sequence length is set to 128, and the training batch size is 32."
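The quoted schedule (cosine decay with a warmup ratio of 0.03) maps a training step to a learning rate. A minimal sketch follows; the peak learning rate 1e-4 matches the T5-base setting quoted above, while the linear-warmup shape and decay-to-zero floor are common conventions assumed here, not details confirmed by the paper:

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Linear warmup over the first warmup_ratio of training, then
    cosine decay from peak_lr down to 0 over the remaining steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Over a hypothetical 1000-step run, the lr ramps to 1e-4 within the
# first 30 steps (3% warmup) and then decays to ~0 by the final step.
lrs = [cosine_lr_with_warmup(s, 1000) for s in range(1001)]
```

The warmup fraction keeps early AdamW updates small while its moment estimates stabilize, and the cosine tail anneals the step size smoothly toward the end of training.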