LoRA Learns Less and Forgets Less
Authors: Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John Patrick Cunningham
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100× greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA. |
| Researcher Affiliation | Collaboration | ¹Columbia University, ²Databricks Mosaic Research |
| Pseudocode | No | The paper includes a mathematical formulation of LoRA (W_finetuned = W_pretrained + γ_r AB, with A ∈ R^{d×r}, B ∈ R^{r×k}) but does not present any pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Model checkpoints and LoRA adapters can be accessed at https://github.com/danbider/lora-tradeoffs. |
| Open Datasets | Yes | The first regime is continued pretraining, which involves training on billions of unlabeled domain-specific tokens, most commonly via full finetuning; here we use the StarCoder-Python (Li et al., 2023a) and OpenWebMath (Paster et al., 2023) datasets (Table 1). The second is instruction finetuning, the common scenario for LoRA involving question-answer datasets with tens to hundreds of millions of tokens. Here, we use Magicoder-Evol-Instruct-110K (Wei et al., 2023) and MetaMathQA (Yu et al., 2023). |
| Dataset Splits | Yes | Math: GSM8K (Cobbe et al., 2021). This benchmark includes a collection of 8.5K grade-school math word problems. We evaluate on the test split of GSM8K (1,319 samples) as implemented in the LM Evaluation Harness (Gao et al., 2023), with default generation parameters (temperature=0, 5 few-shot, pass@1). |
| Hardware Specification | Yes | All experiments were done with the Databricks MosaicML composer, streaming, and llm-foundry libraries in conjunction with the Hugging Face peft library on 32 H100-80GB GPUs. |
| Software Dependencies | No | All training was done using the Databricks MosaicML composer, streaming, and llm-foundry repositories, as well as the Hugging Face peft library. While these libraries are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | A Experimental Setup. LoRA configuration for all experiments: All experiments were done with the Databricks MosaicML composer, streaming, and llm-foundry libraries in conjunction with the Hugging Face peft library on 32 H100-80GB GPUs. We targeted all trainable modules inside each of the L Llama transformer blocks: {W_q^(l), W_k^(l), W_v^(l), W_o^(l), W_gate^(l), W_up^(l), W_down^(l)}_{l=1}^{L}. We used ranks of r = 16, 64, 256 and set α = 2r to achieve a constant scaling factor γ_r = 2 across ranks. We use lora_dropout=0.05. For both the Code CPT and Math CPT settings, we train the model once for 20B tokens. We then perform individual cooldowns using intermediate checkpoints as follows: we set a target max training duration (e.g., 8 billion tokens), define the last 20% of max training duration as the cooldown period, and retrain from the latest available checkpoint prior to the cooldown period. Code CPT: Llama-2-7B trained on the StarCoder-Python dataset. seq_len: 4096; optimizer: decoupled_lionw (betas=[0.9, 0.95]); learning_rate: 1.0e-05 for LoRA and full finetuning; scheduler: inv_sqrt_with_warmup (t_scale=1000ba, t_warmup=1000ba, t_cooldown=5086ba, alpha_f_decay=1, alpha_f_cooldown=0); we note that this ends up looking very much like a trapezoidal schedule. weight_decay: 1.0e-06; precision: amp_bf16; global_train_batch_size: 192; device_train_microbatch_size: 6; gradient_clipping: norm (threshold=1); num_gpus: 32. LR scheduler: inverse square root with warmup, t_warmup = 500 batches, t_scale = 500 batches, t_cooldown = 5200 batches, α_f_decay = 1.0, α_f_cooldown = 0.0. |
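The LoRA parameterization quoted in the review (W_finetuned = W_pretrained + γ_r AB, with α = 2r so that γ_r = α/r = 2 at every rank) can be illustrated with a minimal NumPy sketch. The dimensions and random matrices below are illustrative choices for this sketch, not the paper's actual Llama layer sizes.

```python
import numpy as np

def lora_update(W, A, B, r, alpha):
    """Apply a LoRA low-rank update: W' = W + gamma_r * (A @ B),
    where gamma_r = alpha / r is the rank-dependent scaling factor."""
    gamma_r = alpha / r
    return W + gamma_r * (A @ B)

# Illustrative dimensions (real Llama weight matrices are far larger).
d, k, r = 8, 8, 4
alpha = 2 * r  # the paper's choice alpha = 2r gives gamma_r = 2 for every rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))   # frozen pretrained weight
A = rng.standard_normal((d, r))   # trainable low-rank factor, d x r
B = rng.standard_normal((r, k))   # trainable low-rank factor, r x k

W_finetuned = lora_update(W, A, B, r, alpha)

# The perturbation W_finetuned - W has rank at most r, well below
# full rank min(d, k); full finetuning is not constrained this way.
delta = W_finetuned - W
print(np.linalg.matrix_rank(delta))  # at most r (here, 4)
```

This makes concrete the paper's point that full finetuning can learn perturbations of rank 10-100× higher than typical LoRA configurations: LoRA's update is structurally capped at rank r.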
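The inv_sqrt_with_warmup schedule with a terminal cooldown, listed in the hyperparameters above, can be sketched as follows. This is an illustrative approximation of the schedule's shape (linear warmup, inverse-square-root decay, linear cooldown to α_f_cooldown), not composer's exact implementation; the function name and signature are invented for this sketch.

```python
def lr_multiplier(t, t_warmup, t_scale, t_max, t_cooldown, alpha_f_cooldown=0.0):
    """Illustrative LR multiplier: linear warmup to 1.0, inverse-sqrt decay,
    then linear cooldown over the last t_cooldown steps to alpha_f_cooldown."""
    t_cooldown_start = t_max - t_cooldown
    if t < t_warmup:
        return t / t_warmup                        # linear warmup
    if t < t_cooldown_start:
        return (t_scale / max(t, t_scale)) ** 0.5  # inverse-sqrt decay
    # Linear cooldown from the decay value at t_cooldown_start down to alpha_f.
    start = (t_scale / max(t_cooldown_start, t_scale)) ** 0.5
    frac = (t - t_cooldown_start) / t_cooldown
    return start + frac * (alpha_f_cooldown - start)

# Shape check with the Code CPT values (t in batches):
print(lr_multiplier(500, 1000, 1000, 11086, 5086))    # mid-warmup: 0.5
print(lr_multiplier(1000, 1000, 1000, 11086, 5086))   # warmup done: 1.0
print(lr_multiplier(11086, 1000, 1000, 11086, 5086))  # end of cooldown: 0.0
```

With t_warmup = t_scale, the multiplier sits near 1.0 for much of training before the final ramp-down, which matches the paper's remark that the schedule "ends up looking very much like a trapezoidal schedule."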