On the Duality between Gradient Transformations and Adapters
Authors: Lucas Torroba Hennigen, Hunter Lang, Han Guo, Yoon Kim
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical experiments study this connection between linear gradient transformations and adapter-based reparameterizations in the context of memory-efficient LLM training. First, we perform a comparison across gradient projection-based and LoRA-based approaches for memory-efficient training and find that randomly sketching gradients works particularly well (§4.1). ...Results. The results are shown in Tab. 2, where we follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps. We see that one-sided transformations, regardless of their nature, perform somewhat similarly at both 200M and 1.3B scale... |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology. Correspondence to: Lucas Torroba-Hennigen <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations and theoretical proofs, particularly in Appendix A, but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps. |
| Open Source Code | No | The paper does not contain an explicit statement from the authors about releasing their source code, nor does it provide a direct link to a code repository for the methodology described in the paper. It references 'the official implementation in https://github.com/jiaweizzhao/GaLore', which is a third-party implementation for a related work, not the authors' own code. |
| Open Datasets | Yes | We use the Llama Transformer architecture (Touvron et al., 2023a) and train on the SlimPajama (Soboleva et al., 2023) dataset, tokenized using the Llama-2 (Touvron et al., 2023b) tokenizer, using sequences of length 2048. |
| Dataset Splits | No | We consider two moderate-scale language modeling settings: a 200M setting (training on 5B tokens) and a 1.3B setting (training on 10B tokens). ... All numbers we report are perplexity on a disjoint (validation) set of SlimPajama. The paper mentions training on a specific token count and evaluating on a 'disjoint (validation) set' but does not specify the percentages or absolute counts for training, validation, or test splits. The methodology for how the SlimPajama dataset was partitioned is not detailed. |
| Hardware Specification | No | The paper mentions general hardware terms such as 'LLM training makes use of accelerators like GPUs' and discusses 'GPU memory' but does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for running their experiments. It only refers vaguely to 'limited compute resources'. |
| Software Dependencies | No | The paper mentions software components like 'AdamW', the 'Llama Transformer architecture', 'bfloat16 precision', and a tokenizer, but it does not provide specific version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) with weight decay 0.1, β1 = 0.9 and β2 = 0.95. We warm up the learning rate to 4×10⁻⁴, before decaying it via a cosine decay schedule to 1×10⁻⁴. We conduct all training in bfloat16 precision. ... We follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps. |
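The "randomly sketching gradients" approach quoted in the Research Type row can be illustrated with a minimal NumPy sketch. This is a hedged illustration only, assuming a one-sided Gaussian sketch matrix `S` of rank `r` and the hypothetical dimensions `d_out`, `d_in`; the paper's exact sketch construction, scaling, and optimizer integration may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 32, 8  # hypothetical layer dims and sketch rank (not from the paper)

# One-sided random sketch: compress the gradient's output dimension to rank r.
S = rng.standard_normal((r, d_out)) / np.sqrt(r)  # Gaussian sketch matrix

G = rng.standard_normal((d_out, d_in))            # full gradient of a weight matrix

G_sketched = S @ G          # (r, d_in): optimizer states can live in this small space
G_back = S.T @ G_sketched   # (d_out, d_in): lift the processed update back to full shape

# Memory saving: optimizer states scale with r * d_in instead of d_out * d_in.
print(G_sketched.shape, G_back.shape)
```

The memory benefit comes from keeping Adam-style moment estimates only for the `(r, d_in)` sketched gradient rather than the full `(d_out, d_in)` matrix.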
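The learning-rate schedule quoted in the Experiment Setup row (warmup to 4×10⁻⁴, cosine decay to 1×10⁻⁴) can be written as a small scheduler function. The peak and final rates come from the paper's text; the warmup and total step counts below are hypothetical placeholders, since the paper excerpt does not state them.

```python
import math

peak_lr, final_lr = 4e-4, 1e-4          # values quoted in the Experiment Setup row
warmup_steps, total_steps = 100, 1000   # hypothetical step counts, not given in the excerpt

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

At `step == warmup_steps` the schedule reaches the peak rate exactly, and at `step == total_steps` the cosine term bottoms out at the final rate.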
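The merge-and-reinitialize procedure mentioned twice above ("merge the adapters into the full weights and reinitialize them every 200 steps") can be sketched as follows. The 200-step interval is from the paper; the dimensions, the zero-initialization of `B`, and the placeholder training loop are assumptions for illustration (the paper uses ranks of 256 and 512).

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in, r = 64, 32, 8   # hypothetical dims; the paper uses rank 256 (200M) and 512 (1.3B)
merge_every = 200            # merge interval quoted in the Experiment Setup row

W = rng.standard_normal((d_out, d_in))  # full weight matrix
B = np.zeros((d_out, r))                # adapter initialized so that B @ A == 0
A = rng.standard_normal((r, d_in))

for step in range(1, 401):
    # ... adapter training would update A and B here ...
    if step % merge_every == 0:
        W = W + B @ A                    # fold the low-rank adapter into the full weights
        B = np.zeros((d_out, r))         # reinitialize the adapter to a fresh zero product
        A = rng.standard_normal((r, d_in))
```

Because `B` starts at zero after each merge, the merged model's function is unchanged at the moment of reinitialization, while subsequent updates explore a fresh low-rank subspace.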