On the Duality between Gradient Transformations and Adapters

Authors: Lucas Torroba Hennigen, Hunter Lang, Han Guo, Yoon Kim

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical experiments study this connection between linear gradient transformations and adapter-based reparameterizations in the context of memory-efficient LLM training. First, we perform a comparison across gradient projection-based and LoRA-based approaches for memory-efficient training and find that randomly sketching gradients works particularly well (§4.1). ...Results. The results are shown in Tab. 2, where we follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps. We see that one-sided transformations, regardless of their nature, perform somewhat similarly at both 200M and 1.3B scale...
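The duality the paper studies (a gradient step on a LoRA-style adapter factor, with the other factor frozen, is equivalent to a full-weight step with a linearly transformed gradient) can be illustrated numerically. The toy least-squares objective and all variable names below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n = 6, 5, 2, 8

X = rng.normal(size=(n, d_in))
Y = rng.normal(size=(n, d_out))
W0 = rng.normal(size=(d_in, d_out))  # frozen base weights
A = rng.normal(size=(r, d_out))      # frozen adapter factor
B = rng.normal(size=(d_in, r))       # trainable adapter factor

def loss(B):
    """Toy least-squares loss with adapter reparameterization W = W0 + B A."""
    W = W0 + B @ A
    return 0.5 * np.sum((X @ W - Y) ** 2)

# Full-weight gradient G = dL/dW at W = W0 + B A.
W = W0 + B @ A
G = X.T @ (X @ W - Y)

# Duality: dL/dB = G @ A.T, i.e. a linear transformation of the
# full-weight gradient. Verify against a finite-difference gradient.
grad_B = G @ A.T
eps = 1e-6
grad_fd = np.zeros_like(B)
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        Bp = B.copy(); Bp[i, j] += eps
        Bm = B.copy(); Bm[i, j] -= eps
        grad_fd[i, j] = (loss(Bp) - loss(Bm)) / (2 * eps)
assert np.allclose(grad_B, grad_fd, atol=1e-4)

# Consequently, an SGD step on B followed by merging the adapter into
# the full weights equals a full-weight SGD step with gradient G A^T A.
eta = 0.01
W_adapter = W0 + (B - eta * grad_B) @ A
W_full = W - eta * (G @ A.T @ A)
assert np.allclose(W_adapter, W_full)
```

The final assertion is exact algebra: W0 + (B − ηGAᵀ)A = (W0 + BA) − ηGAᵀA, which is why the paper can recast adapter training as full-weight training with a transformed gradient.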
Researcher Affiliation Academia 1Massachusetts Institute of Technology. Correspondence to: Lucas Torroba-Hennigen <EMAIL>.
Pseudocode No The paper includes mathematical equations and theoretical proofs, particularly in Appendix A, but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps.
Open Source Code No The paper does not contain an explicit statement from the authors about releasing their source code, nor does it provide a direct link to a code repository for the methodology described in the paper. It references 'the official implementation in https://github.com/jiaweizzhao/GaLore', which is a third-party implementation for a related work, not the authors' own code.
Open Datasets Yes We use the Llama Transformer architecture (Touvron et al., 2023a) and train on the SlimPajama (Soboleva et al., 2023) dataset, tokenized using the Llama-2 (Touvron et al., 2023b) tokenizer, using sequences of length 2048.
Dataset Splits No We consider two moderate-scale language modeling settings: a 200M setting (training on 5B tokens) and a 1.3B setting (training on 10B tokens). ... All numbers we report are perplexity on a disjoint (validation) set of SlimPajama. The paper mentions training on a specific token count and evaluating on a 'disjoint (validation) set' but does not specify the percentages or absolute counts for the training, validation, or test splits. The methodology for partitioning the SlimPajama dataset is not detailed.
Hardware Specification No The paper mentions general hardware terms such as 'LLM training makes use of accelerators like GPUs' and discusses 'GPU memory', but it does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for the experiments. It refers only vaguely to 'limited compute resources'.
Software Dependencies No The paper mentions software components like 'AdamW', the 'Llama Transformer architecture', 'bfloat16 precision', and a tokenizer, but it does not provide specific version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software dependencies required to reproduce the experiments.
Experiment Setup Yes We use AdamW (Loshchilov & Hutter, 2019) with weight decay 0.1, β1 = 0.9, and β2 = 0.95. We warm up the learning rate to 4 × 10⁻⁴, before decaying it via a cosine decay schedule to 1 × 10⁻⁴. We conduct all training in bfloat16 precision. ... We follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps.
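The quoted schedule (warmup to a peak learning rate of 4e-4, then cosine decay to 1e-4) can be sketched as a standalone function. The warmup length and the linear warmup shape are assumptions; the paper states only the peak and final values and that the decay is cosine:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=4e-4, final_lr=1e-4):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    peak_lr and final_lr follow the paper's quoted setup; the linear
    warmup shape and warmup_steps are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup period.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine interpolation from peak_lr down to final_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, with 1000 total steps and 100 warmup steps, the rate reaches 4e-4 at the end of warmup, passes 2.5e-4 at the schedule midpoint, and lands at 1e-4 on the final step.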