On the Duality between Gradient Transformations and Adapters

Authors: Lucas Torroba Hennigen, Hunter Lang, Han Guo, Yoon Kim

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical experiments study this connection between linear gradient transformations and adapter-based reparameterizations in the context of memory-efficient LLM training. First, we perform a comparison across gradient projection-based and LoRA-based approaches for memory-efficient training and find that randomly sketching gradients works particularly well (§4.1). ...Results. The results are shown in Tab. 2, where we follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps. We see that one-sided transformations, regardless of their nature, perform somewhat similarly at both 200M and 1.3B scale...
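The duality the paper studies (a gradient step on a LoRA-style adapter factor, with the other factor frozen, is equivalent to a full-weight step with a linearly transformed gradient) can be illustrated numerically. The toy least-squares objective and all variable names below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n = 6, 5, 2, 8

X = rng.normal(size=(n, d_in))
Y = rng.normal(size=(n, d_out))
W0 = rng.normal(size=(d_in, d_out))  # frozen base weights
A = rng.normal(size=(r, d_out))      # frozen adapter factor
B = rng.normal(size=(d_in, r))       # trainable adapter factor

def loss(B):
    """Toy least-squares loss with adapter reparameterization W = W0 + B A."""
    W = W0 + B @ A
    return 0.5 * np.sum((X @ W - Y) ** 2)

# Full-weight gradient G = dL/dW at W = W0 + B A.
W = W0 + B @ A
G = X.T @ (X @ W - Y)

# Duality: dL/dB = G @ A.T, i.e. a linear transformation of the
# full-weight gradient. Verify against a finite-difference gradient.
grad_B = G @ A.T
eps = 1e-6
grad_fd = np.zeros_like(B)
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        Bp = B.copy(); Bp[i, j] += eps
        Bm = B.copy(); Bm[i, j] -= eps
        grad_fd[i, j] = (loss(Bp) - loss(Bm)) / (2 * eps)
assert np.allclose(grad_B, grad_fd, atol=1e-4)

# Consequently, an SGD step on B followed by merging the adapter into
# the full weights equals a full-weight SGD step with gradient G A^T A.
eta = 0.01
W_adapter = W0 + (B - eta * grad_B) @ A
W_full = W - eta * (G @ A.T @ A)
assert np.allclose(W_adapter, W_full)
```

The final assertion is exact algebra: W0 + (B − ηGAᵀ)A = (W0 + BA) − ηGAᵀA, which is why the paper can recast adapter training as full-weight training with a transformed gradient.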
Researcher Affiliation Academia 1Massachusetts Institute of Technology. Correspondence to: Lucas Torroba-Hennigen <EMAIL>.
Pseudocode No The paper includes mathematical equations and theoretical proofs, particularly in Appendix A, but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps.
Open Source Code No The paper does not contain an explicit statement from the authors about releasing their source code, nor does it provide a direct link to a code repository for the methodology described in the paper. It references 'the official implementation in https://github.com/jiaweizzhao/GaLore', which is a third-party implementation for a related work, not the authors' own code.
Open Datasets Yes We use the Llama Transformer architecture (Touvron et al., 2023a) and train on the SlimPajama (Soboleva et al., 2023) dataset, tokenized using the Llama-2 (Touvron et al., 2023b) tokenizer, using sequences of length 2048.
Dataset Splits No We consider two moderate-scale language modeling settings: a 200M setting (training on 5B tokens) and a 1.3B setting (training on 10B tokens). ... All numbers we report are perplexity on a disjoint (validation) set of SlimPajama. The paper mentions training on a specific token count and evaluating on a 'disjoint (validation) set' but does not specify the percentages or absolute counts for the training, validation, or test splits. The methodology for partitioning the SlimPajama dataset is not detailed.
Hardware Specification No The paper mentions general hardware terms such as 'LLM training makes use of accelerators like GPUs' and discusses 'GPU memory', but it does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for the experiments. It refers only vaguely to 'limited compute resources'.
Software Dependencies No The paper mentions software components like 'AdamW', the 'Llama Transformer architecture', 'bfloat16 precision', and a tokenizer, but it does not provide specific version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software dependencies required to reproduce the experiments.
Experiment Setup Yes We use AdamW (Loshchilov & Hutter, 2019) with weight decay 0.1, β1 = 0.9, and β2 = 0.95. We warm up the learning rate to 4 × 10⁻⁴, before decaying it via a cosine decay schedule to 1 × 10⁻⁴. We conduct all training in bfloat16 precision. ... We follow the original GaLore paper and use a rank of 256 for the 200M model and a rank of 512 for the 1.3B model, and further merge the adapters into the full weights and reinitialize them every 200 steps.
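The quoted schedule (warmup to a peak learning rate of 4e-4, then cosine decay to 1e-4) can be sketched as a standalone function. The warmup length and the linear warmup shape are assumptions; the paper states only the peak and final values and that the decay is cosine:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=4e-4, final_lr=1e-4):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    peak_lr and final_lr follow the paper's quoted setup; the linear
    warmup shape and warmup_steps are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup period.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine interpolation from peak_lr down to final_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, with 1000 total steps and 100 warmup steps, the rate reaches 4e-4 at the end of warmup, passes 2.5e-4 at the schedule midpoint, and lands at 1e-4 on the final step.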