Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters
Authors: Roberto Garcia, Jerry Liu, Daniel Sorvisto, Sabri Eyuboglu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the rank-adaptive framework and the RaNA adapter. We demonstrate that, similar to neuron-adaptive setups, the ranks of the AB matrix decomposition in the proposed RaNA adapters have sparse, input-dependent importances (Figs. 2a, 2b), allowing us to dynamically prune them. We also show that RaNA adapters attain the lowest error in MLP layers when recovering the full MLP outputs, outperforming neuron-adaptive methods by 6.7, 18.1, and 7.4 percentage points on Llama2-7b, Gemma-2b, and Pythia-160M, respectively. Further, we show the effectiveness of RaNA in modern Transformer architectures by applying it to Llama (Touvron et al., 2023) and Gemma (Team et al., 2024). |
| Researcher Affiliation | Academia | Roberto Garcia, Jerry Liu, Daniel Sorvisto, and Sabri Eyuboglu; Institute of Computational and Mathematical Engineering, Stanford University; Department of Computer Science, Stanford University |
| Pseudocode | Yes | A.4 ALGORITHM: We provide a pseudocode implementation of RaNA below: Algorithm 1 (RaNA Layer Compression), Algorithm 2 (RaNA MLP Transformation), Algorithm 3 (RaNA Forward Pass) |
| Open Source Code | Yes | Further, code is available at https://github.com/Roberto09/RaNA. |
| Open Datasets | Yes | We use the RedPajama (Computer, 2023) dataset for Llama2-7b and Gemma-2b, and the Pile (Gao et al., 2020) dataset for Pythia models when evaluating rank contribution sparsity (Sect. 5.2), output errors (Sect. 5.3), and perplexity (Sect. 5.3), and for devising any data-dependent adapter component (e.g. the A and B matrices in RaNA, the activation threshold in CATS, and the slicing and rotating procedure of SliceGPT). |
| Dataset Splits | No | The paper states: "Perplexity is measured on a held-out subset of each model's fine-tuning dataset." and "For our latency evaluations, we leverage 100 sequences from the RedPajama dataset, where adapted models are timed in the task of decoding a sequence of 492 tokens with an initial context ranging from 1 to 1000 tokens.", and mentions zero-shot or five-shot settings for downstream tasks. However, it does not provide specific percentages, counts, or explicit train/test/validation split methodologies for the primary datasets (RedPajama, Pile) used in the experiments. |
| Hardware Specification | Yes | Latency Evaluations: For our latency evaluations, we leverage 100 sequences from the RedPajama dataset, where adapted models are timed in the task of decoding a sequence of 492 tokens with an initial context ranging from 1 to 1000 tokens. Evaluations are performed on an NVIDIA L40S GPU. |
| Software Dependencies | No | The paper mentions "fine-tune adapted models using the Huggingface library (Wolf et al., 2020) and LoRA adapters (Hu et al., 2021)" and refers to "a custom masked GEMV kernel... implemented in Triton (Tillet et al., 2019)". However, no specific version numbers are provided for Huggingface, LoRA, or Triton. |
| Experiment Setup | Yes | Fine-tuning: To assess accuracy and perplexity (Sect. 5.3), we fine-tune adapted models using the Huggingface library (Wolf et al., 2020) and LoRA adapters (Hu et al., 2021) for 31M tokens on Llama2-7b and Gemma-2b, with an AdamW optimizer, where learning rates were selected from several candidates based on each model's performance after 6M tokens of training. In a similar setup, Pythia models are fine-tuned for 61M tokens, with the exception of not leveraging LoRA adapters. |
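To make the core idea behind the evaluated method concrete: the paper describes an AB low-rank decomposition whose ranks have sparse, input-dependent importances that can be pruned at inference time (Algorithm 3, "RaNA Forward Pass"). Below is a minimal, hedged sketch of such a rank-adaptive forward pass in NumPy; the function name `rana_forward` and the rank-scoring heuristic are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rana_forward(x, A, B, k):
    """Illustrative rank-adaptive forward pass (not the paper's exact Algorithm 3).

    The weight is approximated by a low-rank product B @ A. For each input,
    every rank's contribution is scored and only the top-k ranks are used,
    dynamically pruning the rest.
    """
    z = A @ x                                   # rank-space activations, shape (r,)
    # Assumed scoring heuristic: magnitude of each rank's output term,
    # |z_i| * ||B[:, i]||. The paper learns/derives its own importances.
    scores = np.abs(z) * np.linalg.norm(B, axis=0)
    keep = np.argsort(scores)[-k:]              # indices of the top-k ranks
    mask = np.zeros_like(z)
    mask[keep] = 1.0
    return B @ (mask * z)                       # output from kept ranks only

# Usage: a rank-8 adapter on a 16-dim input, keeping 4 of 8 ranks.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
B = rng.standard_normal((16, 8))
x = rng.standard_normal(16)
y = rana_forward(x, A, B, k=4)
```

With `k` equal to the full rank, the masked pass reduces to the plain low-rank product `B @ A @ x`; smaller `k` trades output fidelity for fewer effective ranks, which is the compute/accuracy trade-off the latency evaluations above measure.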