Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters
Authors: Roberto Garcia, Jerry Liu, Daniel Sorvisto, Sabri Eyuboglu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the rank-adaptive framework and the RaNA adapter. We demonstrate that, similar to neuron-adaptive setups, the ranks of the AB matrix decomposition in the proposed RaNA adapters have sparse, input-dependent importances (Figs. 2a, 2b), allowing us to dynamically prune them. We also show that RaNA adapters attain the lowest error in MLP layers when recovering the full MLP outputs, outperforming neuron-adaptive methods by 6.7, 18.1, and 7.4 percentage points on Llama2-7b, Gemma-2b, and Pythia-160M, respectively. Further, we show the effectiveness of RaNA in modern Transformer architectures by applying it to Llama (Touvron et al., 2023) and Gemma (Team et al., 2024). |
| Researcher Affiliation | Academia | Roberto Garcia, Jerry Liu, Daniel Sorvisto, and Sabri Eyuboglu; Institute of Computational and Mathematical Engineering, Stanford University; Department of Computer Science, Stanford University |
| Pseudocode | Yes | A.4 ALGORITHM: We provide a pseudocode implementation of RaNA below: Algorithm 1 (RaNA Layer Compression), Algorithm 2 (RaNA MLP Transformation), Algorithm 3 (RaNA Forward Pass) |
| Open Source Code | Yes | Further, code is available at https://github.com/Roberto09/RaNA. |
| Open Datasets | Yes | We use the RedPajama (Computer, 2023) dataset for Llama2-7b and Gemma-2b, and the Pile (Gao et al., 2020) dataset for Pythia models when evaluating rank contribution sparsity (Sect. 5.2), output errors (Sect. 5.3), and perplexity (Sect. 5.3), and for devising any data-dependent adapter component (e.g. the A and B matrices in RaNA, the activation threshold in CATS, and the slicing and rotating procedure of SliceGPT). |
| Dataset Splits | No | The paper states: "Perplexity is measured on a held-out subset of each model's fine-tuning dataset." and "For our latency evaluations, we leverage 100 sequences from the RedPajama dataset, where adapted models are timed in the task of decoding a sequence of 492 tokens with an initial context ranging from 1 to 1000 tokens.", and mentions zero-shot or five-shot settings for downstream tasks. However, it does not provide specific percentages, counts, or explicit train/test/validation split methodologies for the primary datasets (RedPajama, Pile) used in the experiments. |
| Hardware Specification | Yes | Latency Evaluations: For our latency evaluations, we leverage 100 sequences from the RedPajama dataset, where adapted models are timed in the task of decoding a sequence of 492 tokens with an initial context ranging from 1 to 1000 tokens. Evaluations are performed on an NVIDIA L40S GPU. |
| Software Dependencies | No | The paper mentions "fine-tune adapted models using the Huggingface library (Wolf et al., 2020) and LoRA adapters (Hu et al., 2021)" and refers to "a custom masked GEMV kernel... implemented in Triton (Tillet et al., 2019)". However, no specific version numbers are provided for Huggingface, LoRA, or Triton. |
| Experiment Setup | Yes | Fine-tuning: To assess accuracy and perplexity (Sect. 5.3), we fine-tune adapted models using the Huggingface library (Wolf et al., 2020) and LoRA adapters (Hu et al., 2021) for 31M tokens on Llama2-7b and Gemma-2b, with an AdamW optimizer, where learning rates were selected from several candidates based on each model's performance after 6M tokens of training. In a similar setup, Pythia models are fine-tuned for 61M tokens, with the exception of not leveraging LoRA adapters. |
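To make the core idea behind the evaluated method concrete: the paper describes an AB low-rank decomposition whose ranks have sparse, input-dependent importances that can be pruned at inference time (Algorithm 3, "RaNA Forward Pass"). Below is a minimal, hedged sketch of such a rank-adaptive forward pass in NumPy; the function name `rana_forward` and the rank-scoring heuristic are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rana_forward(x, A, B, k):
    """Illustrative rank-adaptive forward pass (not the paper's exact Algorithm 3).

    The weight is approximated by a low-rank product B @ A. For each input,
    every rank's contribution is scored and only the top-k ranks are used,
    dynamically pruning the rest.
    """
    z = A @ x                                   # rank-space activations, shape (r,)
    # Assumed scoring heuristic: magnitude of each rank's output term,
    # |z_i| * ||B[:, i]||. The paper learns/derives its own importances.
    scores = np.abs(z) * np.linalg.norm(B, axis=0)
    keep = np.argsort(scores)[-k:]              # indices of the top-k ranks
    mask = np.zeros_like(z)
    mask[keep] = 1.0
    return B @ (mask * z)                       # output from kept ranks only

# Usage: a rank-8 adapter on a 16-dim input, keeping 4 of 8 ranks.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
B = rng.standard_normal((16, 8))
x = rng.standard_normal(16)
y = rana_forward(x, A, B, k=4)
```

With `k` equal to the full rank, the masked pass reduces to the plain low-rank product `B @ A @ x`; smaller `k` trades output fidelity for fewer effective ranks, which is the compute/accuracy trade-off the latency evaluations above measure.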