LAuReL: Learned Augmented Residual Layer
Authors: Gaurav Menghani, Ravi Kumar, Sanjiv Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that LAUREL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAUREL achieves the same model quality improvements as naively adding an extra layer while using 2.6× fewer parameters. Similarly, when pretraining 1B and 4B parameter LLMs, LAUREL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively. |
| Researcher Affiliation | Industry | ¹Google Research, Mountain View, CA. EMAIL, EMAIL. ²Google Research, New York, NY. EMAIL. |
| Pseudocode | No | The paper uses mathematical equations (e.g., Equation 1, 2) and diagrams (e.g., Figure 1, 2) to describe the LAUREL framework and its variants. However, there are no sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm' presenting structured, code-like steps for any method or procedure. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of their code for the LAUREL framework, nor does it provide a link to a code repository. It mentions `nanoGPT` and `LoRA` as related work but does not offer code for its own contributions. |
| Open Datasets | Yes | We experiment with LAUREL in two domains, namely, vision and language. For the first case, our goal is to improve the image classification accuracy of the ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009). We used the C4 corpus (Raffel et al., 2020) with 10B tokens... We evaluated both the pre-trained baseline and LAUREL models on a host of common LLM tasks such as Q&A, NLU, Math, Code, etc.; see Table 2 for the results. The task type and individual tasks are listed in the first and second columns respectively, and a higher score is better for all the tasks. |
| Dataset Splits | No | The paper mentions training a standard ResNet-50 model on the ImageNet-1K dataset and pre-training LLMs with specific token amounts. It evaluates models on common benchmarks like MATH, MMLU, etc., which typically have predefined splits. However, the paper does not explicitly detail the specific training, validation, and test splits used for its own experiments (e.g., percentages, sample counts, or specific split files), beyond what might be inherent to the referenced benchmarks. |
| Hardware Specification | Yes | In this setup we train a standard ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009) using 16 Google Cloud TPU v5e chips over one epoch with data-augmentation turned on. Both the models were trained using 256 Google Cloud TPU v5e chips for approximately two weeks each... In this second setting, both the baseline and the LAUREL experiment were trained using 1024 Google Cloud TPU v4 chips for slightly more than two days each. We used the C4 corpus (Raffel et al., 2020) with 10B tokens, and a 4×4 Google Cloud TPU v6e (Trillium) topology for compute. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | In this setup we train a standard ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009) using 16 Google Cloud TPU v5e chips over one epoch with data-augmentation turned on. In order to obtain a strong baseline, we fine-tuned the model learning rate schedule and picked a schedule that maximized the average of the best accuracy@1 values over 5 trials... we use the LAUREL-RW and LAUREL-LR versions (with r = 4). Both the models were trained using 256 Google Cloud TPU v5e chips for approximately two weeks each... For the LAUREL-LR variants and its combinations, we picked r = 32. Similarly for LAUREL-PA variant and its combinations, we chose k = 3. |
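The review notes that the paper provides no pseudocode for the LAUREL framework, describing it only via equations. To make the reviewed variants concrete, the following is a minimal NumPy sketch of the two variants named in the experiment setup (LAUREL-RW and LAUREL-LR): it is not the authors' implementation, `f` is a hypothetical stand-in for the layer's non-linear function, and the exact placement and initialization of the low-rank factors `A` and `B` are assumptions paraphrased from the paper's general form `x_{i+1} = α·f(x_i) + g(x_i)`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 8, 4  # hidden dim (illustrative) and low-rank r (r = 4 as in the vision setup)

def f(x):
    """Hypothetical stand-in for the layer's non-linear function."""
    return np.tanh(x)

def residual(x):
    """Standard residual connection: x_{i+1} = f(x_i) + x_i."""
    return f(x) + x

def laurel_rw(x, alpha, beta):
    """LAUREL-RW (assumed form): learned scalar weights on both branches,
    x_{i+1} = alpha * f(x_i) + beta * x_i."""
    return alpha * f(x) + beta * x

def laurel_lr(x, A, B):
    """LAUREL-LR (assumed form): a low-rank learned map added to the
    skip branch, g(x) = x + B @ (A @ x), with A in R^{r x D}, B in R^{D x r}.
    Extra parameter count is 2*D*r instead of D*D for a full matrix."""
    return f(x) + x + B @ (A @ x)

x = rng.standard_normal(D)
# Near-zero init (an assumption) so the layer starts close to a plain residual.
A = 0.01 * rng.standard_normal((r, D))
B = 0.01 * rng.standard_normal((D, r))

y_rw = laurel_rw(x, alpha=1.0, beta=1.0)
y_lr = laurel_lr(x, A, B)

# With alpha = beta = 1 and A, B near zero, both reduce to the vanilla residual.
assert np.allclose(y_rw, residual(x))
assert np.allclose(y_lr, residual(x), atol=1e-2)
```

The sketch also illustrates the parameter-count claim audited above: LAUREL-LR adds only `2*D*r` parameters per layer (plus two scalars for RW), which is how the paper can report quality gains at a small fraction of the cost of an extra full layer.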