LAuReL: Learned Augmented Residual Layer
Authors: Gaurav Menghani, Ravi Kumar, Sanjiv Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that LAUREL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAUREL achieves the same model quality improvements as naively adding an extra layer while using 2.6× fewer parameters. Similarly, when pretraining 1B and 4B parameter LLMs, LAUREL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively. |
| Researcher Affiliation | Industry | ¹Google Research, Mountain View, CA. EMAIL, EMAIL. ²Google Research, New York, NY. EMAIL. |
| Pseudocode | No | The paper uses mathematical equations (e.g., Equation 1, 2) and diagrams (e.g., Figure 1, 2) to describe the LAUREL framework and its variants. However, there are no sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm' presenting structured, code-like steps for any method or procedure. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of their code for the LAUREL framework, nor does it provide a link to a code repository. It mentions `nanoGPT` and `LoRA` as related work but does not offer code for its own contributions. |
| Open Datasets | Yes | We experiment with LAUREL in two domains, namely, vision and language. For the first case, our goal is to improve the image classification accuracy of the ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009). We used the C4 corpus (Raffel et al., 2020) with 10B tokens... We evaluated both the pre-trained baseline and LAUREL models on a host of common LLM tasks such as Q&A, NLU, Math, Code, etc.; see Table 2 for the results. The task type and individual tasks are listed in the first and second columns respectively, and a higher score is better for all the tasks. |
| Dataset Splits | No | The paper mentions training a standard ResNet-50 model on the ImageNet-1K dataset and pre-training LLMs with specific token amounts. It evaluates models on common benchmarks like MATH, MMLU, etc., which typically have predefined splits. However, the paper does not explicitly detail the specific training, validation, and test splits used for its own experiments (e.g., percentages, sample counts, or specific split files), beyond what might be inherent to the referenced benchmarks. |
| Hardware Specification | Yes | In this setup we train a standard ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009) using 16 Google Cloud TPU v5e chips over one epoch with data-augmentation turned on. Both the models were trained using 256 Google Cloud TPU v5e chips for approximately two weeks each... In this second setting, both the baseline and the LAUREL experiment were trained using 1024 Google Cloud TPU v4 chips for slightly more than two days each. We used the C4 corpus (Raffel et al., 2020) with 10B tokens, and a 4×4 Google Cloud TPU v6e (Trillium) topology for compute. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | In this setup we train a standard ResNet-50 model on the ImageNet-1K dataset (Deng et al., 2009) using 16 Google Cloud TPU v5e chips over one epoch with data-augmentation turned on. In order to obtain a strong baseline, we fine-tuned the model learning rate schedule and picked a schedule that maximized the average of the best accuracy@1 values over 5 trials... we use the LAUREL-RW and LAUREL-LR versions (with r = 4). Both the models were trained using 256 Google Cloud TPU v5e chips for approximately two weeks each... For the LAUREL-LR variants and its combinations, we picked r = 32. Similarly for LAUREL-PA variant and its combinations, we chose k = 3. |
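The review notes that the paper provides no pseudocode for the LAUREL framework, describing it only via equations. To make the reviewed variants concrete, the following is a minimal NumPy sketch of the two variants named in the experiment setup (LAUREL-RW and LAUREL-LR): it is not the authors' implementation, `f` is a hypothetical stand-in for the layer's non-linear function, and the exact placement and initialization of the low-rank factors `A` and `B` are assumptions paraphrased from the paper's general form `x_{i+1} = α·f(x_i) + g(x_i)`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 8, 4  # hidden dim (illustrative) and low-rank r (r = 4 as in the vision setup)

def f(x):
    """Hypothetical stand-in for the layer's non-linear function."""
    return np.tanh(x)

def residual(x):
    """Standard residual connection: x_{i+1} = f(x_i) + x_i."""
    return f(x) + x

def laurel_rw(x, alpha, beta):
    """LAUREL-RW (assumed form): learned scalar weights on both branches,
    x_{i+1} = alpha * f(x_i) + beta * x_i."""
    return alpha * f(x) + beta * x

def laurel_lr(x, A, B):
    """LAUREL-LR (assumed form): a low-rank learned map added to the
    skip branch, g(x) = x + B @ (A @ x), with A in R^{r x D}, B in R^{D x r}.
    Extra parameter count is 2*D*r instead of D*D for a full matrix."""
    return f(x) + x + B @ (A @ x)

x = rng.standard_normal(D)
# Near-zero init (an assumption) so the layer starts close to a plain residual.
A = 0.01 * rng.standard_normal((r, D))
B = 0.01 * rng.standard_normal((D, r))

y_rw = laurel_rw(x, alpha=1.0, beta=1.0)
y_lr = laurel_lr(x, A, B)

# With alpha = beta = 1 and A, B near zero, both reduce to the vanilla residual.
assert np.allclose(y_rw, residual(x))
assert np.allclose(y_lr, residual(x), atol=1e-2)
```

The sketch also illustrates the parameter-count claim audited above: LAUREL-LR adds only `2*D*r` parameters per layer (plus two scalars for RW), which is how the paper can report quality gains at a small fraction of the cost of an extra full layer.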