Function-Space Learning Rates
Authors: Edward Milsom, Ben Anson, Laurence Aitchison
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we analyse function-space learning rates for concrete neural networks (Section 4.1), and investigate the use of FLeRM to enable hyperparameter transfer when scaling model width, depth, initialisation scale, and LoRA rank (Section 4.2). We demonstrate FLeRM's utility across a range of scenarios, including model width scaling, depth scaling, initialisation scale variation, and even LoRA rank adjustment. In all plots, train loss is averaged over the last 200 batches of training. |
| Researcher Affiliation | Academia | University of Bristol. Correspondence to: Edward Milsom <EMAIL>, Laurence Aitchison <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Recording (red) or setting (FLeRM, blue) function-space learning rates in a training loop. |
| Open Source Code | Yes | We provide our code at https://github.com/edwardmilsom/function-space-learning-rates-paper |
| Open Datasets | Yes | The base ResMLP is an MLP with 4 hidden layers, each with residual connections, trained for 50 epochs on flattened CIFAR-10 images (Krizhevsky & Hinton, 2009). The base transformer is decoder-only, has two self-attention + feedforward blocks (Vaswani et al., 2017), and is trained on a subset of the Wikitext-103 dataset (Merity et al., 2016). The datasets used were 4M token subsets of Cold French Law (Harvard Library Innovation Lab, 2024) and MathPile (Wang et al., 2023). |
| Dataset Splits | No | The paper describes training on CIFAR-10 for 50 epochs; on a subset of Wikitext-103 with a batch size of 20 and sequence length of 256 for 1 epoch; and on 4M token subsets of Cold French Law and MathPile for 500 iterations with batch size 8 and sequence length 512. While these describe training methodology, they do not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined evaluation splits). |
| Hardware Specification | No | This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol http://www.bris.ac.uk/acrc/. We would like to thank Dr. Stewart for GPU compute resources. No specific GPU or CPU models, memory details, or cloud instance specifications are provided. |
| Software Dependencies | No | FLeRM can be applied to any existing neural network in PyTorch. We tokenise the dataset using the GPT-2 tokeniser from the Hugging Face Transformers library (Wolf et al., 2020). The LoRA adapters were initialized using the Gaussian initialization provided by Hugging Face's PEFT library (Mangrulkar et al., 2022). While software like PyTorch, Hugging Face Transformers, and PEFT are mentioned, no specific version numbers are provided for any of them. |
| Experiment Setup | Yes | Both ResMLP and the transformers used the Adam optimiser (Kingma, 2014) with a constant learning rate schedule. We initialise all weight matrices using Kaiming / He initialisation (He et al., 2015)... Biases are initialised to 0. We train for 50 epochs on the CIFAR-10 dataset... using a batch size of 20 and a sequence length of 256. We trained for 500 iterations with a batch size of 8 and sequence length 512. When sweeping for B, we used a fixed learning rate of 10⁻⁴ for A. When sweeping for A, we fixed the learning rate of B as follows: 10⁻³ for GPT-2, 10⁻⁴ for Llama-3.2-1B / Cold French Law, and 5×10⁻⁵ for Llama-3.2-1B / MathPile. |
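To make the Algorithm 1 row above concrete, here is a minimal, hypothetical sketch of the *idea* of recording and then setting function-space learning rates: measure the normalised RMS change each layer's update causes in that layer's output, then rescale the parameter-space learning rate so this function-space movement hits a target. The toy two-layer model, the random stand-in "gradients", and all names below are illustrative assumptions, not the paper's actual FLeRM implementation (which works inside a PyTorch training loop).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))          # a fixed batch of inputs
W1 = rng.normal(size=(16, 16)) / 4.0   # layer 1 weights
W2 = rng.normal(size=(16, 1)) / 4.0    # layer 2 (readout) weights

def rms(a):
    return np.sqrt(np.mean(a ** 2))

# Random directions stand in for real backprop gradients in this sketch.
G1 = rng.normal(size=W1.shape)
G2 = rng.normal(size=W2.shape)

def record_fs_lr_layer1(eta1):
    """Normalised RMS change in layer 1's output from updating W1 alone."""
    h_old = np.tanh(X @ W1)
    h_new = np.tanh(X @ (W1 - eta1 * G1))
    return rms(h_new - h_old) / rms(h_old)

def record_fs_lr_layer2(eta2):
    """Normalised RMS change in the network output from updating W2 alone."""
    h = np.tanh(X @ W1)
    return rms(h @ (eta2 * G2)) / rms(h @ W2)

# "Recording" step: measure function-space learning rates at the current
# parameter-space learning rates.
eta1 = eta2 = 1e-2
fs1 = record_fs_lr_layer1(eta1)
fs2 = record_fs_lr_layer2(eta2)

# "Setting" step: rescale each parameter-space learning rate so the recorded
# function-space learning rate matches a shared target (exact for the linear
# readout, first-order accurate through the tanh nonlinearity).
target = 0.01
fs1_new = record_fs_lr_layer1(eta1 * target / fs1)
fs2_new = record_fs_lr_layer2(eta2 * target / fs2)
print(fs1, fs2, fs1_new, fs2_new)
```

The design point this illustrates: two layers with identical parameter-space learning rates can move the function by very different amounts, which is why rescaling is done per layer against a common function-space target.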