Function-Space Learning Rates
Authors: Edward Milsom, Ben Anson, Laurence Aitchison
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we analyse function-space learning rates for concrete neural networks (Section 4.1), and investigate the use of FLeRM to enable hyperparameter transfer when scaling model width, depth, initialisation scale, and LoRA rank (Section 4.2). We demonstrate FLeRM's utility across a range of scenarios, including model width scaling, depth scaling, initialisation scale variation, and even LoRA rank adjustment. In all plots, train loss is averaged over the last 200 batches of training. |
| Researcher Affiliation | Academia | University of Bristol. Correspondence to: Edward Milsom <EMAIL>, Laurence Aitchison <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Recording (red) or setting (FLeRM, blue) function-space learning rates in a training loop. |
| Open Source Code | Yes | We provide our code at https://github.com/edwardmilsom/function-space-learning-rates-paper |
| Open Datasets | Yes | The base ResMLP is an MLP with 4 hidden layers, each with residual connections, trained for 50 epochs on flattened CIFAR-10 images (Krizhevsky & Hinton, 2009). The base transformer is decoder-only, has two self-attention + feedforward blocks (Vaswani et al., 2017), and is trained on a subset of the Wikitext-103 dataset (Merity et al., 2016). The datasets used were 4M token subsets of Cold French Law (Harvard Library Innovation Lab, 2024) and MathPile (Wang et al., 2023). |
| Dataset Splits | No | The paper describes training on CIFAR-10 for 50 epochs; on a subset of Wikitext-103 with a batch size of 20 and sequence length of 256 for 1 epoch; and on 4M token subsets of Cold French Law and MathPile for 500 iterations with batch size 8 and sequence length 512. While these describe training methodology, they do not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined evaluation splits). |
| Hardware Specification | No | This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol http://www.bris.ac.uk/acrc/. We would like to thank Dr. Stewart for GPU compute resources. No specific GPU or CPU models, memory details, or cloud instance specifications are provided. |
| Software Dependencies | No | FLeRM can be applied to any existing neural network in PyTorch. We tokenise the dataset using the GPT-2 tokeniser from the Hugging Face Transformers library (Wolf et al., 2020). The LoRA adapters were initialized using the Gaussian initialization provided by Hugging Face's PEFT library (Mangrulkar et al., 2022). While software like PyTorch, Hugging Face Transformers, and PEFT are mentioned, no specific version numbers are provided for any of them. |
| Experiment Setup | Yes | Both ResMLP and the transformers used the Adam optimiser (Kingma, 2014) with a constant learning rate schedule. We initialise all weight matrices using Kaiming / He initialisation (He et al., 2015)... Biases are initialised to 0. We train for 50 epochs on the CIFAR-10 dataset... using a batch size of 20 and a sequence length of 256. We trained for 500 iterations with a batch size of 8 and sequence length 512. When sweeping for B, we used a fixed learning rate of 10⁻⁴ for A. When sweeping for A, we fixed the learning rate of B as follows: 10⁻³ for GPT-2, 10⁻⁴ for Llama-3.2-1B / Cold French Law, and 5×10⁻⁵ for Llama-3.2-1B / MathPile. |
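To make the Algorithm 1 row above concrete, here is a minimal, hypothetical sketch of the *idea* of recording and then setting function-space learning rates: measure the normalised RMS change each layer's update causes in that layer's output, then rescale the parameter-space learning rate so this function-space movement hits a target. The toy two-layer model, the random stand-in "gradients", and all names below are illustrative assumptions, not the paper's actual FLeRM implementation (which works inside a PyTorch training loop).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))          # a fixed batch of inputs
W1 = rng.normal(size=(16, 16)) / 4.0   # layer 1 weights
W2 = rng.normal(size=(16, 1)) / 4.0    # layer 2 (readout) weights

def rms(a):
    return np.sqrt(np.mean(a ** 2))

# Random directions stand in for real backprop gradients in this sketch.
G1 = rng.normal(size=W1.shape)
G2 = rng.normal(size=W2.shape)

def record_fs_lr_layer1(eta1):
    """Normalised RMS change in layer 1's output from updating W1 alone."""
    h_old = np.tanh(X @ W1)
    h_new = np.tanh(X @ (W1 - eta1 * G1))
    return rms(h_new - h_old) / rms(h_old)

def record_fs_lr_layer2(eta2):
    """Normalised RMS change in the network output from updating W2 alone."""
    h = np.tanh(X @ W1)
    return rms(h @ (eta2 * G2)) / rms(h @ W2)

# "Recording" step: measure function-space learning rates at the current
# parameter-space learning rates.
eta1 = eta2 = 1e-2
fs1 = record_fs_lr_layer1(eta1)
fs2 = record_fs_lr_layer2(eta2)

# "Setting" step: rescale each parameter-space learning rate so the recorded
# function-space learning rate matches a shared target (exact for the linear
# readout, first-order accurate through the tanh nonlinearity).
target = 0.01
fs1_new = record_fs_lr_layer1(eta1 * target / fs1)
fs2_new = record_fs_lr_layer2(eta2 * target / fs2)
print(fs1, fs2, fs1_new, fs2_new)
```

The design point this illustrates: two layers with identical parameter-space learning rates can move the function by very different amounts, which is why rescaling is done per layer against a common function-space target.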