BP($\mathbf{\lambda}$): Online Learning via Synthetic Gradients
Authors: Joseph Oliver Pemberton, Rui Ponte Costa
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Next, we tested empirically the ability of accumulate BP($\lambda$), which we henceforth simply call BP($\lambda$), to produce good synthetic gradients that can drive effective RNN parameter updates. In all experiments we take the synthesiser computation g simply as a linear function of the RNN hidden state, $g(h_t; \theta) = \theta h_t$. We first analyse the alignment of BP($\lambda$)-derived synthetic gradients with the true gradients derived by full BPTT, which we use to quantify the bias in synthesiser predictions. For this we consider a toy task in which a fixed (randomly connected) linear RNN receives a static input $x_1$ at timestep 1 and null input onwards, $x_t = 0$ for $t > 1$. To test the ability of BP($\lambda$) to transfer error information across time, the error is only defined at the last timestep as $\mathcal{L}_T$, where $\mathcal{L}_T$ is the mean-squared error (MSE) between a two-dimensional target $y_T$ and a linear readout of the final hidden activity $h_T$. |
| Researcher Affiliation | Academia | Joseph Pemberton (EMAIL): Computational Neuroscience Unit, Faculty of Engineering, University of Bristol, United Kingdom; Centre for Neural Circuits and Behaviour, Department of Physiology, Anatomy and Genetics, Medical Sciences Division, University of Oxford, United Kingdom. Rui Ponte Costa (EMAIL): Centre for Neural Circuits and Behaviour, Department of Physiology, Anatomy and Genetics, Medical Sciences Division, University of Oxford, United Kingdom; Computational Neuroscience Unit, Faculty of Engineering, University of Bristol, United Kingdom. |
| Pseudocode | Yes | Algorithm 1 RNN learning with accumulate BP($\lambda$). Updates RNN parameters using estimated gradients provided by synthesiser function g. Input: $\Psi_0$, $\theta_0$, $\{(x_t, y_t)\}_{t=1}^T$, $\eta$, $\alpha$, $\gamma$, $\lambda$. $\Psi \leftarrow \Psi_0$ {init. RNN parameters}; $\theta \leftarrow \theta_0$ {init. synthesiser parameters}; $h, \bar{h}, e \leftarrow 0$ {init. RNN state, Jacobian, and elig. trace}. for $t = 1$ to $T$ do: $e \leftarrow \gamma\lambda \bar{h} e + \partial g(h;\theta)/\partial\theta$ {update eligibility trace}; $h' \leftarrow f(x_t, h; \Psi)$, $L \leftarrow \mathcal{L}(h', y_t)$ {compute next hidden state and task loss}; $\bar{h} \leftarrow \partial h'/\partial h$, $\partial h'/\partial \Psi$ {compute local gradients}; $\delta \leftarrow [\partial L/\partial h' + \gamma g(h'; \theta)]\bar{h} - g(h; \theta)$ {compute synthesiser TD error}; $\theta \leftarrow \theta + \alpha \delta e$ {update synthesiser parameters}; $\Psi \leftarrow \Psi - \eta[\partial L/\partial h' + g(h'; \theta)]\,\partial h'/\partial\Psi$ {update RNN parameters}; $h \leftarrow h'$ {update RNN hidden state}; end for |
| Open Source Code | Yes | Code used for the experiments can be found on the Github page: https://github.com/neuralml/bp_lambda. |
| Open Datasets | Yes | To test the ability of BP($\lambda$) to generalise to non-trivial tasks, we now consider the sequential MNIST task (Le et al., 2015). In this task the RNN is provided with a row-by-row representation of an MNIST image which it must classify at the end (Fig. 4a). ... For the sequential-MNIST task we present a given MNIST image to the model row by row (Deng, 2012; Le et al., 2015). |
| Dataset Splits | No | The paper mentions using a validation set for model selection in the sequential MNIST task (During training, the models with the lowest validation score over 50 epochs are selected to produce the final test error.) but does not provide specific percentages or counts for training, validation, or test splits. For the toy and copy-repeat tasks, it defines criteria for a 'solved' sequence length or curriculum rather than dataset splits. |
| Hardware Specification | Yes | The toy task experiments used to analyse gradient alignment were conducted with an Intel i7-8665U CPU, where each run with a particular seed took approximately one minute. The sequential MNIST and copy-repeat tasks were conducted on NVIDIA GeForce RTX 2080 Ti GPUs. Each run in the sequential MNIST task took approximately 3 hours or less (depending on the model used); each run in the copy-repeat task took approximately 12 hours or less. |
| Software Dependencies | No | All experiments are run using the PyTorch library. ... We use an Adam optimiser for gradient descent on the model parameters (Kingma & Ba, 2014). |
| Experiment Setup | Yes | For this task we provide the model with inputs in batches of size 10, where 1 epoch involves 100 batches. The number of RNN units is 30 and the initial learning rate for the synthesiser is set as $1 \times 10^{-4}$ and $1 \times 10^{-3}$ for the fixed and plastic RNN cases respectively. ... For this task we provide the model with inputs in batches of size 50 with an initial learning rate of $3 \times 10^{-4}$. The number of hidden LSTM units is 30. For BP($\lambda$) the synthetic gradient is scaled by a factor of 0.1. ... For this task we provide the model with inputs in batches of size 100. For the RNN and readout parameters we use an initial learning rate of $1 \times 10^{-3}$, whilst we find a smaller learning rate of $1 \times 10^{-5}$ for the synthesiser parameters necessary for stable learning. The number of hidden LSTM units is 100. |
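To make the quoted algorithm concrete, below is a minimal NumPy sketch of accumulate BP($\lambda$) on the toy task described in the Research Type row: a fixed-input linear RNN with a loss only at the final timestep, and a linear synthesiser $g(h; \theta) = \theta h$. The network sizes, learning rates, the outer-product form of the eligibility trace, and the choice to zero the bootstrap term at the final timestep are illustrative assumptions, not the paper's exact configuration; for a linear RNN the Jacobian $\partial h'/\partial h$ is simply the recurrent matrix W.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, T = 8, 4, 2, 10          # sizes chosen for illustration
gamma, lam, alpha, eta = 1.0, 1.0, 0.05, 0.01

W = rng.normal(scale=0.3, size=(n, n))      # recurrent weights (part of Psi)
U = rng.normal(scale=0.3, size=(n, d_in))   # input weights (kept fixed here)
R = rng.normal(scale=0.3, size=(d_out, n))  # fixed linear readout
theta = np.zeros((n, n))                    # synthesiser parameters
x1 = rng.normal(size=d_in)                  # static input at t = 1
y_T = rng.normal(size=d_out)                # target at the final timestep

losses = []
for episode in range(200):
    h = np.zeros(n)
    # eligibility trace: one (n x n) parameter slice per output of g
    e = np.zeros((n, n, n))
    for t in range(1, T + 1):
        x = x1 if t == 1 else np.zeros(d_in)
        # e <- gamma*lam*(dh'/dh) e + dg/dtheta; for g = theta h,
        # dg_i/dtheta_jk = delta_ij * h_k, and dh'/dh = W (linear RNN)
        e = gamma * lam * np.einsum('ab,bjk->ajk', W, e) \
            + np.einsum('ij,k->ijk', np.eye(n), h)
        h_next = W @ h + U @ x
        if t == T:
            err = R @ h_next - y_T
            loss = 0.5 * float(err @ err)   # loss only at the last step
            dLdh = R.T @ err
            boot = np.zeros(n)              # no gradient arrives beyond T
        else:
            dLdh = np.zeros(n)
            boot = gamma * theta @ h_next   # bootstrap from own prediction
        # TD error: delta = [dL/dh' + gamma*g(h')] dh'/dh - g(h)
        delta = W.T @ (dLdh + boot) - theta @ h
        theta = theta + alpha * np.einsum('i,ijk->jk', delta, e)
        # RNN update with synthetic-gradient-augmented error signal
        c = dLdh + theta @ h_next
        W = W - eta * np.outer(c, h)
        h = h_next
    losses.append(loss)
```

With $\gamma = \lambda = 1$ and no bootstrap at the final step, the synthesiser's backward-view updates correspond to regressing the true backpropagated gradient onto the hidden state, so the task loss should fall over episodes even though no gradient is ever propagated backwards through time explicitly.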