A Theory of Initialisation's Impact on Specialisation

Authors: Devon Jarvis, Sebastian Lee, Clementine Domine, Andrew Saxe, Stefano Sarao Mannelli

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition... We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks... To empirically support the linear network theory, we extend the results on imbalanced initialisation and apply them, beyond the limited setting of our framework, in the context of disentangled representation learning... Specifically, we implement a β-VAE model, employing the Deep Gaussian Linear architecture for the decoder and the Deep Linear architecture for the encoder... Results are shown in Fig. 4... In Fig. 12, we show forgetting profiles for three different initialisation schemes (analogous to those shown in Fig. 6) for the continual MNIST task described above. We conduct the following experiment in two phases: Phase 1: We train a standard VAE (similar to Sec. 3.2) on MNIST... The results of this experiment are shown in Fig. 13.
Researcher Affiliation | Academia | (1) School of Computer Science and Applied Mathematics, University of the Witwatersrand; (2) Center for Computational Neuroscience, Flatiron Institute, Simons Foundation; (3) Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre, UCL; (4) Data Science and AI, Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg; (5) Machine Intelligence and Neural Discovery Institute, University of the Witwatersrand; (6) CIFAR Azrieli Global Scholar, CIFAR
Pseudocode | Yes | Algorithm 1: an algorithm for constructing Fig. 3d. Hyperparameters used: S = 10^5, Λ1 = [0, 100], Λ2 = [0, 20], ε̂ = 5.0, ε = 1.0, η = 1e-5
Open Source Code | No | The paper mentions using open-source frameworks and implementations (Locatello et al. (2019); Abdi et al. (2019)) but does not explicitly state that the authors are releasing their own code or provide a link to a repository for the specific methodology described in this paper.
Open Datasets | Yes | To empirically support the linear network theory, we extend the results on imbalanced initialisation and apply them...on the 3DShapes dataset (Burgess & Kim, 2018)... In Appendix G, we complement these results with experiments on a task constructed around MNIST... We train a standard VAE (similar to Sec. 3.2) on MNIST
Dataset Splits | Yes | DCI Disentanglement: Eastwood & Williams (2018) define three key properties of learned representations... In this implementation, we sample 10,000 training and 5,000 test points
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components like the Adam optimiser and Scikit-learn but does not specify their version numbers, which is required for reproducible software dependencies.
Experiment Setup | Yes | The model is trained using the Adam optimiser, optimising a loss function that combines KL divergence and binary cross-entropy-based reconstruction loss. Additional details are given in Appendix D. In these experiments, we adjust the variance of the weights in a deep fully-connected encoder by varying the constant gain of the Xavier initialisation (Glorot & Bengio, 2010). Specifically, the first block of layers was initialised with gain g while the readout layer received a gain 1/g... We run the experiments for 20 epochs and 157,499 iterations... Params: N = 10000, η = 1, p1 = 1, p2 = 2, σ_w = 0.001... The encoder weights are initialised by sampling from a Gaussian with standard deviation σ = 0.001 · (1/υ). The decoder weights are sampled from a Gaussian with standard deviation σ = 0.001 · υ.