A Theory of Initialisation's Impact on Specialisation

Authors: Devon Jarvis, Sebastian Lee, Clementine Domine, Andrew Saxe, Stefano Sarao Mannelli

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition... We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks... To empirically support the linear network theory, we extend the results on imbalanced initialisation and apply them, beyond the limited setting of our framework, in the context of disentangled representation learning... Specifically, we implement a β-VAE model, employing the Deep Gaussian Linear architecture for the decoder and the Deep Linear architecture for the encoder... Results are shown in Fig. 4... In Fig. 12, we show forgetting profiles for three different initialisation schemes (analogous to those shown in Fig. 6) for the continual MNIST task described above. We conduct the following experiment in two phases: Phase 1: We train a standard VAE (similar to Sec. 3.2) on MNIST... The results of this experiment are shown in Fig. 13.
Researcher Affiliation | Academia | (1) School of Computer Science and Applied Mathematics, University of the Witwatersrand; (2) Center for Computational Neuroscience, Flatiron Institute, Simons Foundation; (3) Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre, UCL; (4) Data Science and AI, Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg; (5) Machine Intelligence and Neural Discovery Institute, University of the Witwatersrand; (6) CIFAR Azrieli Global Scholar, CIFAR
Pseudocode | Yes | Algorithm 1: an algorithm for constructing Fig. 3d. Hyperparameters used: S = 10^5, Λ1 = [0, 100], Λ2 = [0, 20], ε̂ = 5.0, ε = 1.0, η = 1e-5
Open Source Code | No | The paper mentions using open-source frameworks and implementations (Locatello et al. (2019); Abdi et al. (2019)) but does not explicitly state that the authors are releasing their own code or provide a link to a repository for the specific methodology described in this paper.
Open Datasets | Yes | To empirically support the linear network theory, we extend the results on imbalanced initialisation and apply them...on the 3DShapes dataset (Burgess & Kim, 2018)... In Appendix G, we complement these results with experiments on a task constructed around MNIST... We train a standard VAE (similar to Sec. 3.2) on MNIST
Dataset Splits | Yes | DCI Disentanglement: Eastwood & Williams (2018) define three key properties of learned representations... In this implementation, we sample 10,000 training and 5,000 test points
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components like the Adam optimiser and Scikit-learn but does not specify their version numbers, which is required for reproducible software dependencies.
Experiment Setup | Yes | The model is trained using the Adam optimiser, optimising a loss function that combines KL divergence and binary cross-entropy-based reconstruction loss. Additional details are given in Appendix D. In these experiments, we adjust the variance of the weights in a deep fully-connected encoder by varying the constant gain of the Xavier initialisation (Glorot & Bengio, 2010). Specifically, the first block of layers was initialised with gain g while the readout layer received a gain 1/g... We run the experiments for 20 epochs and 157,499 iterations... Params: N = 10000, η = 1, p1 = 1, p2 = 2, σ_w = 0.001... The encoder weights are initialised by sampling from a Gaussian with standard deviation σ = 0.001 · (1/υ). The decoder weights are sampled from a Gaussian with standard deviation σ = 0.001 · υ.