Linear Weight Interpolation Leads to Transient Performance Gains
Authors: Gaurav Iyer, Gintare Karolina Dziugaite, David Rolnick
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train copies of a neural network on different sets of SGD noise and find that linearly interpolating their weights can, remarkably, produce networks that perform significantly better than the original networks. However, such interpolated networks consistently end up in unfavorable regions of the optimization landscape: with further training, their performance fails to improve or degrades, effectively undoing the performance gained from the interpolation. We identify two quantities that impact an interpolated network's performance and relate our observations to linear mode connectivity. Finally, we investigate this phenomenon from the lens of example importance and find that performance improves and degrades almost exclusively on the harder subsets of the training data, while performance is stable on the easier subsets. Our work represents a step towards a better understanding of neural network loss landscapes and weight interpolation in deep learning. |
| Researcher Affiliation | Collaboration | Gaurav Iyer (EMAIL), McGill University, Mila – Quebec AI Institute; Gintare Karolina Dziugaite (EMAIL), Google DeepMind, McGill University, Mila – Quebec AI Institute; David Rolnick (EMAIL), McGill University, Mila – Quebec AI Institute |
| Pseudocode | No | The paper describes experimental methods and presents results through figures and prose. It does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We will show the results of our experiment on CIFAR-10 (Krizhevsky, 2009) with ResNet20 networks (He et al., 2016) here (specifically Fig. 2 in this context); additional results on different datasets and network architectures can be found in Appendix E and onward. Figure 1: We train two copies of a network on different sets of SGD noise, then average their weights and continue training on the resulting network. We find that test accuracy shoots up upon interpolation, but then precipitously drops and improvement stalls. The network initialization was trained for k = 10 epochs before being cloned into child networks A and B. A and B were then trained on different SGD trajectories for s = 10 epochs, before being averaged and trained further for 10 epochs. Figure 18: Progression of test performance of networks resulting from linear weight interpolation of child networks on CINIC-10 (Darlow et al., 2018) with a ResNet20. |
| Dataset Splits | Yes | In Fig. 5, we track the performance of the child networks and the averaged network across 4 equal splits of CIFAR-10's training set. The data is split according to the EL2N scores of the training examples computed at epoch 10 of training; each split consists of 12500 examples, with Split 1 and Split 4 containing examples with the lowest and highest EL2N scores respectively. |
| Hardware Specification | No | The authors acknowledge material support from NVIDIA and Intel in the form of computational resources and are grateful for technical support from the Mila IDT team in maintaining the Mila Compute Cluster. This statement is too general and does not provide specific hardware models or configurations. |
| Software Dependencies | No | The paper discusses training neural networks, implying the use of deep learning frameworks, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | A Experimental Setup: For all experiments, networks were trained with a batch size of 128 for a total of 160 epochs. A stepwise learning rate schedule was employed with an initial learning rate of 0.1; the learning rate is reduced by a factor of 10 at epoch 80 and epoch 120. When linear warmup is employed over t iterations, the learning rate is (i/t)·l at iteration i, where l is the initial learning rate. All datasets were also augmented with random crops to a size of 32x32 pixels after padding 4 pixels to each border in the original image, with a constant value of 0, and horizontal image flips with a probability of 0.5. |
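The core operation the paper studies, averaging the weights of two child networks trained on different SGD noise, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' code (which is not released): the paper uses ResNet20 on CIFAR-10, and `alpha=0.5` corresponds to the simple weight averaging described in Figure 1.

```python
import copy

import torch
import torch.nn as nn


def interpolate_weights(model_a, model_b, alpha=0.5):
    """Return a new model whose weights are (1 - alpha) * A + alpha * B.

    With alpha=0.5 this is the plain weight averaging of two child
    networks described in the paper; both models must share an
    architecture so their state dicts have identical keys and shapes.
    """
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict(
        {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    )
    return merged
```

Varying `alpha` over [0, 1] also gives the full linear interpolation path used in linear-mode-connectivity analyses.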
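The schedule quoted from Appendix A (initial rate 0.1, divided by 10 at epochs 80 and 120, with optional linear warmup giving rate (i/t)·l at iteration i) can be written as a small function. The epoch/iteration bookkeeping below is an assumption for illustration; the paper only specifies the rates themselves.

```python
def learning_rate(epoch, iteration=0, warmup_iters=0, base_lr=0.1):
    """Stepwise LR schedule with optional linear warmup.

    During warmup, the rate at iteration i is (i / t) * l, where t is
    warmup_iters and l is base_lr. Afterwards, the rate is base_lr,
    reduced by a factor of 10 at epoch 80 and again at epoch 120.
    """
    if warmup_iters and iteration < warmup_iters:
        return (iteration / warmup_iters) * base_lr
    lr = base_lr
    if epoch >= 80:
        lr /= 10
    if epoch >= 120:
        lr /= 10
    return lr
```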
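The augmentation described in Appendix A (pad each border with 4 zero pixels, take a random 32x32 crop, flip horizontally with probability 0.5) can be sketched in NumPy. The paper does not name a framework, so this is an illustrative reimplementation assuming HxWxC image arrays.

```python
import numpy as np


def augment(img, pad=4, crop=32, rng=None):
    """Random padded crop plus horizontal flip, per the paper's setup."""
    rng = rng or np.random.default_rng()
    # Pad 4 pixels of zeros on each spatial border (constant value 0).
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)),
                    mode="constant", constant_values=0)
    # Take a random 32x32 crop from the padded image.
    y = rng.integers(0, padded.shape[0] - crop + 1)
    x = rng.integers(0, padded.shape[1] - crop + 1)
    out = padded[y:y + crop, x:x + crop]
    # Flip horizontally with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out
```

In a torchvision-based pipeline the equivalent would be `RandomCrop(32, padding=4)` followed by `RandomHorizontalFlip(p=0.5)`.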
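The Fig. 5 analysis splits CIFAR-10's training set into four equal groups of 12500 examples ranked by EL2N score (the L2 norm of the error between predicted probabilities and the one-hot label, computed here at epoch 10). A minimal sketch of that split, assuming a precomputed per-example score array:

```python
import numpy as np


def split_by_el2n(scores, n_splits=4):
    """Partition example indices into equal quartiles by EL2N score.

    Split 1 (index 0) holds the lowest-scoring, easiest examples;
    the last split holds the highest-scoring, hardest examples.
    """
    order = np.argsort(scores)  # easiest -> hardest
    return np.array_split(order, n_splits)
```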