Linear Weight Interpolation Leads to Transient Performance Gains
Authors: Gaurav Iyer, Gintare Karolina Dziugaite, David Rolnick
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train copies of a neural network on different sets of SGD noise and find that linearly interpolating their weights can, remarkably, produce networks that perform significantly better than the original networks. However, such interpolated networks consistently end up in unfavorable regions of the optimization landscape: with further training, their performance fails to improve or degrades, effectively undoing the performance gained from the interpolation. We identify two quantities that impact an interpolated network's performance and relate our observations to linear mode connectivity. Finally, we investigate this phenomenon from the lens of example importance and find that performance improves and degrades almost exclusively on the harder subsets of the training data, while performance is stable on the easier subsets. Our work represents a step towards a better understanding of neural network loss landscapes and weight interpolation in deep learning. |
| Researcher Affiliation | Collaboration | Gaurav Iyer (EMAIL), McGill University, Mila – Quebec AI Institute; Gintare Karolina Dziugaite (EMAIL), Google DeepMind, McGill University, Mila – Quebec AI Institute; David Rolnick (EMAIL), McGill University, Mila – Quebec AI Institute |
| Pseudocode | No | The paper describes experimental methods and presents results through figures and prose. It does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We will show the results of our experiment on CIFAR-10 (Krizhevsky, 2009) with ResNet20 networks (He et al., 2016) here (specifically Fig. 2 in this context); additional results on different datasets and network architectures can be found in Appendix E and onward. Figure 1: We train two copies of a network on different sets of SGD noise, then average their weights and continue training on the resulting network. We find that test accuracy shoots up upon interpolation, but then precipitously drops and improvement stalls. The network initialization was trained for k = 10 epochs before being cloned into child networks A and B. A and B were then trained on different SGD trajectories for s = 10 epochs, before being averaged and trained further for 10 epochs. Figure 18: Progression of test performance of networks resulting from linear weight interpolation of child networks on CINIC-10 (Darlow et al., 2018) with a ResNet20. |
| Dataset Splits | Yes | In Fig. 5, we track the performance of the child networks and the averaged network across 4 equal splits of CIFAR-10's training set. The data is split according to the EL2N scores of the training examples computed at epoch 10 of training; each split consists of 12500 examples, with Split 1 and Split 4 containing examples with the lowest and highest EL2N scores respectively. |
| Hardware Specification | No | The authors acknowledge material support from NVIDIA and Intel in the form of computational resources and are grateful for technical support from the Mila IDT team in maintaining the Mila Compute Cluster. This statement is too general and does not provide specific hardware models or configurations. |
| Software Dependencies | No | The paper discusses training neural networks, implying the use of deep learning frameworks, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | A Experimental Setup: For all experiments, networks were trained with a batch size of 128 for a total of 160 epochs. A stepwise learning rate schedule was employed with an initial learning rate of 0.1; the learning rate is reduced by a factor of 10 at epoch 80 and epoch 120. When linear warmup is employed over t iterations, the learning rate is (i/t)·l at iteration i, where l is the initial learning rate. All datasets were also augmented with random crops to a size of 32x32 pixels after padding 4 pixels to each border in the original image, with a constant value of 0, and horizontal image flips with a probability of 0.5. |
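The core operation the paper studies, averaging the weights of two child networks trained on different SGD noise, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' code (which is not released): the paper uses ResNet20 on CIFAR-10, and `alpha=0.5` corresponds to the simple weight averaging described in Figure 1.

```python
import copy

import torch
import torch.nn as nn


def interpolate_weights(model_a, model_b, alpha=0.5):
    """Return a new model whose weights are (1 - alpha) * A + alpha * B.

    With alpha=0.5 this is the plain weight averaging of two child
    networks described in the paper; both models must share an
    architecture so their state dicts have identical keys and shapes.
    """
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict(
        {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    )
    return merged
```

Varying `alpha` over [0, 1] also gives the full linear interpolation path used in linear-mode-connectivity analyses.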
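The schedule quoted from Appendix A (initial rate 0.1, divided by 10 at epochs 80 and 120, with optional linear warmup giving rate (i/t)·l at iteration i) can be written as a small function. The epoch/iteration bookkeeping below is an assumption for illustration; the paper only specifies the rates themselves.

```python
def learning_rate(epoch, iteration=0, warmup_iters=0, base_lr=0.1):
    """Stepwise LR schedule with optional linear warmup.

    During warmup, the rate at iteration i is (i / t) * l, where t is
    warmup_iters and l is base_lr. Afterwards, the rate is base_lr,
    reduced by a factor of 10 at epoch 80 and again at epoch 120.
    """
    if warmup_iters and iteration < warmup_iters:
        return (iteration / warmup_iters) * base_lr
    lr = base_lr
    if epoch >= 80:
        lr /= 10
    if epoch >= 120:
        lr /= 10
    return lr
```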
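The augmentation described in Appendix A (pad each border with 4 zero pixels, take a random 32x32 crop, flip horizontally with probability 0.5) can be sketched in NumPy. The paper does not name a framework, so this is an illustrative reimplementation assuming HxWxC image arrays.

```python
import numpy as np


def augment(img, pad=4, crop=32, rng=None):
    """Random padded crop plus horizontal flip, per the paper's setup."""
    rng = rng or np.random.default_rng()
    # Pad 4 pixels of zeros on each spatial border (constant value 0).
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)),
                    mode="constant", constant_values=0)
    # Take a random 32x32 crop from the padded image.
    y = rng.integers(0, padded.shape[0] - crop + 1)
    x = rng.integers(0, padded.shape[1] - crop + 1)
    out = padded[y:y + crop, x:x + crop]
    # Flip horizontally with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out
```

In a torchvision-based pipeline the equivalent would be `RandomCrop(32, padding=4)` followed by `RandomHorizontalFlip(p=0.5)`.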
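The Fig. 5 analysis splits CIFAR-10's training set into four equal groups of 12500 examples ranked by EL2N score (the L2 norm of the error between predicted probabilities and the one-hot label, computed here at epoch 10). A minimal sketch of that split, assuming a precomputed per-example score array:

```python
import numpy as np


def split_by_el2n(scores, n_splits=4):
    """Partition example indices into equal quartiles by EL2N score.

    Split 1 (index 0) holds the lowest-scoring, easiest examples;
    the last split holds the highest-scoring, hardest examples.
    """
    order = np.argsort(scores)  # easiest -> hardest
    return np.array_split(order, n_splits)
```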