Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse
Authors: Arthur Jacot, Peter Súkeník, Zihan Wang, Marco Mondelli
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our numerical experiments on various architectures (fully connected, ResNet) and datasets (MNIST, CIFAR) confirm the insights coming from the theory: (i) NC2 is more prominent as the depth of the linear head increases, and (ii) the final linear layers are balanced at convergence. Furthermore, we show that, as the non-linear part of the network gets deeper, the non-negative layers become less non-linear and more balanced. |
| Researcher Affiliation | Academia | Courant Institute of Mathematical Sciences, NYU; Institute of Science and Technology Austria. |
| Pseudocode | No | The paper describes methods mathematically and textually but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | In all experiments, we consider MSE loss and standard weight decay regularization. We train an MLP and a ResNet20 with an added MLP head on standard datasets (MNIST, CIFAR10), considering as backbone the first two layers for the MLP and the whole architecture before the linear head for the ResNet. |
| Dataset Splits | No | The paper mentions using "standard datasets (MNIST, CIFAR10)" but does not specify the exact training, validation, or test splits used for these datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | Yes | We use weight decay of 0.001 and learning rate of 0.001, training for 5000 epochs (the learning rate drops ten-fold after 80% of the epochs in all our experiments). ... We average over 5 runs for each weight decay value (0.001, 0.004), with a learning rate of 0.001. |
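The learning-rate schedule described in the experiment setup (a fixed base rate with a ten-fold drop after 80% of the epochs) can be sketched as follows. This is a hypothetical helper, not code from the paper, which releases no source; the constants are taken from the values quoted above.

```python
# Hedged sketch of the training schedule described in the paper's setup:
# weight decay 0.001, base learning rate 0.001, 5000 epochs, with the
# learning rate dropping ten-fold after 80% of the epochs.
# `learning_rate` is a hypothetical helper name, not from the paper.

TOTAL_EPOCHS = 5000
BASE_LR = 1e-3
WEIGHT_DECAY = 1e-3  # also 0.004 in one reported sweep


def learning_rate(epoch: int,
                  total_epochs: int = TOTAL_EPOCHS,
                  base_lr: float = BASE_LR) -> float:
    """Return the learning rate at a given epoch: base rate for the
    first 80% of training, then a ten-fold drop."""
    if epoch < 0.8 * total_epochs:
        return base_lr
    return base_lr / 10.0


print(learning_rate(0))     # 0.001
print(learning_rate(3999))  # 0.001 (last epoch before the drop)
print(learning_rate(4000))  # 0.0001
```

With 5000 epochs, the drop occurs at epoch 4000; the weight decay constant would be passed to the optimizer (e.g. SGD or Adam) alongside this schedule.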