Three Mechanisms of Feature Learning in a Linear Network

Authors: Yizhou Xu, Liu Ziyin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our theoretical findings are substantiated with empirical evidence showing that these mechanisms also manifest in deep nonlinear networks on real-world tasks, improving our understanding of neural network training dynamics and guiding the design of more effective learning strategies. We empirically validate the three mechanisms of feature learning and our phase diagrams in realistic nonlinear networks. Figure 1 shows the evolution of ζ for two-layer networks with d = 10000, trained on a regression task; similar results are observed for a classification task trained with the cross-entropy loss (Appendix C). Figure 2 shows a four-layer fully connected network (FCN) with ReLU activation and different initialization scales trained on the MNIST dataset. A numerical result is presented in Figure 3, where a larger initialization leads to worse performance. We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation.
Researcher Affiliation Collaboration Yizhou Xu1, Liu Ziyin2,3 1Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne 2Research Laboratory of Electronics, Massachusetts Institute of Technology 3Physics & Informatics Laboratories, NTT Research
Pseudocode No The paper describes theoretical models and mathematical derivations (e.g., Theorem 1, Proposition 1) but does not present any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories.
Open Datasets Yes See Figure 2 for a four-layer fully connected network (FCN) with ReLU activation and different initialization scales trained on the MNIST dataset. Figure 3: The initialization scale σ correlates negatively with the performance of Resnet-18 on the CIFAR-10 dataset. We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation.
Dataset Splits Yes We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation... The cross-entropy loss and stochastic gradient descent without momentum or weight decay are used during training. We use a batch size of 128 and report the best training and test accuracy among all epochs.
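The training recipe quoted above (plain SGD with neither momentum nor weight decay, cross-entropy loss, batch size 128, best accuracy tracked over epochs) can be sketched as follows. This is a minimal illustration, not the paper's code: the random stand-in data, model widths, learning rate, and epoch count are all assumptions.

```python
# Minimal sketch of the quoted setup: a two-layer FCN with ReLU trained
# by plain SGD (no momentum, no weight decay) with cross-entropy loss,
# batch size 128, reporting the best accuracy among all epochs.
# Random data stands in for CIFAR-10; sizes and lr are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(512, 3 * 32 * 32)        # stand-in for CIFAR-10 images
y_train = torch.randint(0, 10, (512,))          # 10 classes

model = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                      nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05,
                      momentum=0.0, weight_decay=0.0)

best_acc = 0.0
for epoch in range(3):
    for i in range(0, len(X_train), 128):       # batch size 128
        xb, yb = X_train[i:i + 128], y_train[i:i + 128]
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    with torch.no_grad():
        acc = (model(X_train).argmax(1) == y_train).float().mean().item()
    best_acc = max(best_acc, acc)               # best accuracy among epochs
print(f"best train accuracy: {best_acc:.3f}")
```

Swapping the random tensors for `torchvision.datasets.CIFAR10` loaders recovers the actual experiment shape.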
Hardware Specification No The paper mentions training models like 'Resnet18 network' and 'two-layer FCN' but does not specify any hardware details such as GPU/CPU models, memory, or cloud computing resources used for these experiments.
Software Dependencies No The paper mentions hyperparameters borrowed from a GitHub repository for a Resnet18 network ('https://github.com/kuangliu/pytorch-cifar') but does not specify exact version numbers for any software components like Python, PyTorch, or specific libraries.
Experiment Setup Yes The learning rates are chosen separately for each model such that the model converges well in 1000 iterations, with training accuracy above 95%. All models use the standard Kaiming initialization, but we scale each layer by σ. In Figure 3, we train a Resnet18 network on the CIFAR-10 dataset with hyperparameters borrowed from https://github.com/kuangliu/pytorch-cifar. We use a batch size of 128 and report the best training and test accuracy among all epochs. We choose γ = 1/d and η = 0.05 for the standard NTK model; γ = 10/d and η = 0.05d/100 for the standard mean-field model; γ = 1 and η = 0.05d/100 for the Kaiming model; γ = 100/d and η = 0.05d/100 for the Kaiming+ model; γ = 1 and η = 0.05 for the Xavier+ model; and γ = 0.01d and η = 0.05(100/d)^2 for the Xavier model.
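The initialization scheme quoted above, standard Kaiming initialization with every layer rescaled by σ, can be sketched as below. The helper name `init_scaled_kaiming` and the network shape are hypothetical; only the "Kaiming, then scale by σ" recipe comes from the paper's description.

```python
# Hypothetical sketch: standard Kaiming initialization with each layer's
# weights multiplied by a scale factor sigma, as the quoted setup describes.
import torch
import torch.nn as nn

def init_scaled_kaiming(model, sigma):
    """Kaiming-initialize every Linear layer, then rescale weights by sigma."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)
            with torch.no_grad():
                m.weight.mul_(sigma)
                if m.bias is not None:
                    m.bias.zero_()

torch.manual_seed(0)
# Example four-layer FCN with ReLU (widths are illustrative).
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 10))
init_scaled_kaiming(net, sigma=0.5)
w_std = net[0].weight.std().item()
print(f"first-layer weight std after scaling: {w_std:.4f}")
```

Varying `sigma` over a grid (e.g. 0.25 to 4) reproduces the kind of initialization-scale sweep shown in Figures 2 and 3.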