Three Mechanisms of Feature Learning in a Linear Network
Authors: Yizhou Xu, Liu Ziyin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical findings are substantiated with empirical evidence showing that these mechanisms also manifest in deep nonlinear networks handling real-world tasks, enhancing our understanding of neural network training dynamics and guiding the design of more effective learning strategies. We empirically validate the three mechanisms of feature learning and our phase diagrams in realistic nonlinear networks. See Figure 1, where we show the evolution of ζ for two-layer networks with d = 10000 trained on a regression task. Similar experimental results are observed for a classification task trained with the cross-entropy loss (Appendix C). See Figure 2 for a four-layer fully connected network (FCN) with ReLU activation and different initialization scales trained on the MNIST dataset. A numerical result is presented in Figure 3, where a larger initialization leads to worse performance. We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation. |
| Researcher Affiliation | Collaboration | Yizhou Xu1, Liu Ziyin2,3 1Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne 2Research Laboratory of Electronics, Massachusetts Institute of Technology 3Physics & Informatics Laboratories, NTT Research |
| Pseudocode | No | The paper describes theoretical models and mathematical derivations (e.g., Theorem 1, Proposition 1) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories. |
| Open Datasets | Yes | See Figure 2 for a four-layer fully connected network (FCN) with ReLU activation and different initialization scales trained on the MNIST dataset. Figure 3: The initialization scale σ correlates negatively with the performance of ResNet-18 on the CIFAR-10 dataset. We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation. |
| Dataset Splits | Yes | We implement a two-layer FCN on the CIFAR-10 dataset with ReLU activation... The cross-entropy loss and stochastic gradient descent without momentum or weight decay are used during training. We use a batch size of 128 and report the best training and test accuracy among all epochs. |
| Hardware Specification | No | The paper mentions training models like 'Resnet18 network' and 'two-layer FCN' but does not specify any hardware details such as GPU/CPU models, memory, or cloud computing resources used for these experiments. |
| Software Dependencies | No | The paper mentions hyperparameters borrowed from a GitHub repository for a Resnet18 network ('https://github.com/kuangliu/pytorch-cifar') but does not specify exact version numbers for any software components like Python, PyTorch, or specific libraries. |
| Experiment Setup | Yes | The learning rates are chosen separately for each model such that the model converges well in 1000 iterations, with training accuracy above 95%. All models use the standard Kaiming initialization, but we scale each layer by σ. In Figure 3, we train a ResNet-18 network on the CIFAR-10 dataset with hyperparameters borrowed from https://github.com/kuangliu/pytorch-cifar. We use a batch size of 128 and report the best training and test accuracy among all epochs. We choose γ = 1/d and η = 0.05 for the standard NTK model, γ = 10/d and learning rate η = 0.05d/100 for the standard mean-field model, γ = 1 and η = 0.05d/100 for the Kaiming model, γ = 100/d and η = 0.05d/100 for the Kaiming+ model, γ = 1 and η = 0.05 for the Xavier+ model, γ = 0.01d and η = 0.05(100/d)^2 for the Xavier model. |
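The initialization scheme quoted above (standard Kaiming initialization with each layer rescaled by σ) can be sketched in a few lines. This is a minimal, stdlib-only illustration of the idea, not the paper's implementation; the function name and pure-Python form are assumptions for clarity.

```python
import math
import random

def kaiming_scaled_init(fan_in, fan_out, sigma=1.0, seed=0):
    """Kaiming-normal weight matrix, std = sqrt(2 / fan_in), rescaled by sigma.

    Sketch of the reported setup: standard Kaiming initialization, with each
    layer's weights multiplied by the scale factor sigma. Illustrative only.
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)  # Kaiming (He) std for ReLU layers
    return [[sigma * rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

# sigma > 1 inflates the initialization scale; per the excerpts above,
# larger sigma correlates with worse ResNet-18 performance on CIFAR-10.
W = kaiming_scaled_init(fan_in=784, fan_out=128, sigma=2.0)
```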
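The (γ, η) choices listed for each parameterization all scale with the width d, which is easier to see tabulated in one place. The helper below transcribes the reported settings (with base learning rate 0.05); the function name and dictionary keys are illustrative, not from the paper.

```python
def gamma_eta(model, d, base_lr=0.05):
    """Return (gamma, eta) for width d, transcribing the reported settings.

    Keys and the helper itself are illustrative; the values mirror the
    excerpt above, e.g. standard NTK: gamma = 1/d, eta = 0.05.
    """
    table = {
        "ntk":        (1.0 / d,   base_lr),
        "mean-field": (10.0 / d,  base_lr * d / 100),
        "kaiming":    (1.0,       base_lr * d / 100),
        "kaiming+":   (100.0 / d, base_lr * d / 100),
        "xavier+":    (1.0,       base_lr),
        "xavier":     (0.01 * d,  base_lr * (100 / d) ** 2),
    }
    return table[model]
```

Note how the NTK and mean-field models shrink γ as width grows while the Kaiming model keeps γ fixed and scales the learning rate instead; these differing scalings are what the paper's phase diagrams distinguish.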