Random Feature Amplification: Feature Learning and Generalization in Neural Networks
Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics amplify these weak, random features to strong, useful features. ... We plot the decision boundary resulting from training a two-layer ReLU network given n = 5000 samples... The network was trained for T = 3000 iterations... In Figure 2, we examine the behavior of two-layer ReLU networks trained by gradient descent on the logistic loss for the 2-XOR distribution we consider when 15% of the labels are flipped... |
| Researcher Affiliation | Collaboration | Spencer Frei EMAIL Simons Institute for the Theory of Computing, University of California, Berkeley, Calvin Lab #230, Berkeley, CA 94720 Niladri S. Chatterji EMAIL Computer Science Department, Stanford University, 353 Jane Stanford Way, Stanford, CA 94305 Peter L. Bartlett EMAIL University of California, Berkeley & Google DeepMind, 367 Evans Hall #3860, Berkeley, CA 94720 |
| Pseudocode | No | The paper describes algorithms and methods in prose, but does not include any distinct, labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | No | The paper describes using a synthetic dataset generated by an XOR-like function of input features from a uniform mixture of four clusters. It does not provide access information (link, DOI, citation) to a publicly available dataset, nor does it make its generated data publicly available. |
| Dataset Splits | No | The paper states, "Validation accuracy is measured using n = 6000 samples." and for Figure 1, "given n = 5000 samples". However, it does not explicitly provide details on how these samples are split into training, validation, or test sets (e.g., percentages, methodology, or specific files for custom splits). The training data is described as "generated as i.i.d. samples from P" and test error is defined over the distribution P, which does not constitute a fixed dataset split. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or other machine specifications used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software components with version numbers (e.g., programming languages, libraries, frameworks, or specialized solvers) that were used in the experimental setup. |
| Experiment Setup | Yes | In Figure 1, the paper states: "T = 3000 iterations, with network width m = 500, step-size α = 0.05, and initialization variance ω²_init = 1/(32m)". Appendix E, for Figure 2, states: "m = 400 neurons. The within-cluster distribution is Gaussian, P_clust = N(0, σ²I_d), where the within-cluster variance is given by σ² = 1/d^1.2, and we flip 15% of the labels within each cluster to the orthogonal cluster's label. We initialize using centered Gaussians with variance ω²_init = 0.01/md and run with a step-size of α = 0.1." |
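
The hyperparameters reported above are enough to sketch the experimental setup. The following is an illustrative reconstruction, not the authors' code: the cluster means at ±e₁/±e₂, the fixed ±1/m second layer, training only the first-layer weights, and all function names are our assumptions; the defaults mirror the Figure 1 values (m = 500, α = 0.05, T = 3000, ω²_init = 1/(32m)).

```python
import numpy as np

def make_2xor(n, d, sigma2, noise_rate, rng):
    """Sample n points from a 2-XOR cluster mixture in R^d.

    Clusters sit at +/-e1 and +/-e2; labels follow an XOR-like rule
    (+1 on the +/-e1 clusters, -1 on the +/-e2 clusters), and a
    noise_rate fraction of labels is flipped, as in the paper's
    noisy-label setting. The unit-norm cluster means are an
    illustrative choice.
    """
    mu = np.zeros((4, d))
    mu[0, 0], mu[1, 0] = 1.0, -1.0   # +/- e1 -> label +1
    mu[2, 1], mu[3, 1] = 1.0, -1.0   # +/- e2 -> label -1
    idx = rng.integers(0, 4, size=n)
    X = mu[idx] + rng.normal(0.0, np.sqrt(sigma2), size=(n, d))
    y = np.where(idx < 2, 1.0, -1.0)
    flip = rng.random(n) < noise_rate
    y[flip] *= -1.0
    return X, y

def train_two_layer_relu(X, y, m=500, alpha=0.05, T=3000,
                         w2_init=None, rng=None):
    """Full-batch gradient descent on the logistic loss for
    f(x) = sum_j a_j * relu(w_j . x).

    Only the first-layer weights W are trained; the second layer is
    frozen at random signs a_j = +/-1/m (an assumption on our part
    about which layers are trained).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    if w2_init is None:
        w2_init = 1.0 / (32 * m)     # Figure 1 initialization variance
    W = rng.normal(0.0, np.sqrt(w2_init), size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m) / m
    for _ in range(T):
        Z = X @ W.T                      # (n, m) pre-activations
        f = np.maximum(Z, 0.0) @ a       # network outputs
        g = -y / (1.0 + np.exp(y * f))   # d(logistic loss)/df
        # d(loss)/dW, using relu'(z) = 1[z > 0]
        G = ((g[:, None] * (Z > 0)) * a[None, :]).T @ X / n
        W -= alpha * G
    return W, a

def predict(W, a, X):
    """Sign of the network output on each row of X."""
    return np.sign(np.maximum(X @ W.T, 0.0) @ a)
```

With these pieces, a run resembling Figure 1 would draw n = 5000 training samples, train for T = 3000 steps, and measure validation accuracy on a fresh draw of n = 6000 samples; note that no linear classifier can beat chance on this distribution, which is what makes the trained ReLU network's near-noise-rate error notable.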