How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Authors: Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our central goal is to paint a complete picture of how two-layer neural networks adapt to the features of training data (z^ν, y^ν)_{ν=1}^n ∈ R^{d+1} in the early phase of training after the first few steps of gradient descent. ... Appendix A. Numerical investigation. In this section we explain the procedures to get the different figures in the main text, along with the details behind the numerical experiments. We provide as well additional plots corroborating the theoretical results presented in the main manuscript. The code is available on GitHub. |
| Researcher Affiliation | Academia | Yatin Dandi EMAIL Florent Krzakala EMAIL Bruno Loureiro EMAIL Luca Pesce EMAIL Ludovic Stephan EMAIL Information, Learning and Physics (IdePHICS) Laboratory, École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland; Département d'Informatique, École Normale Supérieure, PSL & CNRS, 45 rue d'Ulm, F-75230 Paris cedex 05, France; Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France; Statistical Physics of Computation (SPOC) Laboratory, École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland |
| Pseudocode | Yes | Appendix A. Numerical investigation... Description of training algorithm and hyperparameters: First, we describe the training protocol reported in Alg. 1: we separately update the first layer with T GD steps of learning rate η, followed by training with standard ridge regression for the second layer with fixed regularization strength λ. ... Algorithm 1 Training procedure |
| Open Source Code | Yes | The code to reproduce our figures is available on GitHub, and we refer to App. A for details on the numerical implementations. |
| Open Datasets | No | In this work, we focus on a popular synthetic data model consisting of: a) independently drawn standard Gaussian covariates z^ν ∼ N(0, I_d); b) a target function y^ν = f(z^ν) depending only on a finite number of relevant directions, also known as a multi-index model. |
| Dataset Splits | No | for every gradient step t ≤ T, a fresh batch of training data {(z^ν, y^ν)}_{ν=1}^n is drawn from the model in Assumption 1, and the first layer weights are updated according to: ... (ii) Second layer training: once the first layer is trained for T steps, the second layer weights a are trained to optimality on an independent batch of data by performing ridge regression with the features learned in the first step: ... We note that the paper uses synthetically generated data in batches rather than predefined train/test/validation splits from a fixed dataset. |
| Hardware Specification | No | The paper mentions parameters like 'd = 28', 'd = 512', 'p = 256', 'p = 1024', which refer to data dimensions or model architecture, not specific hardware components. No specific hardware (GPU, CPU models, memory, etc.) is mentioned. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with version numbers, such as programming languages or libraries. |
| Experiment Setup | Yes | Description of training algorithm and hyperparameters: First, we describe the training protocol reported in Alg. 1: we separately update the first layer with T GD steps of learning rate η, followed by training with standard ridge regression for the second layer with fixed regularization strength λ. We vary adaptively the learning rate to satisfy the hypothesis of Thm. 5, i.e. η = O(p√(n/d)), and we take noiseless labels. If not stated otherwise, we consider fixed regularization strength λ = 1. We average over 10 different seeds to get the mean performance, and we use standard deviation for giving confidence intervals. |
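To make the protocol quoted above concrete, here is a minimal numpy sketch of the two-phase procedure the table describes (Alg. 1 of the paper): Gaussian covariates with a target depending on a few directions, T gradient steps on the first layer with a fresh batch per step, then ridge regression on the second layer over an independent batch. The specific activation, target function, dimensions, and scalings below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p, n = 64, 32, 512      # input dim, hidden width, batch size (assumed)
T, eta, lam = 3, 1.0, 1.0  # GD steps, learning rate, ridge strength (assumed)

# Illustrative single-index target y = f(<w*, z>); the paper allows general
# multi-index targets depending on finitely many relevant directions.
w_star = rng.standard_normal(d) / np.sqrt(d)

def target(z):
    return np.tanh(z @ w_star)

act = np.tanh                                 # assumed activation

W = rng.standard_normal((p, d)) / np.sqrt(d)  # first-layer weights
a = rng.standard_normal(p) / np.sqrt(p)       # second layer, fixed in phase 1

# Phase 1: T gradient steps on W, each on a fresh Gaussian batch (squared loss).
for t in range(T):
    Z = rng.standard_normal((n, d))           # fresh batch, z^nu ~ N(0, I_d)
    y = target(Z)
    pre = Z @ W.T                             # (n, p) preactivations
    resid = (act(pre) @ a) / p - y            # network output minus labels
    # gradient of (1/2n) sum resid^2 with respect to W
    grad = ((resid[:, None] * (1 - act(pre) ** 2) * a[None, :]).T @ Z) / (n * p)
    W -= eta * grad

# Phase 2: ridge regression for the second layer on an independent batch,
# using the features learned in phase 1.
Z2 = rng.standard_normal((n, d))
Phi = act(Z2 @ W.T)                           # (n, p) learned features
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ target(Z2))

# Evaluate mean squared error on fresh test data.
Zt = rng.standard_normal((n, d))
test_mse = np.mean((act(Zt @ W.T) @ a - target(Zt)) ** 2)
print(float(test_mse))
```

Averaging this over several seeds, as the setup row describes, would give the mean performance with standard-deviation error bars.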