Deep linear networks can benignly overfit when shallow ones do

Authors: Niladri S. Chatterji, Philip M. Long

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced. Figure 1 contains plots from simulation experiments where the excess risk of a deep linear model increases with the scale of the initialization of the first layer, as in the upper bounds of our analysis. Inspired by our theory, we ran simulations to study the excess risk of several linear networks and ReLU networks as a function of both the initialization scale and dimension.
Researcher Affiliation | Collaboration | Niladri S. Chatterji (EMAIL), Computer Science Department, Stanford University, 353 Jane Stanford Way, Stanford, CA 94305. Philip M. Long (EMAIL), Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043.
Pseudocode | No | The paper describes theoretical proofs and simulation results, but no pseudocode or algorithm blocks are explicitly presented.
Open Source Code | Yes | Code at https://github.com/niladri-chatterji/Benign-Deep-Linear
Open Datasets | No | The generative model for the underlying data was y = xΘ + ω, where x ∼ N(0, Σ) and ω ∼ N(0, 1)... the generative model for the underlying data was y = f(x) + ω, where x ∼ N(0, I_{10×10}) and ω ∼ N(0, 1). The paper describes a synthetic data-generation process rather than using or providing a publicly available dataset with concrete access information.
Dataset Splits | No | The model is trained on n = 100 points drawn from the generative model y = xΘ + ω... The networks are trained on n = 500 samples... The excess risk is defined as E_{x,y}[(y − xΘ̂)² − (y − xΘ*)²], where x, y are test samples that are independent of Θ̂. While the paper states the number of training samples and evaluation on independent test samples, it does not specify explicit dataset split percentages, counts, or a detailed splitting methodology for reproducibility.
Hardware Specification | No | The paper describes simulation setups and training procedures, but does not provide any specific details about the hardware used to run these experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions training models with 'full-batch gradient descent' and 'ReLU networks,' but does not provide specific software names or version numbers (e.g., Python, PyTorch, TensorFlow versions) used for implementation.
Experiment Setup | Yes | All of the models are trained on the squared loss with full-batch gradient descent with step-size 10^-4, until the training loss is smaller than 10^-7. We train models that have 2 hidden layers (L = 3). The width of the middle layers m is set to be 10(d + q), where d is the input dimension and q is the output dimension. ... The width of the middle layers (m) is set to be 50.
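For concreteness, the quoted setup (synthetic data from y = xΘ + ω, a 3-layer deep linear network, full-batch gradient descent on the squared loss until the training loss falls below a threshold, and the excess-risk metric) can be sketched as below. This is a minimal NumPy illustration, not the authors' code (their repository is linked in the table): it assumes x ∼ N(0, I_d), scalar outputs (q = 1), and hand-derived gradients; the function names `make_data`, `train_deep_linear`, and `excess_risk` are ours.

```python
import numpy as np

def make_data(n, d, theta, rng):
    """Draw n samples from y = x.theta + omega, x ~ N(0, I_d), omega ~ N(0, 1)."""
    X = rng.standard_normal((n, d))
    y = X @ theta + rng.standard_normal(n)
    return X, y

def train_deep_linear(X, y, width, init_scale, lr=1e-4, tol=1e-7,
                      max_iter=100_000, rng=None):
    """Full-batch GD on the squared loss for the product map x -> x W1 W2 W3."""
    n, d = X.shape
    # init_scale controls the first layer's initialization (the quantity the
    # quoted experiments vary); the other layers use a 1/sqrt(width) scale.
    W1 = init_scale * rng.standard_normal((d, width))
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)
    W3 = rng.standard_normal((width, 1)) / np.sqrt(width)
    loss = np.inf
    for _ in range(max_iter):
        resid = X @ W1 @ W2 @ W3 - y[:, None]   # (n, 1) residuals
        loss = float((resid ** 2).mean())
        if loss < tol:                          # stopping rule quoted above
            break
        G = (2.0 / n) * X.T @ resid             # dLoss / d(W1 W2 W3)
        g1 = G @ (W2 @ W3).T                    # chain rule through the product
        g2 = W1.T @ G @ W3.T
        g3 = (W1 @ W2).T @ G
        W1 -= lr * g1
        W2 -= lr * g2
        W3 -= lr * g3
    theta_hat = (W1 @ W2 @ W3).ravel()          # end-to-end linear predictor
    return theta_hat, loss

def excess_risk(theta_hat, theta):
    """For x ~ N(0, I_d), the excess risk reduces to ||theta_hat - theta||^2."""
    return float(np.sum((theta_hat - theta) ** 2))
```

Sweeping `init_scale` (and the input dimension d) and plotting `excess_risk` against it would mimic the experiment the "Research Type" row describes; the paper's own runs use larger n, width m = 10(d + q) or 50, step-size 10^-4, and the 10^-7 stopping threshold quoted above.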