Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Authors: Roman Worschech, Bernd Rosenow

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Figure 2: Generalization error ε_g for a linear activation function. Left: ε_g evaluated using Eq. (8) (blue) and Eq. (6) (orange) for N = 128, K = M = 1, σ̃²_J = 1, η = 1, and ν = 1. Right: ε_g evaluated using Eq. (9) (dashed orange) compared to simulations averaged over 15 random initializations (solid blue), with N = L = 1024, ν = 0.75, η = 0.01, and σ̃_J = 0.01. Figure 3: Generalization error ε_g for different trainable input dimensions N_l of the student network. Left: ε_g as a function of α for various N_l, with L = N = 256, K = M = 1, σ̃_J = 0.01, η = 0.05, and ν = 1. The student network is trained on synthetic data and the teacher's outputs. Right: ε_g as a function of α, with L = N = 1024, K = M = 1, σ̃_J = 0.01, and η = 0.05. The student network is trained on the CIFAR-5m dataset (Nakkiran et al., 2021) using the teacher's outputs. We estimate the scaling exponent ν ≈ 0.3 for this dataset. For the theoretical predictions, the empirical data spectrum is used to evaluate Eq. (11). Both plots compare the simulation results (solid curves) to the theoretical prediction from Eq. (11) (black dashed lines). Figure 6: Scaling behavior of the generalization error ε_g in the asymptotic regime for a non-linear activation function. Left: ε_g as a function of α for K = M = 40, η = 0.01, σ̃_J = 10⁻⁶, and L = N = 512, for simulations averaged over 10 different initializations.
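The captions above mention estimating a scaling exponent (≈ 0.3 for CIFAR-5m) from the empirical data spectrum. A minimal sketch of how such an exponent can be read off a spectrum is a log-log linear fit; this is an illustration under the assumption of a clean power law λ_k ∝ k^(−ν), not the authors' procedure, and the function name and parameters are hypothetical:

```python
import numpy as np

def spectrum_exponent(eigenvalues, k_min=1, k_max=None):
    """Fit lambda_k ~ k^(-nu) on a log-log scale and return the estimate of nu."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    k_max = len(lam) if k_max is None else k_max
    k = np.arange(k_min, k_max + 1)
    # slope of log(lambda_k) vs log(k) is -nu for a power-law spectrum
    slope, _ = np.polyfit(np.log(k), np.log(lam[k_min - 1:k_max]), 1)
    return -slope

# Synthetic check: a spectrum built with exponent 0.75 is recovered exactly.
k = np.arange(1, 513, dtype=float)
nu_hat = spectrum_exponent(k ** -0.75)
print(round(nu_hat, 2))  # → 0.75
```

On real spectra the fit range (k_min, k_max) matters, since the head and tail of an empirical spectrum typically deviate from the power law.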
Researcher Affiliation Academia Roman Worschech1,2 Bernd Rosenow1 1Institut für Theoretische Physik, Universität Leipzig, Brüderstraße 16, 04103 Leipzig, Germany 2Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
Pseudocode No The paper presents mathematical derivations, equations, and describes methods in text, but there are no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor are there structured steps formatted like code.
Open Source Code No The paper does not contain an explicit statement about releasing source code, a link to a code repository, or mention of code being available in supplementary materials for the methodology described in this paper.
Open Datasets Yes The student network is trained on the CIFAR-5m dataset (Nakkiran et al., 2021) using the teacher's outputs.
Dataset Splits No The paper does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits with specific details) for reproducibility.
Hardware Specification No The paper does not provide specific hardware details such as GPU or CPU models, processor types, or detailed computer specifications used for running the simulations or experiments.
Software Dependencies No The paper states: "In Appendix G, we utilized Julia, a high-level scripting language, with arbitrary precision arithmetic." The text mentions Julia but does not specify a version number, nor does it list other key software components with versions (e.g., libraries or specific solvers).
Experiment Setup Yes Figure 2: Generalization error ε_g for a linear activation function. Left: ε_g evaluated using Eq. (8) (blue) and Eq. (6) (orange) for N = 128, K = M = 1, σ̃²_J = 1, η = 1, and ν = 1. Right: ε_g evaluated using Eq. (9) (dashed orange) compared to simulations averaged over 15 random initializations (solid blue), with N = L = 1024, ν = 0.75, η = 0.01, and σ̃_J = 0.01. Figure 4: Symmetric plateau for a non-linear activation function. Left and center: Plateau behavior of the order parameters for L = 10, N = 7000, σ̃_J = 0.01, η = 0.1, and M = K = 4, using one random initialization of the student and teacher vectors.
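The experiment setup quoted above is the classic teacher-student online-learning paradigm: a two-layer student with small initial weight scale is trained on the outputs of a fixed teacher, and the generalization error is tracked over simulated runs. The following is an illustrative sketch of such a simulation, not the authors' code: it uses a tanh soft-committee machine as a stand-in for the paper's non-linear activation, and the variable names (N input dimension, K/M student/teacher hidden units, learning rate eta, initial scale sigma_J) mirror the caption parameters by assumption.

```python
import numpy as np

def output(W, x):
    # soft-committee output: sum of the hidden-unit responses
    return np.tanh(W @ x / np.sqrt(W.shape[1])).sum()

def eps_g(J, B, X_test):
    # generalization error: mean squared student-teacher output mismatch
    return float(np.mean([0.5 * (output(J, x) - output(B, x)) ** 2
                          for x in X_test]))

rng = np.random.default_rng(0)
N, K, M = 64, 2, 2                         # input dim, student/teacher units
eta, sigma_J = 0.1, 0.01                   # learning rate, initial weight scale
B = rng.standard_normal((M, N))            # fixed teacher weights
J = sigma_J * rng.standard_normal((K, N))  # small student initialization
X_test = rng.standard_normal((500, N))     # held-out inputs for eps_g

e0 = eps_g(J, B, X_test)
for _ in range(5000):                      # online SGD: one fresh example per step
    x = rng.standard_normal(N)
    pre = J @ x / np.sqrt(N)
    delta = output(J, x) - output(B, x)
    # gradient of 0.5*delta^2 w.r.t. each student unit's weights
    J -= eta / np.sqrt(N) * np.outer(delta * (1.0 - np.tanh(pre) ** 2), x)

print(eps_g(J, B, X_test) < e0)
```

Averaging eps_g over several random initializations, as in the figures, would simply wrap this loop in an outer loop over seeds.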