Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra
Authors: Roman Worschech, Bernd Rosenow
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 2: Generalization error ε_g for a linear activation function. Left: ε_g evaluated using Eq. (8) (blue) and Eq. (6) (orange) for N = 128, K = M = 1, σ̃²_J = 1, η = 1, and β = 1. Right: ε_g evaluated using Eq. (9) (dashed orange) compared to simulations averaged over 15 random initializations (solid blue), with N = L = 1024, β = 0.75, η = 0.01, and σ̃_J = 0.01. Figure 3: Generalization error ε_g for different trainable input dimensions N_l of the student network. Left: ε_g as a function of α for various N_l, with L = N = 256, K = M = 1, σ̃_J = 0.01, η = 0.05, and β = 1. The student network is trained on synthetic data and the teacher's outputs. Right: ε_g as a function of α, with L = N = 1024, K = M = 1, σ̃_J = 0.01, and η = 0.05. The student network is trained on the CIFAR-5m dataset (Nakkiran et al., 2021) using the teacher's outputs. We estimate the scaling exponent β ≈ 0.3 for this dataset. For the theoretical predictions, the empirical data spectrum is used to evaluate Eq. (11). Both plots compare the simulation results (solid curves) to the theoretical prediction from Eq. (11) (black dashed lines). Figure 6: Scaling behavior of the generalization error ε_g in the asymptotic regime for a non-linear activation function. Left: ε_g as a function of α for K = M = 40, η = 0.01, σ̃_J = 10⁻⁶, and L = N = 512, for simulations averaged over 10 different initializations. |
| Researcher Affiliation | Academia | Roman Worschech¹,² Bernd Rosenow¹ ¹Institut für Theoretische Physik, Universität Leipzig, Brüderstraße 16, 04103 Leipzig, Germany ²Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany |
| Pseudocode | No | The paper presents mathematical derivations, equations, and describes methods in text, but there are no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor are there structured steps formatted like code. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, a link to a code repository, or mention of code being available in supplementary materials for the methodology described in this paper. |
| Open Datasets | Yes | The student network is trained on the CIFAR-5m dataset (Nakkiran et al., 2021) using the teacher's outputs. |
| Dataset Splits | No | The paper does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits with specific details) for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or detailed computer specifications used for running the simulations or experiments. |
| Software Dependencies | No | The paper states: "In Appendix G, we utilized Julia, a high-level scripting language, with arbitrary precision arithmetic." It names Julia but does not specify a version number, nor does it list other key software components with versions (e.g., libraries or specific solvers). |
| Experiment Setup | Yes | Figure 2: Generalization error ε_g for a linear activation function. Left: ε_g evaluated using Eq. (8) (blue) and Eq. (6) (orange) for N = 128, K = M = 1, σ̃²_J = 1, η = 1, and β = 1. Right: ε_g evaluated using Eq. (9) (dashed orange) compared to simulations averaged over 15 random initializations (solid blue), with N = L = 1024, β = 0.75, η = 0.01, and σ̃_J = 0.01. Figure 4: Symmetric plateau for a non-linear activation function. Left and center: Plateau behavior of the order parameters for L = 10, N = 7000, σ̃_J = 0.01, η = 0.1, and M = K = 4, using one random initialization of the student and teacher vectors. |
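The experiment-setup quotes above describe online teacher-student training on Gaussian data with a power-law covariance spectrum. As a hedged illustration only (the paper releases no code, and all variable names and symbol assignments here are assumptions), the following minimal NumPy sketch mimics the flavor of the Figure 2 (right) setting: a linear single-unit teacher and student (K = M = 1), spectrum exponent 0.75, learning rate 0.01, and small initial student weights.

```python
import numpy as np

# Hypothetical illustration, NOT the authors' released code: online
# teacher-student training of a linear single-unit network (K = M = 1)
# on Gaussian data whose covariance eigenvalues follow a power law.
rng = np.random.default_rng(0)

N = 256           # input dimension (assumed; the paper's figure uses N = L = 1024)
beta = 0.75       # power-law exponent of the data spectrum (assumed symbol)
eta = 0.01        # SGD learning rate
sigma_J = 0.01    # scale of the student's initial weights
steps = 20000     # number of online SGD updates

# Data covariance spectrum: lambda_k proportional to k^(-beta)
lam = np.arange(1, N + 1, dtype=float) ** (-beta)

B = rng.standard_normal(N)            # teacher weight vector
J = sigma_J * rng.standard_normal(N)  # student weight vector

def eps_g(J, B, lam):
    """Generalization error E[(J.x - B.x)^2] / 2 for a linear activation."""
    d = J - B
    return 0.5 * np.sum(lam * d**2)

eps_initial = eps_g(J, B, lam)
for _ in range(steps):
    # Draw one fresh example: independent Gaussian modes with variances lam
    x = np.sqrt(lam) * rng.standard_normal(N)
    J -= eta * (J @ x - B @ x) * x    # online SGD step on the squared loss
eps_final = eps_g(J, B, lam)
```

In this toy setting the error decays mode by mode, with the slow tail of the spectrum governing the asymptotic power law; an analytic prediction such as the paper's Eq. (11) is what simulations of this kind would be compared against.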