Position: Deep Learning is Not So Mysterious or Different

Authors: Andrew Gordon Wilson

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Indeed, we will aim to introduce the simplest examples possible, often basic linear models, to replicate these phenomena and explain the intuition behind them. The hope is that by relying on particularly simple examples, we can drive home the point that these generalization behaviours are hardly distinct to neural networks and can in fact be understood with basic principles. For example, in Figure 1, we show that benign overfitting and double descent can be reproduced and explained with simple linear models. (...) Figure 1. Generalization phenomena associated with deep learning can be reproduced with simple linear models and understood. Top: Benign Overfitting. A 150th order polynomial with order-dependent regularization reasonably describes (a) simple and (b) complex structured data, while also being able to perfectly fit (c) pure noise. (d) A Gaussian process exactly reproduces the CIFAR-10 results in Zhang et al. (2016), perfectly fitting noisy labels, but still achieving reasonable generalization. Moreover, for both the GP and (e) ResNet, the marginal likelihood, directly corresponding to PAC-Bayes bounds (Germain et al., 2016), decreases with more altered labels, as in Wilson & Izmailov (2020). Bottom: Double Descent. Both the (f) ResNet and (g) linear random feature model display double descent, with effective dimensionality closely tracking the second descent in the low training loss regime as in Maddox et al. (2020)."
Researcher Affiliation | Academia | "Andrew Gordon Wilson¹; ¹New York University. Correspondence to: Andrew Gordon Wilson <EMAIL>."
Pseudocode | No | The paper contains no sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code | No | The paper does not state that the authors are releasing source code for the methodology described in the paper, nor does it provide any link to a code repository.
Open Datasets | Yes | "Figure 1. (...) (d) A Gaussian process exactly reproduces the CIFAR-10 results in Zhang et al. (2016), perfectly fitting noisy labels, but still achieving reasonable generalization. (...) (e) ResNet-20 on CIFAR-10 (...) Figure 1 (bottom left), showing cross-entropy loss on CIFAR-100 with increases in the width of each layer of a ResNet-18, training to convergence. (...) Figure 1(d)(e) is adapted from Wilson & Izmailov (2020), which uses a Gaussian process with an RBF kernel, and a PreResNet-20 and isotropic prior p(w) = N(0, α²I) and Laplace marginal likelihood, and in turn replicates the CIFAR-10 noisy label experiment in Zhang et al. (2016)."
Dataset Splits | Yes | "Figure 6 fits two 15th order polynomials and one 2nd order polynomial to data generated from a 2nd order polynomial, a 15th order polynomial, and cos((3/2)πx). One of the 15th order polynomials uses the order-dependent regularization Σ_j 0.01·2^j w_j². Train and test input locations are sampled from N(0, 1). The number of test samples is 100 and the number of train samples ranges from 10 to 100."
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU models, CPU types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, CUDA 11.1') that would be needed to replicate the experiments.
Experiment Setup | Yes | "Appendix F. Experimental Details. In Figure 1(a)(b)(c), we use a 150th order polynomial with order-dependent regularization Σ_j 2^j w_j² (green) to fit regression data generated from (a) sin(x)cos(x²), (b) x + cos(πx), (c) N(0, 1) noise. (...) in Figure 1(g) we use the random feature least squares model Xw = y, with each column X_i = y_i + ε where ε ∼ N(0, 1). We measure MSE, and use α = 10 to compute the effective dimensionality of the parameter covariance matrix (inverse Hessian). (...) Figure 6 fits two 15th order polynomials and one 2nd order polynomial to data generated from a 2nd order polynomial, a 15th order polynomial, and cos((3/2)πx). One of the 15th order polynomials uses the order-dependent regularization Σ_j 0.01·2^j w_j². Train and test input locations are sampled from N(0, 1)."
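The order-dependent regularization quoted under Experiment Setup amounts to a weighted ridge regression on polynomial features, with per-coefficient penalty λ_j = 2^j. A minimal sketch of that idea follows; it is not the authors' code. To keep float64 arithmetic stable, the polynomial order is reduced from 150 to 30, the sample size (40 points on [-1, 1]) is hypothetical, and the structured target x + cos(πx) is taken from the Figure 1(b) description.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_poly_ridge(x, y, lam):
    """Weighted ridge on polynomial features: minimizes
    ||Phi w - y||^2 + sum_j lam_j w_j^2, with the closed-form
    solution w = (Phi^T Phi + diag(lam))^{-1} Phi^T y."""
    Phi = np.vander(x, len(lam), increasing=True)  # columns 1, x, x^2, ...
    return np.linalg.solve(Phi.T @ Phi + np.diag(lam), Phi.T @ y)


order = 30                              # reduced from 150 for this sketch
lam = 2.0 ** np.arange(order + 1)       # order-dependent penalty 2^j

x = rng.uniform(-1.0, 1.0, 40)          # hypothetical inputs
y_structured = x + np.cos(np.pi * x)    # structured data, as in Figure 1(b)
y_noise = rng.normal(size=40)           # pure N(0, 1) noise, as in Figure 1(c)

w_s = fit_poly_ridge(x, y_structured, lam)
w_n = fit_poly_ridge(x, y_noise, lam)

Phi = np.vander(x, order + 1, increasing=True)
mse_structured = float(np.mean((Phi @ w_s - y_structured) ** 2))
mse_noise = float(np.mean((Phi @ w_n - y_noise) ** 2))
print("train MSE, structured data:", mse_structured)
print("train MSE, pure noise:     ", mse_noise)
```

The exponentially growing penalty acts as a soft order constraint: the smooth structured target is captured by cheap low-order terms, while fitting noise requires expensive high-order ones, which is the "soft inductive bias" reading of benign overfitting the paper argues for.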
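The effective dimensionality used for Figure 1(g) is, per Maddox et al. (2020), the eigenvalue sum N_eff(H, α) = Σ_i λ_i / (λ_i + α). A sketch under stated assumptions: the column construction X_i = y_i + ε is read here as each column of X being a noisy copy of the target vector, the Hessian of the squared loss is taken as XᵀX, α = 10 follows Appendix F, and the problem sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def effective_dimensionality(H, alpha):
    """N_eff(H, alpha) = sum_i lam_i / (lam_i + alpha),
    as in Maddox et al. (2020)."""
    lam = np.clip(np.linalg.eigvalsh(H), 0.0, None)  # guard tiny negatives
    return float(np.sum(lam / (lam + alpha)))


# Random feature least squares Xw = y, each column a noisy copy of
# the targets; n and the sweep over p are hypothetical choices.
n = 100
y = rng.normal(size=n)
eff_dims = {}
for p in (20, 100, 500):  # under-, critically, and over-parameterized
    X = y[:, None] + rng.normal(size=(n, p))
    H = X.T @ X  # Hessian of the squared loss (up to a constant factor)
    eff_dims[p] = effective_dimensionality(H, alpha=10.0)
    print(p, round(eff_dims[p], 1))
```

Since each term λ/(λ + α) lies in [0, 1) and XᵀX has at most min(n, p) nonzero eigenvalues, N_eff is bounded by min(n, p), which is why it can track the second descent rather than simply growing with the raw parameter count.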