A Generalization Bound for Nearly-Linear Networks
Authors: Eugene Golikov
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our bound on a simple fully-connected network trained on a downsampled MNIST dataset, and demonstrate that it becomes non-vacuous in this scenario (Section 6). |
| Researcher Affiliation | Academia | Eugene Golikov (EMAIL), Chair of Statistical Field Theory, École Polytechnique Fédérale de Lausanne (EPFL) |
| Pseudocode | No | The paper describes mathematical equations and derivations of the bound, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | We validate our bound on a simple fully-connected network trained on a downsampled MNIST dataset... We consider 7x7 binary MNIST, L = 2, κ = 2, ϵ = 0.001, and vary β. The bound of Theorem 5.2 converges as β vanishes and increases as β grows. The bound stays non-vacuous for a small enough β and a properly chosen γ. We consider γ = β²/q for q ∈ {1, 10, 100}. |
| Dataset Splits | No | The paper mentions training on "60000" digits of MNIST for binary classification and later refers to a "test part of the MNIST dataset" in Appendix B.1, implying a split was used. However, it does not explicitly provide specific percentages, sample counts for train/test/validation, or a detailed splitting methodology for reproducibility (e.g., "80/10/10 split" or specific random seed). |
| Hardware Specification | No | The paper mentions setting floating point precision (p = 32 or p = 16) but does not specify any hardware details such as GPU/CPU models, processors, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions initializing layers "in a standard PyTorch way" but does not specify the version number of PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | We run gradient descent with learning rate 0.001. By default, we take L = 2, the floating point precision to be p = 32, downsample the images to 7x7, and initialize the layers randomly in a standard PyTorch way (plus, we rescale the weights to match the required layer norm β). For some experiments, we consider deeper networks, half-precision p = 16, downsample less aggressively, or enforce the input layer weight matrix to have rank 1. |
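The setup quoted above can be sketched in code. This is a hypothetical reconstruction, not the authors' script: it uses NumPy as a stand-in for PyTorch, assumes a ReLU hidden layer and a logistic loss on ±1 labels, uses random toy data in place of MNIST, and interprets the layer norm β as a spectral norm; every name here is an assumption.

```python
# Hypothetical sketch of the described setup: 7x7 downsampled binary MNIST,
# L = 2 fully-connected layers, full-batch gradient descent with lr = 0.001,
# PyTorch-style random init rescaled to a target layer norm beta.
import numpy as np

rng = np.random.default_rng(0)

def downsample_7x7(images_28x28):
    """Average-pool 28x28 images to 7x7 (factor 4 per dimension), flatten to 49."""
    n = images_28x28.shape[0]
    return images_28x28.reshape(n, 7, 4, 7, 4).mean(axis=(2, 4)).reshape(n, 49)

def init_layer(fan_in, fan_out, beta):
    """PyTorch-default uniform init U(-1/sqrt(fan_in), 1/sqrt(fan_in)),
    then rescaled so the spectral norm equals the target beta."""
    bound = 1.0 / np.sqrt(fan_in)
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    return W * (beta / np.linalg.norm(W, ord=2))

# Toy stand-in for binary MNIST: random images, labels in {-1, +1}.
X = downsample_7x7(rng.uniform(0.0, 1.0, size=(64, 28, 28)))
y = rng.choice([-1.0, 1.0], size=64)

beta, lr = 1.0, 0.001                 # layer norm and learning rate from the paper
W1, W2 = init_layer(49, 32, beta), init_layer(32, 1, beta)

def forward(X, W1, W2):
    h = np.maximum(X @ W1.T, 0.0)     # ReLU hidden layer (activation assumed)
    return (h @ W2.T).ravel(), h

def logistic_loss(margins):
    # mean log(1 + exp(-y * f(x))), written stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margins))

loss_before = logistic_loss(y * forward(X, W1, W2)[0])
for _ in range(100):                  # plain full-batch gradient descent
    f, h = forward(X, W1, W2)
    g = -y / (1.0 + np.exp(y * f)) / len(y)           # dLoss/df per sample
    mask = (h > 0).astype(float)                      # ReLU derivative
    W2 -= lr * (g @ h).reshape(1, -1)
    W1 -= lr * ((np.outer(g, W2.ravel()) * mask).T @ X)
loss_after = logistic_loss(y * forward(X, W1, W2)[0])
```

With the small learning rate and smooth loss, the training loss decreases monotonically over these steps; the `init_layer` rescaling is what the quoted "rescale the weights to match the required layer norm β" is taken to mean here.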