How many samples are needed to train a deep neural network?

Authors: Pegah Golestaneh, Mahsa Taheri, Johannes Lederer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical and empirical results suggest that the generalization error of ReLU feed-forward neural networks scales at the rate 1/√n in the sample size n rather than the parametric rate 1/n, which could be suggested by traditional statistical theories. Thus, broadly speaking, our results underpin the common belief that neural networks need many training samples. Along the way, we also establish new technical insights, such as the first lower bounds of the entropy of ReLU feed-forward networks. ... In Section 5, we shift our focus to the empirical findings to support our theories. ... This section supports our theoretical findings with simulations on benchmark datasets.
Researcher Affiliation | Academia | Pegah Golestaneh, Mahsa Taheri & Johannes Lederer, Department of Mathematics, Computer Science, and Natural Sciences, University of Hamburg, EMAIL
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methods mathematically and textually.
Open Source Code | No | The paper mentions that "The implementation of these neural networks was carried out using the TensorFlow library (see Appendix C for further details)." However, this refers to a third-party library used for implementation, not the authors' own source code for the methodology described in the paper. There is no explicit statement about releasing their code or a link to a repository.
Open Datasets | Yes | For our experiments, we consider both classification and regression tasks. The datasets used include MNIST, Fashion-MNIST and CIFAR10 for classification, and the California Housing Prices (CHP) dataset for regression analysis. ... For example, we imported the Fashion-MNIST dataset from the tensorflow.keras.datasets package.
Dataset Splits | Yes | The MNIST dataset consists of 60 000 training images and 10 000 testing images, each with dimensions of 28×28 pixels. ... The Fashion-MNIST dataset contains 60 000 training images and 10 000 testing images, both with dimensions of 28×28 pixels. ... The CIFAR10 dataset contains 50 000 training images and 10 000 testing images, both with dimensions of 32×32 pixels. ... The version considered in this study comprises 8 numeric input attributes and a dataset of 20 640 samples. These samples were randomly divided into 15 000 for the training data and the remaining for the test data.
Hardware Specification | Yes | 1. Computer resources: we conducted some of the experiments in Python using Google Colab and some of them using the basic plan of deepnote (https://deepnote.com). For the regression dataset, we used the basic plan of them that utilizes a machine with 5GB RAM and 2 vCPUs. For the CIFAR10 dataset, we used one of deepnote's plans that utilizes a machine with 16GB RAM and 4 vCPUs.
Software Dependencies | No | The implementation of these neural networks was carried out using the TensorFlow library (see Appendix C for further details). ... Optimizing these parameters is achieved through the Sequential Least Squares Quadratic Programming (SLSQP) method (Kraft, 1988) and the minimize function from scipy.optimize is employed for SLSQP implementation. ... In the training procedure for our experiments, we have used the Adam optimization method. The paper mentions software like TensorFlow, SciPy, and the Adam optimizer but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We use Cross-entropy (CE) and Mean-squared (MS) error as loss functions for classification and regression datasets, respectively. ... Optimizing these parameters is achieved through the Sequential Least Squares Quadratic Programming (SLSQP) method (Kraft, 1988) and the minimize function from scipy.optimize is employed for SLSQP implementation. The objective function calculates the sum of squared differences between the generalization error of a neural network and two separate curves. ... The batch size for the training samples is set to 20. ... In the training procedure for our experiments, we have used the Adam optimization method.
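As a quick numerical illustration of the paper's central claim — that the generalization error shrinks at the slow 1/√n rate rather than the parametric 1/n rate — the following sketch compares the two rates at a few sample sizes. The sample sizes and printed values are ours, not the paper's:

```python
import math

# Illustration (not from the paper): how the two candidate rates shrink
# with the sample size n. The paper argues that the generalization error
# of ReLU feed-forward networks follows the slower 1/sqrt(n) rate.
for n in [100, 10_000, 1_000_000]:
    slow = 1 / math.sqrt(n)  # 1/sqrt(n): the rate supported by the paper
    fast = 1 / n             # 1/n: the classical parametric rate
    print(f"n={n:>9,}  1/sqrt(n)={slow:.4f}  1/n={fast:.6f}")
```

At n = 10 000, for instance, 1/√n is 0.01 while 1/n is 0.0001 — a hundredfold gap, which is why the slower rate translates into "neural networks need many training samples."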
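The CHP split quoted in the Dataset Splits row (20 640 samples randomly divided into 15 000 for training and the remainder for testing) can be sketched with a random index permutation. This is a hypothetical reconstruction — the seed and variable names are our assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed is an assumption, for reproducibility
n_total, n_train = 20_640, 15_000  # CHP sizes quoted in the paper

perm = rng.permutation(n_total)    # randomly shuffle the sample indices
train_idx, test_idx = perm[:n_train], perm[n_train:]

print(len(train_idx), len(test_idx))  # 15000 5640
```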
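The curve-fitting step quoted in the Experiment Setup row — minimizing the sum of squared differences between the measured generalization error and two separate candidate curves with SLSQP via scipy.optimize.minimize — can be sketched on synthetic data. The sample sizes, the synthetic errors (made to decay like 2/√n), and all variable names are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic "generalization errors" decaying like 2/sqrt(n); these values
# stand in for the measured errors and are an assumption for illustration.
ns = np.array([200.0, 500.0, 1_000.0, 5_000.0, 10_000.0])
errors = 2.0 / np.sqrt(ns)

def fit_curve(curve):
    """Fit a scale c to one candidate curve by minimizing the sum of
    squared differences with SLSQP, mirroring the quoted setup."""
    objective = lambda c: np.sum((errors - c[0] * curve) ** 2)
    return minimize(objective, x0=[1.0], method="SLSQP")

fit_sqrt = fit_curve(1 / np.sqrt(ns))  # candidate rate: c / sqrt(n)
fit_lin = fit_curve(1 / ns)            # candidate rate: c / n
# On this data the 1/sqrt(n) curve leaves the smaller residual,
# which is the kind of comparison the paper's fitting procedure makes.
print(fit_sqrt.x[0], fit_sqrt.fun, fit_lin.fun)
```

Comparing the two residuals (`fit_sqrt.fun` vs `fit_lin.fun`) shows which rate better describes the error curve — the paper's empirical sections report that the 1/√n curve wins on their benchmarks.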