Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

Authors: Hossein Taheri, Christos Thrampoulidis, Arya Mazumdar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present numerical results on the behavior of the generalization bound derived in Theorem 2.1 for real-world data (Fashion MNIST and MNIST datasets) and compare it with the empirical generalization gap. Experiments on learning under NTK with small step-size. Experiments on learning the XOR distribution with large step-size. Figure 4 demonstrates the test error curves associated with learning the XOR distribution according to the setting of Theorem 2.4.
Researcher Affiliation | Academia | Hossein Taheri, Department of Computer Science and Engineering, University of California, San Diego; Christos Thrampoulidis, Department of Electrical and Computer Engineering, University of British Columbia; Arya Mazumdar, Department of Computer Science and Engineering, University of California, San Diego.
Pseudocode | No | The paper describes algorithms in text, e.g., "w_{t+1} = w_t − η∇F̂(w_t)" for gradient descent and "w_{t+1} = w_t − η∇F̂_B(w_t)" for mini-batch SGD, but does not include any clearly labeled pseudocode or algorithm blocks.
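The two update rules quoted in this row can be written out as a minimal numpy sketch. The function names, the logistic-loss objective, and the batch-sampling scheme are illustrative assumptions, not pseudocode taken from the paper:

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the empirical logistic loss F(w) = mean(log(1 + exp(-y * Xw)))."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))  # per-sample dF/d(Xw)
    return X.T @ coeffs / len(y)

def gd_step(w, X, y, eta):
    """Full-batch GD: w_{t+1} = w_t - eta * grad F(w_t)."""
    return w - eta * logistic_loss_grad(w, X, y)

def sgd_step(w, X, y, eta, batch, rng):
    """Mini-batch SGD: same update, gradient taken over a random batch B_t."""
    idx = rng.choice(len(y), size=batch, replace=False)
    return w - eta * logistic_loss_grad(w, X[idx], y[idx])
```

Run on linearly separable toy data, a few hundred of either step drives the training loss down, which is all this sketch is meant to demonstrate.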
Open Source Code | No | The paper does not contain any statements about releasing code or links to a code repository.
Open Datasets | Yes | In this section, we present numerical results on the behavior of the generalization bound derived in Theorem 2.1 for real-world data (Fashion MNIST and MNIST datasets) and compare it with the empirical generalization gap. Experiments on learning the XOR distribution with large step-size. Figure 4 demonstrates the test error curves associated with learning the XOR distribution according to the setting of Theorem 2.4.
Dataset Splits | No | Figure 1: Iteration-based distance from initialization (‖w_t − w_0‖), training loss, test loss and generalization gap (i.e., test loss − train loss) for training a two-hidden-layer neural network on the Fashion MNIST dataset with two choices of step-size. Here n = 12×10³, m = 500, and the total number of parameters p ≈ 6×10⁵. Figure 2: Iteration-based distance from initialization, training loss, test loss and generalization gap for training a two-hidden-layer neural network on the Fashion MNIST dataset with m = 250, 500. Here n = 4×10³, p ≈ 2×10⁵ (blue line), 6×10⁵ (red line), and η = 0.02. Figure 3: Iteration-based distance from initialization, training loss, test loss and generalization gap for training a two-hidden-layer neural network on the MNIST dataset with m = 300, 600. Here n = 2×10³, p ≈ 3×10⁵ (blue line), 8×10⁵ (red line), and η = 0.02. Figure 4: Left: Misclassification error by iteration in learning the d-dimensional XOR distribution with SGD. Right: Total number of SGD steps versus data dimension to reach approximately zero test error. In particular, we fix n = 6d, η = m = 20 and set the total number of SGD steps as T = ⌈log(d)⌉. Note that the number of iterations required to reach perfect accuracy grows with d. The right side of Figure 4 provides further insight into the relationship between dimensionality and convergence rate: it displays the total number of SGD steps required to reach a test error below 0.01 for different values of d, using n = 3d, m = η = 20.
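For context on the Figure 4 experiments, here is one common construction of a d-dimensional XOR distribution: sample x uniformly from the hypercube {−1, +1}^d and label it by the product (XOR) of its first two coordinates. This construction is an assumption for illustration; the paper's exact data model is the one specified in its Theorem 2.4.

```python
import numpy as np

def sample_xor(n, d, rng):
    """Sample n points of a d-dimensional XOR-type distribution (illustrative):
    x uniform on {-1, +1}^d, label y = x_1 * x_2."""
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X[:, 0] * X[:, 1]
    return X, y
```

Only the first two coordinates carry signal; the remaining d − 2 act as noise dimensions, which is what makes the problem harder as d grows.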
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | We consider binary classification with a 2-hidden-layer network with softplus activation (σ(t) = log(1 + eᵗ)) trained by the logistic loss function. Figure 1 presents the train, test and generalization behavior of GD for learning such a model on the Fashion MNIST dataset. The two lines in each figure correspond to η = 0.01, 0.1. Experiments on learning the XOR distribution with large step-size. Figure 4 demonstrates the test error curves associated with learning the XOR distribution according to the setting of Theorem 2.4. In particular, we fix n = 6d, η = m = 20 and set the total number of SGD steps as T = ⌈log(d)⌉.
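The model described in this row (a 2-hidden-layer network with softplus activation and logistic loss) can be sketched in numpy as below. The layer shapes, the scalar output head a, and the absence of biases are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def softplus(t):
    """Numerically stable softplus: log(1 + exp(t))."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def forward(params, X):
    """Two-hidden-layer net: f(x) = a^T softplus(W2 softplus(W1 x))."""
    W1, W2, a = params
    return softplus(softplus(X @ W1.T) @ W2.T) @ a

def logistic_loss(params, X, y):
    """Average logistic loss log(1 + exp(-y f(x))), computed stably."""
    m = y * forward(params, X)
    return np.mean(np.maximum(-m, 0.0) + np.log1p(np.exp(-np.abs(m))))
```

At an all-zero initialization the network output is identically 0, so the logistic loss equals log 2 regardless of the labels, which is a quick sanity check for the implementation.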