Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks

Authors: Mahsa Taheri, Fang Xie, Johannes Lederer

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide here some numerical observations to clarify the theories of Section 2 and Section 3. We minimize a least-squares loss complemented by ℓ1-regularization for shallow neural networks with linear and ReLU activation functions. We set our tuning parameter on the order of log(np)/n based on our experiments. We consider neural networks with d = w = 10 that are trained over 500 and tested over 300 data samples generated from a standard normal distribution and labeled by a sparse-target network (having the same structure as the considered model) plus Gaussian noise. ... We report the relative training error and the relative test error for a potential global optimum, an approximate stationary point, and a randomly generated network...
Researcher Affiliation | Academia | Mahsa Taheri (EMAIL), Department of Mathematics, University of Hamburg; Fang Xie (EMAIL), Guangdong Provincial Key Laboratory of IRADS, Beijing Normal-Hong Kong Baptist University; Johannes Lederer (EMAIL), Department of Mathematics, University of Hamburg
Pseudocode | No | The paper describes mathematical derivations and theoretical proofs, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps.
Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | We applied our method to the MNIST, fashion-MNIST, and K-MNIST datasets using cross-entropy loss, with a neural network consisting of 10-layer weight matrices and ReLU activations, with network width 50.
Dataset Splits | Yes | We consider neural networks with d = w = 10 that are trained over 500 and tested over 300 data samples generated from a standard normal distribution and labeled by a sparse-target network...
Hardware Specification | Yes | All the simulations were executed on a local computer (Apple M2, 16GB memory), with an average run time of less than 10 minutes per individual run in Python.
Software Dependencies | No | We use PyTorch's default initialization, where weights are drawn from a uniform distribution in [−1/√p, 1/√p]... We use stochastic gradient descent with a small convergence threshold... All the simulations were executed... in Python. For optimization, we employed SGD with the learning rate 0.02. ... Specifically, we replaced SGD with Adam, using a learning rate of 0.005...
Experiment Setup | Yes | We set our tuning parameter on the order of log(np)/n based on our experiments. ... We use stochastic gradient descent with a small convergence threshold to ensure that the optimization process does not stop early. ... We use PyTorch's default initialization, where weights are drawn from a uniform distribution in [−1/√p, 1/√p]... For optimization, we employed SGD with the learning rate 0.02. ... Specifically, we replaced SGD with Adam, using a learning rate of 0.005...
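The synthetic experiment quoted above (ℓ1-regularized least squares for a shallow ReLU network with d = w = 10, 500 training and 300 test samples from a standard normal distribution, labels from a sparse teacher network plus Gaussian noise, learning rate 0.02, PyTorch-style uniform initialization, and a small convergence threshold) can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the noise level, sparsity pattern, convergence threshold, iteration cap, and the use of full-batch subgradient descent in place of SGD are all assumptions, and the tuning parameter simply follows the quoted order log(np)/n.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions as quoted in the paper: input dimension d = width w = 10,
# 500 training and 300 test samples from a standard normal distribution.
d, w, n_train, n_test = 10, 10, 500, 300

# Hypothetical sparse teacher network with the same shallow structure
# (the paper's exact sparsity pattern is not specified).
W_star = np.zeros((w, d))
W_star[:3, :3] = rng.normal(size=(3, 3))
v_star = rng.normal(size=w)

def forward(X, W, v):
    """Shallow ReLU network f(x) = v^T ReLU(W x)."""
    return np.maximum(X @ W.T, 0.0) @ v

X_train = rng.normal(size=(n_train, d))
y_train = forward(X_train, W_star, v_star) + 0.5 * rng.normal(size=n_train)  # assumed noise level

# Tuning parameter on the order of log(np)/n, with p the total parameter count.
p = W_star.size + v_star.size
lam = np.log(n_train * p) / n_train

# Initialization mimicking PyTorch's default: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)].
W = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(w, d))
v = rng.uniform(-1 / np.sqrt(w), 1 / np.sqrt(w), size=w)

lr, tol = 0.02, 1e-6  # learning rate 0.02 as quoted; tol is an assumed convergence threshold
losses = []
for step in range(5000):
    H = np.maximum(X_train @ W.T, 0.0)                 # hidden activations
    r = H @ v - y_train                                # residuals
    loss = r @ r / n_train + lam * (np.abs(W).sum() + np.abs(v).sum())
    if losses and abs(losses[-1] - loss) < tol:        # stop only once updates stall
        break
    losses.append(loss)
    mask = (X_train @ W.T > 0).astype(float)           # ReLU subgradient indicator
    gv = 2 * H.T @ r / n_train + lam * np.sign(v)
    gW = 2 * (mask * np.outer(r, v)).T @ X_train / n_train + lam * np.sign(W)
    v -= lr * gv
    W -= lr * gW

# Relative test error against the noiseless teacher labels.
X_test = rng.normal(size=(n_test, d))
y_test = forward(X_test, W_star, v_star)
rel_test = np.linalg.norm(forward(X_test, W, v) - y_test) / np.linalg.norm(y_test)
```

Full-batch subgradient steps are used here for brevity; the paper's reported runs use SGD (and, in one variant, Adam with learning rate 0.005), so the iterates above illustrate the objective rather than reproduce the reported errors.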