Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks
Authors: Mahsa Taheri, Fang Xie, Johannes Lederer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide here some numerical observations to clarify theories of Section 2 and Section 3. We minimize a least-squares objective complemented by ℓ1-regularization for shallow neural networks with linear and ReLU activation functions. We set our tuning parameter on the order of √(log(np)/n) based on our experiments. We consider neural networks with d = w = 10, that are trained over 500 and tested over 300 data samples generated from a standard normal distribution and labeled by a sparse-target network (having the same structure as the considered model) plus Gaussian noise. ... We report the relative training error and the relative test error for a potential global optimum, an approximate stationary point, and a randomly generated network... |
| Researcher Affiliation | Academia | Mahsa Taheri EMAIL Department of Mathematics University of Hamburg; Fang Xie EMAIL Guangdong Provincial Key Laboratory of IRADS Beijing Normal-Hong Kong Baptist University; Johannes Lederer EMAIL Department of Mathematics University of Hamburg |
| Pseudocode | No | The paper describes mathematical derivations and theoretical proofs, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured, code-like steps. |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We applied our method to the MNIST, fashion-MNIST, and K-MNIST datasets using cross-entropy loss, with a neural network consisting of 10 layer weight matrices and ReLU activations, with network width 50. |
| Dataset Splits | Yes | We consider neural networks with d = w = 10, that are trained over 500 and tested over 300 data sample generated from a standard normal distribution and labeled by a sparse-target network... |
| Hardware Specification | Yes | All the simulations were executed on a local computer (Apple M2, 16GB memory), with an average run time of less than 10 minutes per individual run in Python. |
| Software Dependencies | No | We use PyTorch's default initialization, where weights are drawn from a uniform distribution in [−1/√p, 1/√p]... We use stochastic gradient descent with a small convergence threshold... All the simulations were executed... in Python. For optimization, we employed SGD with the learning rate 0.02. ... Specifically, we replaced SGD with Adam, using a learning rate of 0.005... |
| Experiment Setup | Yes | We set our tuning parameter on the order of √(log(np)/n) based on our experiments. ... We use stochastic gradient descent with a small convergence threshold to ensure that the optimization process does not stop early. ... We use PyTorch's default initialization, where weights are drawn from a uniform distribution in [−1/√p, 1/√p]... For optimization, we employed SGD with the learning rate 0.02. ... Specifically, we replaced SGD with Adam, using a learning rate of 0.005... |
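The setup quoted in the table can be sketched roughly as follows. This is a minimal NumPy reconstruction, not the authors' PyTorch code: the noise level, number of iterations, sparsity pattern of the target network, and the exact form of the tuning parameter λ are assumptions for illustration; only d = w = 10, the 500/300 train/test split, the ℓ1-regularized least-squares objective, the ±1/√p uniform initialization, and the learning rate 0.02 come from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 10, 10                    # input dimension and width, as in the report
n_train, n_test = 500, 300       # train/test sizes, as in the report

# Sparse-target network with the same shallow ReLU structure (sparsity
# pattern is an assumption for illustration).
W_star = np.zeros((w, d)); W_star[:3, :3] = rng.normal(size=(3, 3))
v_star = np.zeros(w); v_star[:3] = rng.normal(size=3)

def forward(X, W, v):
    """Shallow ReLU network: v^T ReLU(W x)."""
    return np.maximum(X @ W.T, 0.0) @ v

# Standard-normal inputs, labels from the sparse target plus Gaussian noise
# (noise scale 0.1 is an assumption).
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = forward(X_tr, W_star, v_star) + 0.1 * rng.normal(size=n_train)
y_te = forward(X_te, W_star, v_star) + 0.1 * rng.normal(size=n_test)

p = d * w + w                                  # total parameter count (assumption)
lam = np.sqrt(np.log(n_train * p) / n_train)   # tuning parameter ~ sqrt(log(np)/n)

# PyTorch-style default init: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)].
W = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(w, d))
v = rng.uniform(-1 / np.sqrt(w), 1 / np.sqrt(w), size=w)

# Full-batch (sub)gradient descent on least squares + l1 penalty, lr 0.02.
lr = 0.02
for _ in range(2000):
    H = np.maximum(X_tr @ W.T, 0.0)            # hidden activations, n x w
    r = H @ v - y_tr                           # residuals
    grad_v = H.T @ r / n_train + lam * np.sign(v)
    grad_W = ((np.outer(r, v) * (H > 0)).T @ X_tr) / n_train + lam * np.sign(W)
    v -= lr * grad_v
    W -= lr * grad_W

# Relative errors of the resulting approximate stationary point.
rel_train = np.linalg.norm(forward(X_tr, W, v) - y_tr) / np.linalg.norm(y_tr)
rel_test = np.linalg.norm(forward(X_te, W, v) - y_te) / np.linalg.norm(y_te)
```

Full-batch subgradient steps stand in for the paper's SGD-with-convergence-threshold loop; the reported comparison against a random network amounts to evaluating `rel_test` at the initialization before the loop runs.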