Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks

Authors: Akshay Kumar, Jarvis Haupt

TMLR 2024

Reproducibility Variable — Result — LLM Response
Research Type — Experimental: "For illustration, we provide a brief toy example showing the phenomenon of directional convergence near small initialization. We train a single-layer squared ReLU neural network using gradient descent and small initialization, and provide in Figure 1 a visual depiction of (a) the overall loss and the ℓ2 norm of the network weights, and (b) the angle the weight vectors make with the positive horizontal axis, all as a function of the number of training iterations. (See the figure caption for more specific experimental details.)"
Researcher Affiliation — Academia: Akshay Kumar (EMAIL), Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN; Jarvis Haupt (EMAIL), Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN.
Pseudocode — No: The paper describes methods and proofs using mathematical equations and lemmas, but it does not include any sections explicitly labeled 'Pseudocode' or 'Algorithm,' nor does it present any structured code-like procedures.
Open Source Code — No: The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or supplementary materials containing code.
Open Datasets — No: "For training, we use 50 unit norm inputs and corresponding labels are generated using the function H(x1, x2) = 5 max(0, x1)^2 + 4 max(0, x2)^2. We use square loss and optimize using gradient descent for 50000 iterations with step-size 5e-5. At initialization, the weights of each hidden neuron are drawn from a Gaussian distribution with standard deviation 10^-5."
Dataset Splits — No: The paper describes a generated dataset of "50 unit norm inputs" for illustrative toy examples, but it does not specify any explicit training, validation, or test splits for this data.
Hardware Specification — No: The paper does not specify any hardware details such as GPU models, CPU types, or other computational resources used for the experiments.
Software Dependencies — No: The paper does not specify any software versions for libraries, frameworks, or programming languages used in the experiments.
Experiment Setup — Yes: "We use square loss and optimize using gradient descent for 50000 iterations with step-size 5e-5. At initialization, the weights of each hidden neuron are drawn from a Gaussian distribution with standard deviation 10^-5."
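The quoted setup is nearly complete enough to re-run. The following is a minimal sketch of the toy experiment, with several hedged assumptions not stated in the excerpt: the inputs are 2-dimensional (matching H(x1, x2)), the network output is f(x) = Σ_j max(0, w_j·x)^2 with a hypothetical hidden width of 20, and the duplicated x1 in the printed target function is read as a typo for x2.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 unit-norm inputs; assumption: inputs live in R^2, matching H(x1, x2)
n = 50
theta = rng.uniform(0, 2 * np.pi, n)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (n, 2)

# Labels from the target; second term taken as max(0, x2)^2
# (the excerpt's duplicated x1 appears to be a typo)
y = 5 * np.maximum(0, X[:, 0]) ** 2 + 4 * np.maximum(0, X[:, 1]) ** 2

# Hypothetical parameterization: f(x) = sum_j max(0, w_j . x)^2,
# a single hidden layer of squared-ReLU units whose outputs are summed
k = 20                                   # hidden width (assumption)
W = rng.normal(0, 1e-5, size=(k, 2))     # small Gaussian initialization

lr, iters = 5e-5, 50_000                 # step-size and iterations from the paper
for _ in range(iters):
    act = np.maximum(0, X @ W.T)         # (n, k) ReLU activations
    pred = (act ** 2).sum(axis=1)        # squared-ReLU outputs, summed
    resid = pred - y
    # gradient of 0.5/n * sum resid^2; d pred / d w_j = 2 * max(0, w_j.x) * x
    grad = (2.0 / n) * (resid[:, None] * act).T @ X   # (k, 2)
    W -= lr * grad

# Quantities plotted in the paper's Figure 1: loss, weight norms, and the
# angle each weight vector makes with the positive horizontal axis
pred = (np.maximum(0, X @ W.T) ** 2).sum(axis=1)
loss = 0.5 * np.mean((pred - y) ** 2)
norms = np.linalg.norm(W, axis=1)
angles = np.degrees(np.arctan2(W[:, 1], W[:, 0]))
```

With the tiny initialization, the weight norms stay small for many iterations while the weight directions rotate, which is the directional-convergence phenomenon the paper illustrates; plotting `angles` against the iteration count would reproduce the qualitative behavior of Figure 1(b).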