Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks

Authors: Fanghui Liu, Leello Dadi, Volkan Cevher

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct numerical experiments to validate our theoretical results from the perspective of the convergence rate of the excess risk. To validate whether the derived (sharper) convergence rate is attainable, we construct a simple synthetic dataset with a known f_ρ in the over-parameterized regime. To be specific, we assume that the data are sampled from a standard Gaussian distribution, i.e., x ~ N(0, I_d), and normalized so that ||x||_2 = 1. The feature dimension is d = 3, a low-dimensional setting chosen to ensure that P in Theorem 14 is not large, as mentioned before. We set the number of training points to range from 10 to 1000 while the number of test points is held fixed at 20. Albeit simple, such an experimental setting still works in the over-parameterized regime; see Table 1 (Left). We consider the noiseless case, where the target function is generated by a single ReLU, i.e., y = f_ρ(x) = σ(⟨w, x⟩) with w ~ N(0, I_d). The regularization parameter is set to λ = 10^-8 for both methods: kernel ridge regression via the NTK and the path-norm-based algorithm. We solve the convex program in Eq. (17) using CVX (Grant and Boyd, 2014) to obtain the exact global minimum and then compute the test MSE for regression over 5 runs. The (middle) figure of Table 1 shows that, when learning a single ReLU beyond the RKHS, our algorithm still achieves the same convergence rate as the NTK in the RKHS regime. This is because the input dimension d = 3 is not large, so there is no significant difference in the convergence rate. Besides, we also conduct this experiment on a real-world dataset, the UCI ML Breast Cancer dataset, with 569 samples and dimension d = 30. We use 80% of the samples for training and 20% for testing. Here the number of training data ranges from 40 to 300, and the number of test data ranges from 10 to 75, accordingly. The remaining experimental settings are the same as for the synthetic dataset.
The (right) figure of Table 1 shows that, when increasing the number of training data, the test MSE of the NTK decreases only slightly, whereas the path-norm-based algorithm achieves a significantly lower test MSE, which demonstrates the attainability of our theoretical results. Nevertheless, we also need to point out that the path-norm-based algorithm is quite inefficient and unstable when compared to the NTK: its performance relies on an extremely accurate solution from CVX, which restricts the utility of this convex program in practice. Additionally, we remark that we do not claim this algorithm is better than SGD.
Researcher Affiliation | Academia | Fanghui Liu, Department of Computer Science, University of Warwick, Coventry, UK; Leello Dadi, Lab for Information and Inference Systems, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; Volkan Cevher, Lab for Information and Inference Systems, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Pseudocode | No | The paper describes a 'computational algorithm' in Section 4.4 and Appendix B, with mathematical formulations of the optimization problems (e.g., Eqs. 10 and 17). However, it does not provide a clearly labeled 'Pseudocode' or 'Algorithm' block with the structured, step-by-step instructions typical of pseudocode.
Open Source Code | No | The paper mentions using 'CVX (Grant and Boyd, 2014)' in the numerical validation section, which is a third-party tool. There is no explicit statement about releasing the authors' own implementation code, nor any links to a code repository.
Open Datasets | Yes | Besides, we also conduct this experiment on a real-world dataset, the UCI ML Breast Cancer dataset, with 569 samples and dimension d = 30.
Dataset Splits | Yes | We set the number of training points to range from 10 to 1000 while the number of test points is held fixed at 20. [...] We use 80% of the samples for training and 20% for testing. Here the number of training data ranges from 40 to 300, and the number of test data ranges from 10 to 75, accordingly.
Hardware Specification | No | The paper includes a section on 'Numerical Validation' but does not specify any hardware components such as GPU/CPU models, memory, or specific computing environments used for the experiments.
Software Dependencies | No | We solve the convex program in Eq. (17) using CVX (Grant and Boyd, 2014). The paper mentions CVX but does not provide a version number for it or for any other software used.
Experiment Setup | Yes | The regularization parameter is set to λ = 10^-8 for both methods: kernel ridge regression via the NTK and the path-norm-based algorithm. We set the number of training points to range from 10 to 1000 while the number of test points is held fixed at 20. [...] We use 80% of the samples for training and 20% for testing. Here the number of training data ranges from 40 to 300, and the number of test data ranges from 10 to 75, accordingly.
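As a rough illustration of the synthetic protocol quoted above (Gaussian inputs normalized to the unit sphere, a noiseless single-ReLU target, and kernel ridge regression via the NTK with λ = 10^-8), the sketch below reproduces the data generation and the NTK baseline only. The closed-form two-layer ReLU NTK used here is one standard formulation for unit-norm inputs and is our assumption, as are the seed and the sample size of 200; the paper's path-norm convex program (Eq. 17, solved with CVX) is not reproduced.

```python
import numpy as np

def ntk_two_layer_relu(X1, X2):
    """One common closed form of the two-layer ReLU NTK on unit-norm inputs
    (an assumption for illustration, not necessarily the paper's exact kernel)."""
    u = np.clip(X1 @ X2.T, -1.0, 1.0)          # pairwise cosines
    kappa0 = (np.pi - np.arccos(u)) / (2 * np.pi)
    kappa1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / (2 * np.pi)
    return u * kappa0 + kappa1

rng = np.random.default_rng(0)                  # hypothetical seed
d, n_train, n_test, lam = 3, 200, 20, 1e-8

def make_data(n):
    # x ~ N(0, I_d), then normalized so that ||x||_2 = 1
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Noiseless single-ReLU target: y = sigma(<w, x>) with w ~ N(0, I_d)
w = rng.standard_normal(d)
X_tr, X_te = make_data(n_train), make_data(n_test)
y_tr = np.maximum(X_tr @ w, 0.0)
y_te = np.maximum(X_te @ w, 0.0)

# Kernel ridge regression with the NTK (the paper's baseline method)
K = ntk_two_layer_relu(X_tr, X_tr)
alpha = np.linalg.solve(K + lam * n_train * np.eye(n_train), y_tr)
y_pred = ntk_two_layer_relu(X_te, X_tr) @ alpha
mse = np.mean((y_pred - y_te) ** 2)
print(f"test MSE: {mse:.2e}")
```

Sweeping `n_train` over the paper's range (10 to 1000) and plotting the resulting test MSE would give the convergence-rate curve that Table 1 (middle) reports for the NTK.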