Learning Hierarchical Polynomials of Multiple Nonlinear Features
Authors: Hengyu Fu, Zihao Wang, Eshaan Nichani, Jason Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Appendix A (Numerical Experiments): We empirically verify Theorem 1 and Proposition 1. [...] The left panel of Figure 2 demonstrates that our model outperforms the naive random-feature model across all dimensions. |
| Researcher Affiliation | Academia | Peking University. Stanford University. Princeton University. |
| Pseudocode | Yes | Algorithm 1 Layer-wise training algorithm |
| Open Source Code | No | No explicit statement about code availability or a repository link was found in the paper. |
| Open Datasets | No | Data distribution: Our aim is to learn the target function f : X → ℝ, with X ⊆ ℝ^d being the input space. Throughout the paper, we assume X = 𝕊^{d−1}(√d), that is, the sphere of radius √d in d dimensions. Also, we consider the data distribution to be the uniform distribution on the sphere, i.e., x ∼ Unif(X), and we draw two independent datasets D1, D2, each with n1 and n2 i.i.d. samples, respectively. |
| Dataset Splits | Yes | Data distribution [...] we draw two independent datasets D1, D2, each with n1 and n2 i.i.d. samples, respectively. Thus, we draw n1 + n2 samples in total. [...] Training Algorithm Following Nichani et al. (2023), our network is trained via layer-wise gradient descent with sample splitting. [...] Algorithm 1 Layer-wise training algorithm Input: Learning rates η1, η2, weight decay λ1, λ2, parameter ϵ, number of steps T [...] 2 train W on dataset D1 [...] 8 train a on dataset D2 |
| Hardware Specification | No | No specific hardware details (like exact GPU/CPU models, processor types, or memory amounts) were found in the paper's experimental setup or any other section. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver names with versions) were found in the paper. |
| Experiment Setup | Yes | Input: Learning rates η1, η2, weight decay λ1, λ2, parameter ϵ, number of steps T [...] We initialize each row of V to be drawn uniformly on the sphere of radius √d [...] For the network architecture, we choose σ1 as per (2) and σ2 = Q2, with network sizes set to m1 = 10000 and m2 = 20000. [...] For the right panel, we conduct transfer learning with n1 = 2^16 pretraining samples and plot the dependence on n2. The figure reports the mean and normalized standard error of the test error using 10,000 fresh samples, based on 5 independent experimental instances. |
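The setup quoted above (data uniform on the sphere 𝕊^{d−1}(√d), sample splitting into D1/D2, and layer-wise training of first the hidden weights and then the output layer) can be illustrated with a minimal numpy sketch. Everything here is a hedged stand-in, not the paper's procedure: the target function, the ReLU activation (the paper uses specific σ1, σ2), the network width, the learning rates, and the closed-form ridge fit for the second stage (the paper trains the output layer `a` by gradient descent) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: x ~ Unif(S^{d-1}(sqrt(d))), with sample splitting into D1 and D2.
d, n1, n2 = 20, 512, 512

def sample_sphere(n, d):
    """Draw n points uniformly on the sphere of radius sqrt(d) in R^d."""
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

X1, X2 = sample_sphere(n1, d), sample_sphere(n2, d)

def target(X):
    """Toy hierarchical target: a polynomial of one nonlinear feature
    (purely illustrative; not the paper's target class)."""
    h = X[:, 0] * X[:, 1]        # hypothetical nonlinear feature
    return h ** 2 - 1.0

y1, y2 = target(X1), target(X2)

# Two-layer model f(x) = a . relu(W x), trained layer-wise.
m, eta1, lam1, T = 64, 1e-2, 1e-3, 200
W = rng.standard_normal((m, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=m) / m     # second layer held fixed in stage 1
relu = lambda z: np.maximum(z, 0.0)

# Stage 1: gradient descent with weight decay on W, using D1 only.
for _ in range(T):
    Z = X1 @ W.T                              # (n1, m) pre-activations
    resid = relu(Z) @ a0 - y1                 # (n1,) residuals
    grad_W = ((resid[:, None] * (Z > 0) * a0[None, :]).T @ X1) / n1
    W -= eta1 * (grad_W + lam1 * W)

# Stage 2: fit the output layer a on the held-out split D2.
# (Closed-form ridge regression here, standing in for the gradient phase.)
Phi = relu(X2 @ W.T)
lam2 = 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam2 * np.eye(m), Phi.T @ y2)

# Evaluate on fresh samples, mirroring the paper's use of fresh test data.
Xt = sample_sphere(1000, d)
test_err = np.mean((relu(Xt @ W.T) @ a - target(Xt)) ** 2)
print(f"test MSE: {test_err:.4f}")
```

The key structural point the sketch preserves is the sample splitting: W never sees D2 and a never sees D1, which is what makes the layer-wise analysis in Algorithm 1 tractable.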