Learning Hierarchical Polynomials of Multiple Nonlinear Features

Authors: Hengyu Fu, Zihao Wang, Eshaan Nichani, Jason Lee

ICLR 2025

Reproducibility variables, each with the assessed result and the LLM response quoted as evidence:
Research Type: Experimental
Evidence: "A. Numerical Experiments: We empirically verify Theorem 1 and Proposition 1. [...] The left panel of Figure 2 demonstrates that our model outperforms the naive random-feature model across all dimensions."
Researcher Affiliation: Academia
Evidence: Peking University, Stanford University, Princeton University (author email addresses redacted).
Pseudocode: Yes
Evidence: "Algorithm 1: Layer-wise training algorithm"
Open Source Code: No
Evidence: No explicit statement about code availability or a repository link was found in the paper.
Open Datasets: No
Evidence: "Data distribution: Our aim is to learn the target function f : X → R, with X ⊆ R^d being the input space. Throughout the paper, we assume X = S^{d−1}(√d), that is, the sphere of radius √d in d dimensions. Also, we consider the data distribution to be the uniform distribution on the sphere, i.e., x ∼ Unif(X), and we draw two independent datasets D1, D2, each with n1 and n2 i.i.d. samples, respectively."
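The sampling scheme quoted above (uniform on the sphere of radius √d, with two independent splits) can be sketched as follows; the function name `sample_sphere` and the split sizes are illustrative, not from the paper:

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from S^{d-1}(sqrt(d)), the sphere of
    radius sqrt(d) in R^d, by normalizing standard Gaussian vectors."""
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d = 32
X1 = sample_sphere(1000, d, rng)  # inputs for split D1 (sizes are toy values)
X2 = sample_sphere(500, d, rng)   # inputs for split D2
```

Normalizing Gaussian vectors gives exactly the uniform distribution on the sphere, since the Gaussian density is rotationally invariant.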
Dataset Splits: Yes
Evidence: "Data distribution: [...] we draw two independent datasets D1, D2, each with n1 and n2 i.i.d. samples, respectively. Thus, we draw n1 + n2 samples in total. [...] Training Algorithm: Following Nichani et al. (2023), our network is trained via layer-wise gradient descent with sample splitting. [...] Algorithm 1: Layer-wise training algorithm. Input: learning rates η1, η2, weight decay λ1, λ2, parameter ϵ, number of steps T [...] (line 2) train W on dataset D1 [...] (line 8) train a on dataset D2"
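The sample-splitting structure of Algorithm 1 (gradient descent on the hidden weights W using D1 only, then fitting the outer weights a on D2) can be sketched as below. The two-layer ReLU network, the ridge-regression second stage, and all names and hyperparameter values are simplifying assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def layerwise_train(X1, y1, X2, y2, m, T, eta1, lam1, lam2, rng):
    # Stage 1: T gradient steps on hidden weights W, using split D1 only.
    d = X1.shape[1]
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.choice([-1.0, 1.0], size=m) / m    # outer weights, frozen in stage 1
    for _ in range(T):
        H = np.maximum(X1 @ W.T, 0.0)          # (n1, m) ReLU features
        resid = H @ a - y1                     # prediction residual, (n1,)
        grad_H = np.outer(resid, a) * (H > 0)  # back-prop through the ReLU
        W -= eta1 * (grad_H.T @ X1 / len(y1) + lam1 * W)
    # Stage 2: fit the outer weights a by ridge regression on split D2.
    H2 = np.maximum(X2 @ W.T, 0.0)
    a = np.linalg.solve(H2.T @ H2 + lam2 * np.eye(m), H2.T @ y2)
    return W, a

rng = np.random.default_rng(0)
X1, X2 = rng.standard_normal((100, 8)), rng.standard_normal((80, 8))
y1, y2 = X1[:, 0], X2[:, 0]                    # toy linear target
W, a = layerwise_train(X1, y1, X2, y2, m=20, T=5,
                       eta1=0.1, lam1=1e-3, lam2=1e-2, rng=rng)
```

Because a is re-fit on the independent split D2, the learned features H2 are statistically independent of the stage-2 labels, which is what makes the sample-splitting analysis tractable.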
Hardware Specification: No
Evidence: No specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) were found in the paper's experimental setup or any other section.
Software Dependencies: No
Evidence: No specific software dependencies with version numbers (e.g., library or solver names with versions) were found in the paper.
Experiment Setup: Yes
Evidence: "Input: learning rates η1, η2, weight decay λ1, λ2, parameter ϵ, number of steps T [...] We initialize each row of V to be drawn uniformly on the sphere of radius √d [...] For the network architecture, we choose σ1 as per (2) and σ2 = Q2, with network sizes set to m1 = 10000 and m2 = 20000. [...] For the right panel, we conduct transfer learning with n1 = 2^16 pretraining samples and plot the dependence on n2. The figure reports the mean and normalized standard error of the test error using 10,000 fresh samples, based on 5 independent experimental instances."