Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

Authors: Taj Jones-Mccormick, Aukosh Jagannath, Subhabrata Sen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5.2. Simulations: We conduct a few simulations to empirically demonstrate our claims in finite dimensions. In the first simulation, we consider letting f(x) = x^3 − 3x and setting λ = 1, η_1 = 0.45. We then conduct SGD from both random initializations and from estimates of v obtained via PCA. We use dimension d = 1000 and let SGD run for (3/2)d^2 = 1,500,000 steps of size 1/(10d^2) = (10,000,000)^{-1}. We select the parameters such that we would expect to be able to recover the true parameter vector from a random initialization had we been in the case λ = 0. We determine this scaling based on the results of Ben Arous et al. (2021) and some experimenting. See Figure 1.
Researcher Affiliation | Academia | 1Department of Statistics and Actuarial Science, University of Waterloo, Canada 2Cheriton School of Computer Science, University of Waterloo, Canada 3Department of Statistics, Harvard University, United States of America. Correspondence to: Taj Jones-Mccormick <EMAIL>, Aukosh Jagannath <EMAIL>, Subhabrata Sen <EMAIL>.
Pseudocode | No | The paper describes the Stochastic Gradient Descent (SGD) updates using mathematical equations, but it does not present these as a clearly labeled algorithm block or in pseudocode format. For example, it defines the update rule as X_{t+1} = X_t − (δ/d) ∇L(X_t, y) / ||∇L(X_t, y)||, but this is embedded within the text describing the method rather than presented as a distinct algorithm.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository in the main text or supplementary sections.
Open Datasets | No | The paper describes generating synthetic data based on single-index models with Gaussian features and spiked covariance for its theoretical analysis and simulations, stating: 'Let the labeled data be (y_i, a_i)_{i=1}^N, with each (y_i, a_i) independent and identically distributed.' It does not refer to any specific publicly available dataset by name or provide any access information (links, DOIs, or citations to existing public datasets).
Dataset Splits | No | The paper uses synthetic data generated based on a model for its analysis and simulations. While it refers to the 'total number of steps (and samples of (y_i, a_i)) given by N = α_d d', it does not specify explicit training, validation, or test dataset splits. The data is generated on the fly for the purpose of the theoretical and simulation analysis, without detailing a reproducible splitting methodology for evaluation.
Hardware Specification | No | The paper describes simulations in Section 5.2, but it does not provide any specific details about the hardware used to run these experiments (e.g., GPU models, CPU types, memory specifications). It only mentions the dimension `d = 1000` for the simulations.
Software Dependencies | No | The paper does not provide any specific ancillary software details, such as programming languages, libraries, or solvers with version numbers, that were used for the implementation or experiments.
Experiment Setup | Yes | In Section 5.2 'Simulations', the paper explicitly details parameters for the experiments: 'We consider letting f(x) = x^3 − 3x and setting λ = 1, η_1 = 0.45. We use dimension d = 1000 and let SGD run for (3/2)d^2 = 1,500,000 steps of size 1/(10d^2) = (10,000,000)^{-1}.' It also mentions initializing 'from both random initializations and from estimates of v obtained via PCA.'
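The normalized-gradient update rule quoted in the Pseudocode row, X_{t+1} = X_t − (δ/d) ∇L(X_t, y)/||∇L(X_t, y)||, can be sketched in a few lines. This is only an illustration of that formula as it appears in the report, not the authors' implementation; the function name `sgd_step` and the toy loss gradient are hypothetical.

```python
import numpy as np

def sgd_step(x, grad, delta, d):
    """Normalized-gradient step: X_{t+1} = X_t - (delta/d) * grad / ||grad||.

    The step length is always delta/d, regardless of the gradient's magnitude,
    because the gradient is rescaled to unit norm before being applied.
    """
    return x - (delta / d) * grad / np.linalg.norm(grad)

# Toy check: with grad = x, each step shrinks ||x|| by exactly delta/d.
x = np.ones(4)                              # ||x|| = 2
x_new = sgd_step(x, grad=x, delta=0.1, d=4)  # ||x_new|| = 2 - 0.1/4 = 1.975
```

The normalization makes the iterate move a fixed distance delta/d per step, which is what makes the (3/2)d^2-steps-of-size-1/(10d^2) scaling in the Experiment Setup row meaningful.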
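The simulation setup in the Research Type and Experiment Setup rows (spiked Gaussian features, single-index labels with f(x) = x^3 − 3x, and SGD initialized either randomly or from a PCA estimate of v) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the exact spiked-covariance form of the features, the sample size used for PCA, and the reduced dimension are all assumptions made here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 100, 1.0                 # the paper uses d = 1000; smaller here for speed

# Assumed data model: spiked Gaussian features a = z + sqrt(lam) * g * v with
# z ~ N(0, I_d), g ~ N(0, 1), and single-index labels y = f(<a, v>).
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
f = lambda x: x**3 - 3 * x

def sample(n):
    z = rng.standard_normal((n, d))
    g = rng.standard_normal(n)
    a = z + np.sqrt(lam) * np.outer(g, v)
    return a, f(a @ v)

# PCA initialization: top eigenvector of the sample covariance of the features
# (labels are not used, which is what makes this pre-training unsupervised).
A, _ = sample(5 * d)
eigvals, eigvecs = np.linalg.eigh(A.T @ A / len(A))
x_pca = eigvecs[:, -1]

# Random initialization: uniform on the unit sphere.
x_rand = rng.standard_normal(d)
x_rand /= np.linalg.norm(x_rand)

# The PCA start has macroscopic overlap with v; the random start's overlap
# is only of order 1/sqrt(d).
print(abs(x_pca @ v), abs(x_rand @ v))
```

Running SGD for (3/2)d^2 steps of size 1/(10d^2) from each of these two starting points, as the quoted passage describes, would then compare how the two initializations recover v.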