Discrepancies are Virtue: Weak-to-Strong Generalization through the Lens of Intrinsic Dimension
Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments We conduct experiments to validate the theoretical findings on both synthetic and real tasks. In this section, we focus on two illustrative settings: synthetic regression (Section 4.1) and real-world image regression (Section 4.2). For brevity, we defer more experiments on image and sentiment classification tasks to Appendices E.2 and E.3, respectively. |
| Researcher Affiliation | Academia | 1New York University 2Shanghai Jiaotong University 3Princeton University. Correspondence to: Yijun Dong <EMAIL>, Qi Lei <EMAIL>. |
| Pseudocode | No | The paper describes methods and analyses using mathematical formulations and textual explanations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code, nor does it provide links to code repositories or mention code in supplementary materials. |
| Open Datasets | Yes | 4.2. UTKFace regression Beyond the synthetic regression, we investigate W2S on a real-world image regression task age estimation on the UTKFace dataset (Zhang et al., 2017). |
| Dataset Splits | Yes | UTKFace (Aligned & Cropped) (Zhang et al., 2017) consists of 23,708 face images with age labels... We preprocess the images... and split the dataset into training and testing sets of sizes 20,000 and 3,708. |
| Hardware Specification | No | The experiments are supported by the PLI computing cluster. YD acknowledges support of NYU Courant Instructorship. JDL acknowledges support of Open Philanthropy, NSF IIS 2107304, NSF CCF 2212262, NSF CAREER Award 2144994, and NSF CCF 2019844. This material is based upon work supported by the U.S. Department of Energy, Office of Science Energy Earthshot Initiative as part of the project Learning reduced models under extreme data conditions for design and rapid decision-making in complex systems under Award #DE-SC0024721. |
| Software Dependencies | No | We use ridge regression with a small fixed regularization hyperparameter α_w, α_w2s, α_s, α_c = 10^-6, close to the machine epsilon of single-precision floating point numbers. ... We train the models with cross-entropy loss and AdamW optimizer. ... All training is conducted via Adam optimizers (Kingma & Ba, 2014) with a learning rate of 5e-5, a cosine learning rate schedule, and 40 warmup steps. |
| Experiment Setup | Yes | We use ridge regression with a small fixed regularization hyperparameter α_w, α_w2s, α_s, α_c = 10^-6, close to the machine epsilon of single-precision floating point numbers. ... We preprocess the images to 224 x 224 pixels and split the dataset into training and testing sets of sizes 20,000 and 3,708. ... We train the models with cross-entropy loss and AdamW optimizer. We tune the training hyperparameters of weak and strong models using a validation set and train them for 800 steps with a learning rate 1e-3 and weight decay 1e-6. ... All training is conducted via Adam optimizers (Kingma & Ba, 2014) with a learning rate of 5e-5, a cosine learning rate schedule, and 40 warmup steps. We train for 3 epochs, which is sufficient for the train and validation losses to stabilize. |
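Since the paper does not release code, the ridge-regression setup quoted above can only be approximated. The sketch below illustrates the reported configuration (a fixed regularization strength of 10^-6 shared across all fits, and a train/test split) using scikit-learn's `Ridge`. The synthetic features, labels, dimensions, and split sizes here are illustrative assumptions, not the paper's actual data or feature extractors.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: the paper's real features are not released).
n_train, n_test, d = 200, 50, 32
X = rng.normal(size=(n_train + n_test, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Small fixed regularization, matching the 10^-6 value quoted in the report.
model = Ridge(alpha=1e-6)
model.fit(X_train, y_train)

test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
print(f"test MSE: {test_mse:.4f}")
```

With regularization this close to machine epsilon, `Ridge` behaves essentially like ordinary least squares; the tiny `alpha` serves only to keep the normal equations numerically well-conditioned.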