Provable weak-to-strong generalization via benign overfitting
Authors: David Wu, Anant Sahai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically investigate weak-to-strong generalization for binary and multilabel classification in a stylized overparameterized spiked covariance model with Gaussian covariates, where the weak teacher's pseudolabels are asymptotically like random guessing. Under these assumptions, we provably identify two asymptotic phases of the strong student's generalization after weak supervision: (1) successful generalization and (2) random guessing. Our techniques should eventually extend to weak-to-strong multiclass classification. Towards doing so, we prove a tight lower tail inequality for the maximum of correlated Gaussians, which may be of independent interest. ... The regimes where our theorem applies are depicted in Figures 2a and 2b, and we validated our theory with numerical simulations of MNI with n = 50 in Figures 2c and 2d; see Appendix F for more details on the experiments. |
| Researcher Affiliation | Academia | David X. Wu Department of EECS UC Berkeley Berkeley, CA 94720 EMAIL Anant Sahai Department of EECS UC Berkeley Berkeley, CA 94720 EMAIL |
| Pseudocode | Yes | Procedure 1 (Weak-to-strong training). The weak learner observes an initial dataset of n datapoints (x̃_{i,weak}, y_i), i ∈ [n], where x̃_{i,weak} are the weak features for the i-th datapoint and y_i = sgn(⟨g_i, v⟩) is the corresponding clean hard label. We train f_weak ∈ ℝ^{d_weak} using MNI on these n clean datapoints. Then, both learners observe m extra unlabeled datapoints, where the weak model sees weak features (x_{j,weak}), j ∈ [m], and the strong model sees the corresponding strong features (x_{j,strong}), j ∈ [m]. Generate m hard pseudolabels via ŷ_{j,weak} = sgn(⟨f_weak, x_{j,weak}⟩), and use MNI to train f_w2s ∈ ℝ^d on (x_{j,strong}, ŷ_{j,weak}), j ∈ [m]. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code, nor does it provide a link to a code repository. |
| Open Datasets | No | We generated Gaussian data following the subset ensembles specified in the figures, and constructed two linear models from them: the MNI classifier and the simple averaging classifier. |
| Dataset Splits | No | The paper specifies using 'n labeled datapoints' for training f_weak, 'm = n^u unlabeled datapoints' for training f_w2s, and 'n_test = 100 fresh datapoints' for evaluation. However, it does not explicitly provide traditional train/test/validation split percentages or specific counts for a single, pre-existing dataset. The data is generated for the simulations, rather than being split from an existing dataset. |
| Hardware Specification | No | The paper does not contain any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the simulations. |
| Software Dependencies | No | The paper does not contain any specific details about software dependencies, such as programming languages or library versions, used for implementing the methods or running the simulations. |
| Experiment Setup | Yes | We ran 8 independent trials to train f_weak with n = 50 so that we could explore how the weak-to-strong behavior scales with p and u. For each f_weak, we conducted 16 independent trials to train f_w2s. We swept u over five equally spaced points in [1, 1.3]. In Figures 3 and 4, we show the results of the averaging and MNI experiments, respectively, for four different slices. |
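Since the paper releases no code, the following is a minimal sketch of Procedure 1 (weak-to-strong training via minimum-norm interpolation, MNI) as described in the pseudocode cell above. All concrete parameters here (dimensions `d_weak`, `d`, the spike strength, and the identity of the spiked coordinate / ground-truth direction `v`) are illustrative assumptions, not the paper's exact spiked-covariance ensemble; only n = 50, m = n^u, u ∈ [1, 1.3], and the 100 fresh test points come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mni_fit(X, y):
    # Minimum-norm interpolator: the least-norm f solving X f = y in the
    # overparameterized regime (rows < columns), computed via the pseudoinverse.
    return np.linalg.pinv(X) @ y

# Illustrative sizes (assumed, except n = 50 and m = n^u from the paper).
n, u, d_weak, d = 50, 1.2, 2000, 4000
m = int(n ** u)

v = np.zeros(d)
v[0] = 1.0                      # assumed ground-truth direction
spike = np.sqrt(d)              # assumed spike strength

def sample(k):
    # Toy stand-in for the spiked covariance model: isotropic Gaussian
    # with an inflated first coordinate.
    g = rng.standard_normal((k, d))
    g[:, 0] *= spike
    return g

# Weak learner: trained by MNI on n clean hard labels, seeing only the
# first d_weak coordinates as its "weak features".
X = sample(n)
y = np.sign(X @ v)
f_weak = mni_fit(X[:, :d_weak], y)

# Both learners observe m extra unlabeled points; the weak model emits
# hard pseudolabels from its weak features.
Xu = sample(m)
y_pseudo = np.sign(Xu[:, :d_weak] @ f_weak)

# Strong student: trained by MNI on the pseudolabels over the full features.
f_w2s = mni_fit(Xu, y_pseudo)

# Evaluate on 100 fresh datapoints, as in the paper's simulations.
Xt = sample(100)
acc = np.mean(np.sign(Xt @ f_w2s) == np.sign(Xt @ v))
```

Because m < d, the MNI student interpolates its pseudolabels exactly; the paper's two phases correspond to whether `acc` tends to 1 or to 1/2 as the scaling exponents vary.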