Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Authors: Yihao Xue, Jiping Li, Baharan Mirzasoleiman

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
Researcher Affiliation Academia 1Department of Computer Science, University of California, Los Angeles 2Department of Mathematics, University of California, Los Angeles. Correspondence to: Yihao Xue <EMAIL>.
Pseudocode No The paper describes methods using mathematical formulations and theoretical derivations, but no explicitly labeled pseudocode or algorithm blocks are provided.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets Yes We use the GuacaMol (Brown et al., 2019) dataset for pretraining both the strong and weak models. For finetuning, we consider three regression datasets ESOL, FreeSolv, and Lipop from the MoleculeNet (Wu et al., 2018) benchmark, curated by ChemBench (Charleshen, 2020)... We use the Justice and Commonsense datasets from ETHICS (Hendrycks et al., 2020)... on three datasets: (1) SciQ (Welbl et al., 2017), containing crowdsourced science exam questions; (2) Amazon Polarity (Zhang et al., 2015), consisting of Amazon reviews; and (3) CosmosQA (Huang et al., 2019), involving commonsense-based reading comprehension.
Dataset Splits Yes In W2SG, we have two splits of data sampled from D. The first subset, D = {(x_i, y_i)}_{i=1}^{n}, consists of n i.i.d. samples and is used for finetuning the weak model. The second subset, D̂ = {(x̂_i, ŷ_i)}_{i=1}^{n̂}, with n̂ i.i.d. samples, is used for finetuning the strong model. Note that the weak model's outputs will be used as labels in place of the actual ŷ_i's... Both data preprocessing and finetuning strictly follow (Burns et al., 2023).
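The split-and-relabel scheme quoted above can be illustrated with a minimal toy sketch. Everything here is hypothetical scaffolding, not the paper's code: the "weak model" is a logistic probe trained on a deliberately noisy view of the features, and its outputs replace the held-out ground-truth labels on D̂ before the "strong model" is finetuned.

```python
import numpy as np

# Toy illustration of the W2SG data protocol (illustrative, not the paper's code):
#   D      -> finetunes the weak model,
#   D-hat  -> relabeled by the weak model, then finetunes the strong model.
rng = np.random.default_rng(0)

# Toy binary task: the true label is the sign of the first feature.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] > 0).astype(int)

# Two disjoint i.i.d. splits, as in the quoted setup.
n = 500
D_X, D_y = X[:n], y[:n]            # used to finetune the weak model
Dhat_X, Dhat_y = X[n:], y[n:]      # true labels here are held out

def fit_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic regression by full-batch gradient descent (stand-in for finetuning)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# The weak model only sees a noisy view of the features (a weaker representation).
w_weak = fit_linear_probe(D_X + rng.normal(scale=2.0, size=D_X.shape), D_y)

# The weak model labels D-hat from its own noisy view; these pseudo-labels
# replace the actual labels when finetuning the strong model.
pseudo = (((Dhat_X + rng.normal(scale=2.0, size=Dhat_X.shape)) @ w_weak) > 0).astype(int)
w_strong = fit_linear_probe(Dhat_X, pseudo)

acc_weak = np.mean(pseudo == Dhat_y)                               # weak supervisor accuracy
acc_strong = np.mean(((Dhat_X @ w_strong) > 0).astype(int) == Dhat_y)
print(f"weak acc: {acc_weak:.2f}, strong acc: {acc_strong:.2f}")
```

With the clean representation, the strong probe averages out the weak supervisor's label errors, which is the weak-to-strong effect the paper's representation-based analysis concerns.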
Hardware Specification No The paper mentions conducting experiments with 'transformers' and 'LLMs' but does not specify any particular hardware like GPU models, CPU types, or cloud computing instances with their specifications.
Software Dependencies No The paper mentions using MolBERT (Fabian et al., 2020), BERT (Devlin, 2018), the Adam optimizer (Kingma, 2014), nvidia/NV-Embed-v2 (Lee et al., 2024), and Qwen/Qwen-7B (Bai et al., 2023). However, specific version numbers for software dependencies or libraries are not provided.
Experiment Setup Yes D.1.1. Molecular prediction. ...batch size of 256. For finetuning, we use SGD to train a linear model on representations with the following settings: batch size = 1024, learning rate = 0.001, weight decay = 0.1, and epochs = 2000... D.1.2. NLP tasks with embedding models. ...batch size = 200, learning rate = 0.01, weight decay = 0.00001, and epochs = 200. D.1.3. NLP tasks with end-to-end finetuned LLMs. ...batch size of 32 and trains for a single epoch. The learning rate is set to 5e-5 for weak models and 1e-5 for the strong model...
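The D.1.1 finetuning stage (a linear model trained with SGD on frozen representations) can be sketched as follows. The paper does not publish its training loop, so this is an assumption-laden reconstruction using the stated hyperparameters (batch size 1024, learning rate 0.001, weight decay 0.1, 2000 epochs); the representations and targets below are random stand-ins.

```python
import numpy as np

# Hedged sketch of D.1.1's finetuning: minibatch SGD on a linear regression
# head over frozen representations, using the hyperparameters quoted above.
# Data below is synthetic; only the optimization settings come from the paper.
rng = np.random.default_rng(0)

# Stand-in for precomputed molecular representations and regression targets
# (e.g. solubility values); dimensions are arbitrary choices.
n_samples, dim = 2048, 64
reps = rng.normal(size=(n_samples, dim))
targets = reps @ rng.normal(size=dim) * 0.1 + rng.normal(scale=0.01, size=n_samples)

batch_size, lr, weight_decay, epochs = 1024, 1e-3, 0.1, 2000
w = np.zeros(dim)

for _ in range(epochs):
    idx = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        b = idx[start:start + batch_size]
        err = reps[b] @ w - targets[b]                      # MSE residual
        grad = reps[b].T @ err / len(b) + weight_decay * w  # L2 (weight decay) term
        w -= lr * grad

mse = np.mean((reps @ w - targets) ** 2)
print(f"train MSE: {mse:.4f}")
```

With weight decay folded into the gradient this is plain L2-regularized SGD; the actual experiments train for 2000 epochs on real representation extracts, so runtimes differ.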