Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Authors: Yihao Xue, Jiping Li, Baharan Mirzasoleiman

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
Researcher Affiliation Academia 1Department of Computer Science, University of California, Los Angeles 2Department of Mathematics, University of California, Los Angeles. Correspondence to: Yihao Xue <EMAIL>.
Pseudocode No The paper describes methods using mathematical formulations and theoretical derivations, but no explicitly labeled pseudocode or algorithm blocks are provided.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets Yes We use the GuacaMol (Brown et al., 2019) dataset for pretraining both the strong and weak models. For finetuning, we consider three regression datasets ESOL, FreeSolv, and Lipop from the MoleculeNet (Wu et al., 2018) benchmark, curated by ChemBench (Charleshen, 2020)... We use the Justice and Commonsense datasets from ETHICS (Hendrycks et al., 2020)... on three datasets: (1) SciQ (Welbl et al., 2017), containing crowdsourced science exam questions; (2) Amazon Polarity (Zhang et al., 2015), consisting of Amazon reviews; and (3) CosmosQA (Huang et al., 2019), involving commonsense-based reading comprehension.
Dataset Splits Yes In W2SG, we have two splits of data sampled from D. The first subset, D = {(x_i, y_i)}_{i=1}^{n}, consists of n i.i.d. samples and is used for finetuning the weak model. The second subset, D̂ = {(x̂_i, ŷ_i)}_{i=1}^{n̂}, with n̂ i.i.d. samples, is used for finetuning the strong model. Note that the weak model's outputs will be used as labels in place of the actual ŷ_i's... Both data preprocessing and finetuning strictly follow (Burns et al., 2023).
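The split-and-relabel scheme quoted above can be illustrated with a minimal toy sketch. Everything here is hypothetical scaffolding, not the paper's code: the "weak model" is a logistic probe trained on a deliberately noisy view of the features, and its outputs replace the held-out ground-truth labels on D̂ before the "strong model" is finetuned.

```python
import numpy as np

# Toy illustration of the W2SG data protocol (illustrative, not the paper's code):
#   D      -> finetunes the weak model,
#   D-hat  -> relabeled by the weak model, then finetunes the strong model.
rng = np.random.default_rng(0)

# Toy binary task: the true label is the sign of the first feature.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] > 0).astype(int)

# Two disjoint i.i.d. splits, as in the quoted setup.
n = 500
D_X, D_y = X[:n], y[:n]            # used to finetune the weak model
Dhat_X, Dhat_y = X[n:], y[n:]      # true labels here are held out

def fit_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic regression by full-batch gradient descent (stand-in for finetuning)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# The weak model only sees a noisy view of the features (a weaker representation).
w_weak = fit_linear_probe(D_X + rng.normal(scale=2.0, size=D_X.shape), D_y)

# The weak model labels D-hat from its own noisy view; these pseudo-labels
# replace the actual labels when finetuning the strong model.
pseudo = (((Dhat_X + rng.normal(scale=2.0, size=Dhat_X.shape)) @ w_weak) > 0).astype(int)
w_strong = fit_linear_probe(Dhat_X, pseudo)

acc_weak = np.mean(pseudo == Dhat_y)                               # weak supervisor accuracy
acc_strong = np.mean(((Dhat_X @ w_strong) > 0).astype(int) == Dhat_y)
print(f"weak acc: {acc_weak:.2f}, strong acc: {acc_strong:.2f}")
```

With the clean representation, the strong probe averages out the weak supervisor's label errors, which is the weak-to-strong effect the paper's representation-based analysis concerns.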
Hardware Specification No The paper mentions conducting experiments with 'transformers' and 'LLMs' but does not specify any particular hardware like GPU models, CPU types, or cloud computing instances with their specifications.
Software Dependencies No The paper mentions using MolBERT (Fabian et al., 2020), BERT (Devlin, 2018), the Adam optimizer (Kingma, 2014), nvidia/NV-Embed-v2 (Lee et al., 2024), and Qwen/Qwen-7B (Bai et al., 2023). However, specific version numbers for software dependencies or libraries are not provided.
Experiment Setup Yes D.1.1. Molecular prediction. ...batch size of 256. For finetuning, we use SGD to train a linear model on representations with the following settings: batch size = 1024, learning rate = 0.001, weight decay = 0.1, and epochs = 2000... D.1.2. NLP tasks with embedding models. ...batch size = 200, learning rate = 0.01, weight decay = 0.00001, and epochs = 200. D.1.3. NLP tasks with end-to-end finetuned LLMs. ...batch size of 32 and trains for a single epoch. The learning rate is set to 5e-5 for weak models and 1e-5 for the strong model...
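The D.1.1 finetuning stage (a linear model trained with SGD on frozen representations) can be sketched as follows. The paper does not publish its training loop, so this is an assumption-laden reconstruction using the stated hyperparameters (batch size 1024, learning rate 0.001, weight decay 0.1, 2000 epochs); the representations and targets below are random stand-ins.

```python
import numpy as np

# Hedged sketch of D.1.1's finetuning: minibatch SGD on a linear regression
# head over frozen representations, using the hyperparameters quoted above.
# Data below is synthetic; only the optimization settings come from the paper.
rng = np.random.default_rng(0)

# Stand-in for precomputed molecular representations and regression targets
# (e.g. solubility values); dimensions are arbitrary choices.
n_samples, dim = 2048, 64
reps = rng.normal(size=(n_samples, dim))
targets = reps @ rng.normal(size=dim) * 0.1 + rng.normal(scale=0.01, size=n_samples)

batch_size, lr, weight_decay, epochs = 1024, 1e-3, 0.1, 2000
w = np.zeros(dim)

for _ in range(epochs):
    idx = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        b = idx[start:start + batch_size]
        err = reps[b] @ w - targets[b]                      # MSE residual
        grad = reps[b].T @ err / len(b) + weight_decay * w  # L2 (weight decay) term
        w -= lr * grad

mse = np.mean((reps @ w - targets) ** 2)
print(f"train MSE: {mse:.4f}")
```

With weight decay folded into the gradient this is plain L2-regularized SGD; the actual experiments train for 2000 epochs on real representation extracts, so runtimes differ.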