A transfer learning framework for weak-to-strong generalization

Authors: Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Yaacov Ritov, Mikhail Yurochkin, Yuekai Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we validate the methods suggested by our framework. Following the analogy for superalignment in Burns et al. (2023a), we use a smaller LLM to generate the weak labels for the purpose of training a larger LLM: the smaller LLM is the analog of human supervision in superalignment. For each experiment, additional details are provided in Appendix F. (...) Evaluation: In the persona experiment, the fine-tuned strong model (GPT-3.5-Turbo) is evaluated on the tiny versions of Alpaca Eval 2.0 and Truthful QA (Maia Polo et al., 2024). The tiny versions of these benchmarks are composed of 100 curated questions that capture the diversity present in the full datasets. (...) Results: Figures 1, 2, and 4 provide an empirical demonstration of the findings of our transfer learning framework.
Researcher Affiliation | Collaboration | University of Michigan; IBM Research; MIT-IBM Watson AI Lab. The authors are affiliated with the University of Michigan (an academic institution) and IBM Research / MIT-IBM Watson AI Lab (industry/corporate entities), indicating a collaborative effort.
Pseudocode | Yes | Algorithm 1 (ICL Refinement), Algorithm 2 (Infer-and-Respond), Algorithm 3 (Ask-to-Improve label improvement).
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for their methodology, nor does it include links to a code repository.
Open Datasets | Yes | In the persona experiment, the strong models are fine-tuned using questions selected from the Dolly (Conover et al., 2023) data set. In the mathematical reasoning experiment, the training data comes from either the gsm8k (Cobbe et al., 2021) data set or the MATH (Hendrycks et al., 2021) data set. In the explanation technique experiment, the training/test set is a set of science questions provided by GPT4 (Achiam et al., 2024).
Dataset Splits | No | The paper mentions the datasets used for training and evaluation and states that the test sets for Alpaca Eval 2.0 and Truthful QA consist of "100 curated questions". However, it does not provide explicit train/validation/test split percentages or sample counts for the datasets used in the experiments, nor does it cite predefined splits that would allow the data partitioning to be reproduced.
Hardware Specification | Yes | All experimental steps involving the weaker models (Falcon and Llama) were run on a computing cluster with two 16 GB V100 GPUs.
Software Dependencies | No | The paper mentions various LLMs by name and version (e.g., GPT-3.5-Turbo-0125, GPT-4o-mini-2024-07-18, Falcon-7B-Instruct, Llama-2-7B-Chat, Mistral-7B, Gemma-1.2B) and the "Open AI interface". However, it does not specify any programming languages, frameworks (such as PyTorch or TensorFlow), or other software libraries with version numbers that are critical for reproducing the experimental setup.
Experiment Setup | Yes | No validation data is used, and for the system prompt we use a generic "You are an AI assistant. Your task is to respond to questions or instructions." (...) In the ICL we use five in-context examples at a time. (...) For gsm8k 3 examples are used, while for MATH two examples are used.
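The experiment setup quoted above amounts to a few-shot prompting configuration: a generic system prompt plus a small number of in-context examples (five for ICL, three for gsm8k, two for MATH). A minimal sketch of how such a prompt might be assembled is shown below; the function name and the OpenAI-style message format are illustrative assumptions, not details taken from the paper.

```python
def build_chat_messages(question, examples,
                        system_prompt=("You are an AI assistant. "
                                       "Your task is to respond to questions or instructions.")):
    """Assemble a chat-style message list with k in-context examples.

    `examples` is a list of (question, answer) pairs; the message format
    follows the OpenAI chat convention (role/content dicts), which is an
    assumption here, not something the paper specifies.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for ex_question, ex_answer in examples:
        messages.append({"role": "user", "content": ex_question})
        messages.append({"role": "assistant", "content": ex_answer})
    # The actual query to answer comes last.
    messages.append({"role": "user", "content": question})
    return messages
```

With five in-context examples, as in the ICL setting described above, the resulting list holds one system message, five user/assistant example pairs, and the final user query (twelve messages in total).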