Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
Authors: Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution; for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin-Madison, 2Carnegie Mellon University, 3Microsoft. Correspondence to: Nayoung Lee <EMAIL>, Ziyang Cai <EMAIL>. |
| Pseudocode | Yes | Listing 1 Code for the maze format generation used |
| Open Source Code | Yes | Detailed dependencies are provided in our GitHub repository: https://github.com/JackCai1206/arithmetic-self-improve |
| Open Datasets | No | The paper states: "We generate an initial supervised training dataset D0" and describes the data generation process in Appendix C.2, including code for maze generation. No specific external public datasets, links, DOIs, or formal citations for existing open datasets are provided for the experimental data. |
| Dataset Splits | No | The paper describes generating synthetic training data (e.g., "We generate an initial supervised training dataset D0..."). For evaluation, it uses progressively harder, out-of-distribution problem instances (e.g., "generalizing from 10-digit to 100-digit addition"). It does not specify fixed training/test/validation splits (percentages or counts) of a single static dataset. |
| Hardware Specification | No | The paper mentions running experiments using "PyTorch 2.4 and CUDA 12.1" but does not specify any particular GPU models, CPU models, or other hardware details (e.g., A100, V100, Intel Xeon, TPU type). |
| Software Dependencies | Yes | Our experiments are run using PyTorch 2.4 and CUDA 12.1. |
| Experiment Setup | Yes | In this section, we provide a detailed overview of the hyperparameter configuration used in our experiments in Tables 4 and 5. Table 4 shows the training hyperparameters for the initial training phase on labeled data D0. Table 5 shows the hyperparameters for each of the self-improvement training rounds on D1, ..., DR. |
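The self-improvement procedure the report quotes (generate harder problems, keep only self-labeled examples that pass a filter, then retrain on them) can be sketched as follows. This is a hypothetical toy illustration, not the authors' implementation: `model_sample` and the majority-vote filter stand in for whatever sampling and correctness-filtering scheme the paper actually uses, and `noisy_adder` is an invented stand-in for a trained model.

```python
import random
from collections import Counter

def self_improve_round(model_sample, problems, k=5):
    """One hypothetical self-improvement round: sample k answers per
    problem and keep (problem, answer) pairs where a strict majority
    of samples agree -- a stand-in for the paper's filtering of
    correct self-generated examples."""
    new_data = []
    for p in problems:
        answers = [model_sample(p) for _ in range(k)]
        (best, count), = Counter(answers).most_common(1)
        if count > k // 2:  # majority agreement -> trust this label
            new_data.append((p, best))
    return new_data

# Invented toy "model": adds correctly but is noisy on its outputs.
def noisy_adder(problem, error_rate=0.2):
    a, b = problem
    ans = a + b
    if random.random() < error_rate:
        ans += random.choice([-1, 1])  # occasional off-by-one error
    return ans

random.seed(0)
# Harder, out-of-distribution problems (here: 10-digit operands).
problems = [(random.randrange(10**10), random.randrange(10**10))
            for _ in range(20)]
filtered = self_improve_round(noisy_adder, problems, k=5)
print(len(filtered), "self-generated examples kept for the next round")
```

In the paper's loop, the kept pairs would form the next round's training set D_{r+1}; the key claim is that this filter-then-retrain cycle compounds across rounds, extending the solvable problem length.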