Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges

Authors: Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution, for instance generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds.
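The row above summarizes the paper's filtered self-improvement recipe: train on easy labeled data, then repeatedly let the model answer slightly harder problems and keep only the self-generated examples that pass a correctness filter. A minimal sketch of that loop on the addition task, with a noisy toy "model" and an oracle filter standing in for the paper's actual transformer and filtering heuristics (all names here are illustrative, not taken from the released code):

```python
import random

def make_problems(num_digits, n):
    """Generate n random addition problems with num_digits-digit operands."""
    lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
    return [(random.randint(lo, hi), random.randint(lo, hi)) for _ in range(n)]

def self_improve(model, initial_digits, rounds, n_per_round, check):
    """One filtered self-improvement pass: at each round, sample the model's
    own answers on slightly harder problems (one more digit per round), keep
    only answers the filter judges correct, and grow the training pool."""
    pool = []
    for r in range(rounds):
        problems = make_problems(initial_digits + r, n_per_round)
        answers = [(a, b, model(a, b)) for a, b in problems]
        # Filter step: retain only self-generated examples judged correct.
        pool.extend(ex for ex in answers if check(*ex))
    return pool

# Toy stand-ins: the real work fine-tunes a transformer on the filtered pool
# after each round, and filters without an oracle (e.g. via length or
# consistency heuristics); these placeholders just illustrate the data flow.
noisy_adder = lambda a, b: a + b if random.random() > 0.2 else a + b + 1
exact_check = lambda a, b, y: y == a + b  # oracle filter, for illustration only

data = self_improve(noisy_adder, initial_digits=2, rounds=3,
                    n_per_round=100, check=exact_check)
```

Here `data` contains only verified examples, so every surviving (a, b, y) triple satisfies y = a + b; in the paper, retraining on this growing filtered pool is what drives each round of out-of-distribution improvement.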
Researcher Affiliation Collaboration 1University of Wisconsin-Madison, 2Carnegie Mellon University, 3Microsoft. Correspondence to: Nayoung Lee <EMAIL>, Ziyang Cai <EMAIL>.
Pseudocode Yes Listing 1: "Code for the maze format generation used"
Open Source Code Yes "Detailed dependencies are provided in our github repository" (https://github.com/JackCai1206/arithmetic-self-improve).
Open Datasets No The paper states: "We generate an initial supervised training dataset D0" and describes the data generation process in Appendix C.2, including code for maze generation. No specific external public datasets, links, DOIs, or formal citations for existing open datasets are provided for the experimental data.
Dataset Splits No The paper describes generating synthetic training data (e.g., "We generate an initial supervised training dataset D0..."). For evaluation, it uses progressively harder, out-of-distribution problem instances (e.g., "generalizing from 10-digit to 100-digit addition"). It does not specify fixed training/test/validation splits (percentages or counts) of a single static dataset.
Hardware Specification No The paper mentions running experiments using "PyTorch 2.4 and CUDA 12.1" but does not specify any particular GPU models, CPU models, or other hardware details (e.g., A100, V100, Intel Xeon, TPU type).
Software Dependencies Yes "Our experiments are run using PyTorch 2.4 and CUDA 12.1."
Experiment Setup Yes "In this section, we provide a detailed overview of the hyperparameter configuration used in our experiments in Tables 4 and 5. Table 4 shows the training hyperparameters for the initial training phase on labeled data D0. Table 5 shows the hyperparameters for each of the self-improvement training rounds on D1, ..., DR."