Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
Authors: Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution; for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin-Madison, 2Carnegie Mellon University, 3Microsoft. Correspondence to: Nayoung Lee <EMAIL>, Ziyang Cai <EMAIL>. |
| Pseudocode | Yes | Listing 1 Code for the maze format generation used |
| Open Source Code | Yes | Detailed dependencies are provided in our GitHub repository: https://github.com/JackCai1206/arithmetic-self-improve |
| Open Datasets | No | The paper states: "We generate an initial supervised training dataset D0" and describes the data generation process in Appendix C.2, including code for maze generation. No specific external public datasets, links, DOIs, or formal citations for existing open datasets are provided for the experimental data. |
| Dataset Splits | No | The paper describes generating synthetic training data (e.g., "We generate an initial supervised training dataset D0..."). For evaluation, it uses progressively harder, out-of-distribution problem instances (e.g., "generalizing from 10-digit to 100-digit addition"). It does not specify fixed training/test/validation splits (percentages or counts) of a single static dataset. |
| Hardware Specification | No | The paper mentions running experiments using "PyTorch 2.4 and CUDA 12.1" but does not specify any particular GPU models, CPU models, or other hardware details (e.g., A100, V100, Intel Xeon, TPU type). |
| Software Dependencies | Yes | Our experiments are run using PyTorch 2.4 and CUDA 12.1. |
| Experiment Setup | Yes | In this section, we provide a detailed overview of the hyperparameter configuration used in our experiments in Tables 4 and 5. Table 4 shows the training hyperparameters for the initial training phase on labeled data D0. Table 5 shows the hyperparameters for each of the self-improvement training rounds on D1, ..., DR. |
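The self-improvement procedure the report quotes (generate harder problems, keep only self-labeled examples that pass a filter, then retrain on them) can be sketched as follows. This is a hypothetical toy illustration, not the authors' implementation: `model_sample` and the majority-vote filter stand in for whatever sampling and correctness-filtering scheme the paper actually uses, and `noisy_adder` is an invented stand-in for a trained model.

```python
import random
from collections import Counter

def self_improve_round(model_sample, problems, k=5):
    """One hypothetical self-improvement round: sample k answers per
    problem and keep (problem, answer) pairs where a strict majority
    of samples agree -- a stand-in for the paper's filtering of
    correct self-generated examples."""
    new_data = []
    for p in problems:
        answers = [model_sample(p) for _ in range(k)]
        (best, count), = Counter(answers).most_common(1)
        if count > k // 2:  # majority agreement -> trust this label
            new_data.append((p, best))
    return new_data

# Invented toy "model": adds correctly but is noisy on its outputs.
def noisy_adder(problem, error_rate=0.2):
    a, b = problem
    ans = a + b
    if random.random() < error_rate:
        ans += random.choice([-1, 1])  # occasional off-by-one error
    return ans

random.seed(0)
# Harder, out-of-distribution problems (here: 10-digit operands).
problems = [(random.randrange(10**10), random.randrange(10**10))
            for _ in range(20)]
filtered = self_improve_round(noisy_adder, problems, k=5)
print(len(filtered), "self-generated examples kept for the next round")
```

In the paper's loop, the kept pairs would form the next round's training set D_{r+1}; the key claim is that this filter-then-retrain cycle compounds across rounds, extending the solvable problem length.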