Universal Length Generalization with Turing Programs
Authors: Kaiying Hou, David Brandfonbrener, Sham M. Kakade, Samy Jelassi, Eran Malach
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. |
| Researcher Affiliation | Academia | 1Department of Mathematics, Harvard University, Cambridge, MA, United States 2Department of Mathematics, UC Berkeley, Berkeley, CA, United States 3Kempner Institute, Harvard University, Cambridge, MA, United States 4Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA, United States. |
| Pseudocode | Yes | In this work, we construct a simple RASP program that generates Turing Programs to simulate arbitrary Turing machines. Appendix D ("RASP Turing Programs"), Section D.1 ("RASP Python Definitions", from Zhou et al., 2023) begins: `import numpy as np`; `def full(x, const): return np.full_like(x, const, dtype=int)` ... (full RASP Python definitions and utility functions are provided in Appendix D.1–D.5). |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for their methodology, nor does it provide links to a code repository. |
| Open Datasets | No | Data. We adopt the scratchpad format and write all the steps into one sequence, where steps are separated by a separator token. Figure 2 shows our scratchpad strategy for obtaining length generalization on addition. If not specified otherwise, our token space is of size 24 and made of V = {0, ..., 9, +, a, ..., j, <\|BOS\|>, <\|EOS\|>, <\|SEP\|>}. All the digits are sampled uniformly as follows: we first sample the length of each operand (between 2 and L = 50) and then independently sample each digit. Finally, we pack the context with i.i.d. sequences during training, i.e. we fill the context with multiple independent samples of the task (similarly to Zhou et al., 2023). |
| Dataset Splits | No | At test time, we evaluate our models using a sliding window: we generate tokens until the training context length (500) is filled, and then each additional token is generated by feeding the context of the most recent 500 tokens, effectively dropping all past tokens. This way, we are able to generate very long sequences of tokens without training or evaluating on long context windows. To evaluate the accuracy at a given length, we test the model's output on 288 examples. The paper describes a strategy for generating training data and length-generalization test sets, but it does not specify explicit training/validation/test splits in terms of percentages or counts for a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | Our base model is a 150M-parameter Transformer with L = 12 layers, a hidden size of D = 1024, a feedforward layer with a hidden dimension of 4096, and H = 16 attention heads. The backbone of our model is based on the GPT-NeoX architecture (Black et al., 2022). We use the AdamW optimizer (Loshchilov & Hutter, 2017) to train the model with a weight decay value of 0.1 and no dropout, for 200,000 steps. While the paper mentions the model architecture (GPT-NeoX) and the optimizer (AdamW), it does not specify software versions for programming languages, libraries (e.g., PyTorch, TensorFlow), or other dependencies. |
| Experiment Setup | Yes | Our base model is a 150M-parameter Transformer with L = 12 layers, a hidden size of D = 1024, a feedforward layer with a hidden dimension of 4096, and H = 16 attention heads. The backbone of our model is based on the GPT-NeoX architecture (Black et al., 2022). We pick a context length of 500 tokens. We use the AdamW optimizer (Loshchilov & Hutter, 2017) to train the model with a weight decay value of 0.1 and no dropout, for 200,000 steps. The learning rate schedule incorporates an initial 100-step linear warm-up, followed by a linear decay, starting at 7e-5. Similarly to (Jelassi et al., 2024), we use M masked heads and (H − M) NoPE heads. In the masked heads, we respectively set the hyperparameter m to 1, 2, ..., M. We swept over M ∈ {3, 4, 5, 6, 7, 8} and found that M = 6 is the best choice. |
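The sliding-window evaluation quoted under Dataset Splits can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `model` is assumed to be any callable mapping a token window to the next token id, and `sliding_window_generate` is a name introduced here.

```python
CONTEXT_LEN = 500  # training context length reported in the paper

def sliding_window_generate(model, prompt, max_new_tokens, eos_token):
    """Generate tokens, always conditioning on at most the last CONTEXT_LEN
    tokens — older tokens are dropped, as described in the paper."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        window = tokens[-CONTEXT_LEN:]  # keep only the most recent 500 tokens
        nxt = model(window)             # assumed: returns the next token id
        tokens.append(nxt)
        if nxt == eos_token:
            break
    return tokens
```

Because each step only ever sees 500 tokens, the model can emit sequences far longer than its training context without any long-context training or evaluation.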
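The data-generation recipe quoted under Open Datasets (uniform operand lengths in [2, 50], i.i.d. digits, and context packing with independent samples) can be sketched as follows. Function names are hypothetical, and the scratchpad steps between examples are omitted for brevity.

```python
import random

MAX_OPERAND_LEN = 50  # maximum operand length L reported in the paper

def sample_addition_example(rng):
    # Sample each operand's length uniformly in [2, 50], then each digit i.i.d.
    def operand():
        n = rng.randint(2, MAX_OPERAND_LEN)
        return "".join(rng.choice("0123456789") for _ in range(n))
    return f"{operand()}+{operand()}"

def pack_context(rng, context_len=500):
    # Fill the context with multiple independent samples, separated by a
    # separator token, then truncate to the training context length.
    seq = []
    while len(seq) < context_len:
        seq.extend(sample_addition_example(rng))  # one character per token
        seq.append("<|SEP|>")
    return seq[:context_len]
```

The character-level tokenization here (one digit per token) matches the 24-symbol vocabulary the paper describes; the scratchpad expansion of each example would be inserted before the separator in the real pipeline.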
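The learning-rate schedule in the Experiment Setup row (100-step linear warm-up, then linear decay from 7e-5) could be reproduced with a small helper like the one below. Decaying to exactly zero at step 200,000 is an assumption; the paper does not state the final value.

```python
PEAK_LR = 7e-5     # peak learning rate reported in the paper
WARMUP_STEPS = 100  # linear warm-up length reported in the paper
TOTAL_STEPS = 200_000  # total training steps reported in the paper

def lr_at(step):
    """Linear warm-up to PEAK_LR over the first 100 steps, then linear
    decay (assumed to reach zero at the final training step)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))
```

A schedule of this shape plugs directly into, e.g., a per-step LR multiplier callback in most training frameworks.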