Looped Transformers for Length Generalization
Authors: Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the efficacy of looped Transformers in solving tasks that require length generalization. We introduce the experimental setup in Section 6.1, present length generalization results in Section 6.2 and ablation studies in Section 6.3, and visualize the stopping criterion in Section 6.4. |
| Researcher Affiliation | Academia | Ying Fan¹, Yilun Du², Kannan Ramchandran³, Kangwook Lee¹ — ¹University of Wisconsin-Madison, ²Massachusetts Institute of Technology, ³UC Berkeley |
| Pseudocode | Yes | Here we provide the n-RASP-L programs for our Parity, Addition and Copy tasks in Listings 1, 2, 3. We also present the RASP-L library functions we use in Listing 4, which is partially taken from Zhou et al. (2024a). |
| Open Source Code | Yes | Code is available at https://github.com/UW-Madison-Lee-Lab/looped-tf. |
| Open Datasets | No | The paper describes tasks (Parity, Copy, Addition, etc.) and specifies how data is generated: "Given any length, the probability of each possible character is evenly distributed instead of from a finite train set to avoid over-fitting, and the length is also evenly distributed." This indicates data is synthetically generated rather than using a pre-existing, named public dataset with concrete access information (link, DOI, citation). |
| Dataset Splits | Yes | For the training distribution, we adopt the online training scheme following Zhou et al. (2024a), where each batch is i.i.d. sampled. Given any length, the probability of each possible character is evenly distributed (rather than drawn from a finite train set, to avoid over-fitting), and the length is also evenly distributed. We also use a curriculum to gradually increase the maximum training length (see Table 2 for the specific setup for each task). For evaluation, we test with 6400 random samples in Figures 4, 5, 6, and 7, and report the mean exact match accuracy and standard error over five training runs with different random seeds. |
| Hardware Specification | No | The paper does not explicitly specify the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. It only discusses the model architecture and training setup. |
| Software Dependencies | No | The paper mentions using a "decoder-only GPT-2 architecture" and the AdamW optimizer, but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch or TensorFlow), or programming languages used in the implementation. |
| Experiment Setup | Yes | For training, we use a decoder-only GPT-2 architecture (Radford et al., 2019). We adopt a curriculum learning strategy for all methods that starts from the smallest length and incrementally increases the length during training until it reaches the maximum length, as in Garg et al. (2022). For the training distribution, we adopt the online training scheme following Zhou et al. (2024a), where each batch is i.i.d. sampled... We also use a curriculum to gradually increase the maximum training length (see Table 2 for the specific setup for each task). We use the AdamW optimizer and decay the learning rate from 10⁻⁴ to 0 with a cosine decay scheduler after reaching the maximum training length, with batch size 64, and train for a total of 100k gradient steps. Additionally, non-converging tasks are less tolerant of which step is chosen for stopping, and we find using the exponential moving average (EMA) of model parameters helpful for Parity and Binary Sum, with a factor of 0.9999. Table 2: Task-specific experimental hyperparameters. Number of Heads and Block Depth define the size of the looped decoder block. Interval denotes the number of training steps between successive increases in the input sequence length. |
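The training schedule quoted above (AdamW with the learning rate held at 10⁻⁴, then cosine-decayed to 0 after the maximum training length is reached, plus an EMA of model parameters with factor 0.9999) can be sketched as follows. This is a minimal, framework-free illustration of the stated hyperparameters; the helper names `cosine_lr` and `ema_update` are our own, not from the paper's code.

```python
import math

def cosine_lr(step, decay_start, total_steps, base_lr=1e-4):
    """Learning rate per the paper's description: constant base_lr until
    decay_start (when the curriculum reaches the max training length),
    then cosine decay from base_lr to 0 over the remaining steps."""
    if step < decay_start:
        return base_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

def ema_update(ema_params, params, decay=0.9999):
    """In-place exponential moving average of parameters (flat float lists
    here for illustration; a real model would iterate over tensors)."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p
```

In a real run these would sit inside the training loop: set the optimizer's learning rate to `cosine_lr(step, ...)` each step, and call `ema_update` on a shadow copy of the model parameters after each optimizer step, evaluating the EMA copy for tasks like Parity and Binary Sum.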