Looped Transformers for Length Generalization

Authors: Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the efficacy of looped Transformers in solving tasks that require length generalization. We introduce the experimental setup in Section 6.1, present length generalization results in Section 6.2 and ablation studies in Section 6.3, and visualize the stopping criterion in Section 6.4.
Researcher Affiliation | Academia | Ying Fan (University of Wisconsin-Madison), Yilun Du (Massachusetts Institute of Technology), Kannan Ramchandran (UC Berkeley), Kangwook Lee (University of Wisconsin-Madison)
Pseudocode | Yes | Here we provide the n-RASP-L programs for our Parity, Addition and Copy tasks in Listings 1, 2, 3. We also present the RASP-L library functions we use in Listing 4, which is partially taken from Zhou et al. (2024a).
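The paper's actual n-RASP-L programs are in its Listings 1–4; as a hedged illustration of the underlying idea only (a fixed-size step applied a number of times that grows with input length, mirroring repeated applications of the looped decoder block), a hypothetical Python sketch of Parity might look like the following. The names `parity_step` and `parity` are illustrative assumptions, not the paper's code:

```python
def parity_step(state):
    """One fixed-size step: fold the last remaining bit into the running
    parity. Illustrative stand-in for a single looped-block application."""
    bits, acc = state
    if not bits:
        return bits, acc
    return bits[:-1], acc ^ bits[-1]

def parity(bits):
    """Parity of a bit sequence via length-many applications of the same
    constant-size step (the n-RASP-L pattern, sketched)."""
    state = (list(bits), 0)
    for _ in range(len(bits)):  # loop count scales with input length
        state = parity_step(state)
    return state[1]
```

Because each iteration is length-independent, only the loop count changes with input size, which is the property the looped architecture exploits for length generalization.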
Open Source Code | Yes | Code is available at https://github.com/UW-Madison-Lee-Lab/looped-tf.
Open Datasets | No | The paper describes tasks (Parity, Copy, Addition, etc.) and specifies how data is generated: "Given any length, the probability of each possible character is evenly distributed instead of from a finite train set to avoid over-fitting, and the length is also evenly distributed." This indicates the data is synthetically generated rather than drawn from a pre-existing, named public dataset with concrete access information (link, DOI, citation).
Dataset Splits | Yes | For the training distribution, we adopt the online training scheme following Zhou et al. (2024a), where each batch is i.i.d. sampled. Given any length, the probability of each possible character is evenly distributed instead of from a finite train set to avoid over-fitting, and the length is also evenly distributed. We also use a curriculum to gradually increase the maximum training length (see Table 2 for the specific setup for each task). For evaluation, we test with 6400 random samples in Figures 4, 5, 6, and 7, and report the mean exact match accuracy and standard error from five training runs with different random seeds.
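The online i.i.d. sampling scheme described above (uniform length, uniform characters, no finite train set) can be sketched in Python. The function name `sample_batch` and the integer token encoding are illustrative assumptions, not taken from the paper's released code:

```python
import random

def sample_batch(batch_size, max_len, vocab_size):
    """Online i.i.d. sampling: each example draws its length uniformly
    from [1, max_len] and each character uniformly from the vocabulary,
    so no finite training set is ever fixed (avoiding over-fitting)."""
    batch = []
    for _ in range(batch_size):
        length = random.randint(1, max_len)        # length evenly distributed
        seq = [random.randrange(vocab_size)        # characters evenly distributed
               for _ in range(length)]
        batch.append(seq)
    return batch
```

Under a curriculum, `max_len` would be raised over the course of training; every batch is freshly sampled, so the model never revisits a fixed dataset.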
Hardware Specification | No | The paper does not explicitly specify the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. It only discusses the model architecture and training setup.
Software Dependencies | No | The paper mentions using a "decoder-only GPT-2 architecture" and the "AdamW optimizer" but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used in the implementation.
Experiment Setup | Yes | For training, we use a decoder-only GPT-2 architecture (Radford et al., 2019). We adopt a curriculum learning strategy for all methods that starts from the smallest length and incrementally increases the length during training until it reaches the maximum length, as in Garg et al. (2022). For the training distribution, we adopt the online training scheme following Zhou et al. (2024a), where each batch is i.i.d. sampled... We also use a curriculum to gradually increase the maximum training length (see Table 2 for the specific setup for each task). We use the AdamW optimizer and decay the learning rate from 10^-4 to 0 with a cosine decay schedule after reaching the maximum training length, with batch size 64, and train for a total of 100k gradient steps. Additionally, non-converging tasks are less tolerant of which step to stop at, and we find using an exponential moving average (EMA) of model parameters with factor 0.9999 helpful for Parity and Binary Sum. Table 2: Task-specific experimental hyperparameters. Number of Heads and Block Depth define the size of the looped decoder block. Interval denotes the number of training steps between successive increases in the input sequence length.
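The schedule described above (constant learning rate during the curriculum, then cosine decay from 10^-4 to 0, plus a parameter EMA with factor 0.9999) can be sketched as follows. The helper names `lr_at` and `ema_update`, and the assumption that the curriculum ends at a known step, are illustrative and not from the paper:

```python
import math

def lr_at(step, curriculum_end, total_steps, base_lr=1e-4):
    """Constant base_lr until the curriculum reaches maximum training
    length, then cosine decay to 0 over the remaining steps."""
    if step < curriculum_end:
        return base_lr
    frac = (step - curriculum_end) / max(1, total_steps - curriculum_end)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(frac, 1.0)))

def ema_update(ema_params, params, decay=0.9999):
    """Exponential moving average of model parameters, updated once per
    gradient step; the EMA weights are used at evaluation time."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With a decay of 0.9999, the EMA averages over roughly the last 10k steps, which is what makes stopping-step choice less brittle on non-converging tasks such as Parity and Binary Sum.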