TRA: Better Length Generalisation with Threshold Relative Attention
Authors: Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our hypothesis, we first turn to controlled synthetic tasks, followed by language modeling. Flip-Flop Language Modelling: Introduced by Liu et al. (2023a), flip-flop language modelling is an algorithmic reasoning benchmark designed to test transformers' ability to glitchlessly handle sequential dependencies. Table 1: Results represent the average across four random initialisations. Metric is full-sequence accuracy (exact match). |
| Researcher Affiliation | Collaboration | Mattia Opper EMAIL University of Edinburgh Roland Fernandez EMAIL Microsoft Research Paul Smolensky EMAIL Microsoft Research Johns Hopkins University Jianfeng Gao EMAIL Microsoft Research |
| Pseudocode | Yes | Appendix A, Implementation: `class TRA(nn.Module): ...` (Listing 1: TRA Implementation) |
| Open Source Code | Yes | PyTorch code for the core mechanism is provided in Appendix A. |
| Open Datasets | Yes | Flip-Flop Language Modelling: Introduced by Liu et al. (2023a), flip-flop language modelling is an algorithmic reasoning benchmark designed to test transformers' ability to glitchlessly handle sequential dependencies. For our experiments we turn to the WikiText-103 benchmark (Merity et al., 2016), which consists of full Wikipedia articles comprising circa 100 million tokens, probing both scale and the ability to handle long-distance dependencies. |
| Dataset Splits | Yes | The task consists of three test sets: IID, and two out-of-distribution sets that vary the number of intervening ignore instructions. OOD sparse increases the number of ignore instructions and requires the ability to handle increased dependency distance. OOD dense decreases the number of ignore instructions and consequently probes whether the model retains focus in the presence of an increased number of attractors (i.e. write instructions). For the copy and induct tasks we train on sequences of input length 50 and evaluate on buckets of increased OOD lengths up to 300. The training set contains examples from each of the instructions and sequence lengths of 0-50. The test set consists of sequences of lengths 51-500. In distribution: we measure perplexity on the test set using the training window size of 128. Out of distribution: we increase the window size to 4096 and observe the extent to which perplexity changes with increased length. |
| Hardware Specification | Yes | On a single Nvidia A40 card with window size 256 and batch size 64: Tiny RoPE: 20k steps, 58.73 minutes; Tiny TRA: 20k steps, 65.88 minutes (circa a seven-minute increase); Medium RoPE: 20k steps, 156.34 minutes; Medium TRA: 20k steps, 178.19 minutes (circa a 22-minute increase). |
| Software Dependencies | No | The paper implicitly uses Python for the PyTorch code and refers to the GPT-2 tokenizer, but no specific version numbers for any libraries or dependencies are provided. |
| Experiment Setup | Yes | For all models we use the mini configuration from Turc et al. (2019): four heads, four layers and a 256-dimensional embedding size. The MLP is set to 2x the embedding size for both the linear and gate hidden units. Our focus is on small models, following prior work (Zhou et al., 2023). Furthermore, attention glitches have been shown to persist at scale (Liu et al., 2023a), and the true solution for these tasks should not require additional layers. We use a linear warm-up for the first 5% of steps coupled with cosine decay. Other task- and model-specific hyper-parameters can be found in Appendix B. For synthetic tasks we train for one full pass through the data; for copy and induct this required circa 100k steps as our training set consisted of 4 million examples, while for flip-flop language modeling this corresponded to 20k. We used batch size 128 for the former tasks and batch size 64 for FFL to accommodate its larger sequence length (512). For RoPE we set theta to 500k following Dubey et al. (2024). For CoPE we set npos_max to 64 following the original authors (Golovneva et al., 2024). All models are trained using the AdamW optimiser with a low dropout of 0.01 applied to the attention weights and feedforward hidden representation. For language modeling we increase model size to the medium configuration of Turc et al. (2019); Opper et al. (2023a): eight layers, eight heads and a 512-dimensional embedding size. The MLP is set to 2x the embedding size for both the linear and gate hidden units as before. We use the GPT-2 tokenizer (Radford et al., 2019), which leads to a vocabulary size of roughly 52k, totaling circa 80 million parameters for both TRA and the baselines. We train using: window size 128, batch size 64, 100k steps. Our scheduling regime remains the same as before. |
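The experiment setup specifies a linear warm-up over the first 5% of steps followed by cosine decay. A minimal sketch of that schedule as a learning-rate multiplier is shown below; the function name `lr_scale`, the `warmup_frac` parameter, and the decay-to-zero floor are illustrative assumptions, not details taken from the paper.

```python
import math

def lr_scale(step, total_steps, warmup_frac=0.05):
    """Learning-rate multiplier: linear warm-up for the first
    `warmup_frac` of steps, then cosine decay toward zero.
    (Sketch only; the paper does not specify the decay floor.)"""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warm-up phase.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 at the end of warm-up to 0 at the final step.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the 100k-step language-modeling runs this gives 5k warm-up steps; the multiplier would typically be applied to the base learning rate via an optimizer scheduler (e.g. `torch.optim.lr_scheduler.LambdaLR`).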