TRA: Better Length Generalisation with Threshold Relative Attention
Authors: Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our hypothesis, we first turn to controlled synthetic tasks, followed by language modeling. Flip-Flop Language Modelling: Introduced by Liu et al. (2023a), flip-flop language modelling is an algorithmic reasoning benchmark designed to test transformers' ability to glitchlessly handle sequential dependencies. Table 1: Results represent the average across four random initialisations. Metric is full-sequence accuracy (exact match). |
| Researcher Affiliation | Collaboration | Mattia Opper EMAIL University of Edinburgh Roland Fernandez EMAIL Microsoft Research Paul Smolensky EMAIL Microsoft Research Johns Hopkins University Jianfeng Gao EMAIL Microsoft Research |
| Pseudocode | Yes | Appendix A, Implementation: `class TRA(nn.Module): ...` (Listing 1: TRA Implementation) |
| Open Source Code | Yes | PyTorch code for the core mechanism is provided in Appendix A. |
| Open Datasets | Yes | Flip-Flop Language Modelling: Introduced by Liu et al. (2023a), flip-flop language modelling is an algorithmic reasoning benchmark designed to test transformers' ability to glitchlessly handle sequential dependencies. For our experiments we turn to the WikiText-103 benchmark (Merity et al., 2016), which consists of full Wikipedia articles comprising circa 100 million tokens, probing both scale and the ability to handle long-distance dependencies. |
| Dataset Splits | Yes | The task consists of three test sets: IID, and two out-of-distribution sets that vary the number of intervening ignore instructions. OOD sparse increases the number of ignore instructions and requires the ability to handle increased dependency distance. OOD dense decreases the number of ignore instructions and consequently probes whether the model retains focus in the presence of an increased number of attractors (i.e. write instructions). For the copy and induct tasks we train on sequences of input length 50 and evaluate on buckets of increased OOD lengths up to 300. The training set contains examples from each of the instructions and sequence lengths of 0-50. The test set consists of sequences of lengths 51-500. In distribution: we measure perplexity on the test set using the training window size of 128. Out of distribution: we increase the window size to 4096 and observe the extent to which perplexity changes with increased length. |
| Hardware Specification | Yes | On a single Nvidia A40 card with window size 256 and batch size 64: Tiny RoPE: 20k steps, 58.73 minutes; Tiny TRA: 20k steps, 65.88 minutes (circa a seven-minute increase); Medium RoPE: 20k steps, 156.34 minutes; Medium TRA: 20k steps, 178.19 minutes (circa a 22-minute increase). |
| Software Dependencies | No | The paper implicitly uses Python for the PyTorch code and refers to the GPT-2 tokenizer, but no specific version numbers for any libraries or dependencies are provided. |
| Experiment Setup | Yes | For all models we use the mini configuration from Turc et al. (2019): four heads, four layers and a 256-dimensional embedding size. The MLP is set to 2x the embedding size for both the linear and gate hidden units. Our focus is on small models, following prior work (Zhou et al., 2023). Furthermore, attention glitches have been shown to persist at scale (Liu et al., 2023a), and the true solution for these tasks should not require additional layers. We use a linear warm-up for the first 5% of steps coupled with cosine decay. Other task- and model-specific hyper-parameters can be found in Appendix B. For synthetic tasks we train for one full pass through the data; for copy and induct this required circa 100k steps as our training set consisted of 4 million examples, while for flip-flop language modeling this corresponded to 20k. We used batch size 128 for the former tasks and batch size 64 for FFL to accommodate its larger sequence length (512). For RoPE we set theta to 500k following Dubey et al. (2024). For CoPE we set npos_max to 64 following the original authors (Golovneva et al., 2024). All models are trained using the AdamW optimiser with a low dropout of 0.01 applied to the attention weights and feedforward hidden representation. For language modeling we increase model size to the medium configuration of Turc et al. (2019); Opper et al. (2023a): eight layers, eight heads and a 512-dimensional embedding size. The MLP is set to 2x the embedding size for both the linear and gate hidden units as before. We use the GPT-2 tokenizer (Radford et al., 2019), which leads to a vocabulary size of roughly 52k, totaling circa 80 million parameters for both TRA and the baselines. We train using: window size 128, batch size 64, 100k steps. Our scheduling regime remains the same as before. |
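The experiment setup specifies a linear warm-up over the first 5% of steps followed by cosine decay. A minimal sketch of that schedule as a learning-rate multiplier is shown below; the function name `lr_scale`, the `warmup_frac` parameter, and the decay-to-zero floor are illustrative assumptions, not details taken from the paper.

```python
import math

def lr_scale(step, total_steps, warmup_frac=0.05):
    """Learning-rate multiplier: linear warm-up for the first
    `warmup_frac` of steps, then cosine decay toward zero.
    (Sketch only; the paper does not specify the decay floor.)"""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warm-up phase.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 at the end of warm-up to 0 at the final step.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the 100k-step language-modeling runs this gives 5k warm-up steps; the multiplier would typically be applied to the base learning rate via an optimizer scheduler (e.g. `torch.optim.lr_scheduler.LambdaLR`).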