Plug, Play, and Generalize: Length Extrapolation with Pointer-Augmented Neural Memory

Authors: Hung Le, Dung Nguyen, Kien Do, Svetha Venkatesh, Truyen Tran

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments showcase PANM's exceptional length extrapolation capabilities and its enhancement of recurrent neural networks in symbol-processing tasks, including algorithmic reasoning and Dyck language recognition. PANM enables Transformers to achieve up to 100% generalization accuracy in compositional learning tasks and significantly improves performance in mathematical reasoning, question answering, and machine translation.
Researcher Affiliation | Academia | Hung Le (EMAIL), Applied AI Institute, Deakin University, Australia
Pseudocode | Yes | A summary of PANM's operation is given in Algo. 1 and Fig. 4 in the Appendix. [...] Algorithm 1: PANM training.
Open Source Code | No | The text discusses the source code of third-party tools or platforms that the authors used, but does not provide their own implementation code. For example: "For NTM and DNC, we use public repositories...", "We use the ESBN authors' codebase...", "we adopt the SRNN code from Suzgun et al. (2019)...", "we adopt the code from Csordás et al. (2021)." There is no explicit statement of the authors releasing their own PANM implementation.
Open Datasets | Yes | We also apply PANM to Transformer models, improving their performance on compositional learning with SCAN and mathematics datasets. Additionally, PANM significantly enhances Transformer and BERT generalization in question answering and machine translation tasks. [...] For this purpose, we utilize two datasets, namely bAbI (Weston et al., 2015) and SQuAD 1.1 (Rajpurkar et al., 2016) [...] The results are presented in Fig. 3 (b), where we report the model perplexity on the Multi30K (en-de) dataset. [...] We pick the first two tasks from the BIG-bench benchmark (Srivastava et al., 2023)
Dataset Splits | Yes | We focus on the length-split datasets where the training sequences are shorter than the test ones, with 11 length modes L = 22, 24, ..., 40 (Newman et al., 2020). [...] In bAbI, we configure the PANM similarly to the one described in Section 3.3 using a Transformer backbone, and test the models after 100-epoch training. The models predict the answer tokens given the context and question tokens. As shown in Table 5 and Appendix Fig. 5 (right), PANM helps Transformer generalize better, consistently improving by around 6% and 5% using 0.8/0.2 and 0.5/0.5 splits, respectively. [...] The 30K-sample dataset is sorted by input length and split into training and testing such that the testing sequences are longer, similar to the QA task. [...] In the first test, we use Copy and Dynamic Recall (D. Recall), similar to those described in Section 3.1. [...] the training data consists of 100,000 sequences, each a maximum of 10 letters long, combined with an instruction introducing the task (see Appendix D.5). After fine-tuning, we evaluate the models on multiple testing sets, each of 1000 testing sequences with sequence lengths ranging from 10 to 1000 letters. [...]
Table 10: BIG-bench tasks.
Task | Train/Test Size | Train Max Length | Test Max Length
bigbench_arithmetic_generate_until | 1000/1000 | 126 characters | 178 characters
bigbench_abstract_narrative_understanding_generate_until | 700/200 | 828 characters | 1507 characters
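The length-based splitting the paper describes (sort by input length, train on the shorter sequences, test on the longer ones) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the train fraction argument are assumptions:

```python
# Hypothetical sketch of a length-based train/test split: sort sequences by
# length so every held-out test sequence is at least as long as every
# training sequence, as in the paper's QA and translation setups.
def length_split(sequences, train_fraction=0.8):
    """Sort by length and assign the shortest `train_fraction` to training."""
    ordered = sorted(sequences, key=len)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

train, test = length_split(["ab", "abcdef", "a", "abcd", "abcde"], train_fraction=0.6)
# No training sequence is longer than any test sequence.
assert max(map(len, train)) <= min(map(len, test))
```

A 0.8/0.2 or 0.5/0.5 ratio, as reported for the bAbI experiments, is just a different `train_fraction`.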
Hardware Specification | Yes | We trained all the models on a single Tesla V100-SXM2 GPU.
Software Dependencies | No | The paper mentions using specific software components, such as the "PyTorch library", the "BERT model (https://huggingface.co/bert-base-uncased)", and the "torchtune library: https://github.com/pytorch/torchtune". However, it does not provide explicit version numbers for these software libraries, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | In our experiments, we use two pointer variables in Mode-1 access and one for Mode-2 to balance performance and computing cost (Ha = 2, Hc = 1; see more in Appendix C). The two Mode-1 pointer variables are initialized as the base and end addresses. All MLPs in PANM have 1 hidden layer of 128 dimensions. We use 256-dimensional GRUs for PU and Ctrl. The memory's address space has b = 10 bits, corresponding to a maximum of 1024 unique addresses, which is greater than any sequence length in the experiments. [...] All baselines are trained for a fixed number of steps (100K for ID Sort and 50K for the rest), which is enough for the training loss to converge. For each task, each baseline is trained 5 times with different random seeds, and we use the best checkpoint on L + 1 mode validation to evaluate the baselines. [...] The models are fine-tuned with the same training configuration, such as LoRA (Hu et al., 2021) and the AdamW optimizer (Loshchilov & Hutter, 2018). The evaluation is executed using the Language Model Evaluation Harness library (Gao et al., 2023). [...] For experiments using LoRA fine-tuning, we use LoRA with the following configuration. Target layers: q_proj, v_proj. Optimizer: AdamW with a weight_decay of 0.01. Learning rate: 3e-4. Learning rate scheduler: cosine scheduling with 100 warm-up steps. Training configuration: batch size 8 with 4 gradient accumulation steps; number of epochs: 3.
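The reported learning-rate schedule (cosine scheduling with 100 warm-up steps at a 3e-4 base rate) can be sketched as below. This is an assumption-laden illustration, not the authors' code: linear warm-up and decay to a zero floor are common defaults that the quoted setup does not spell out.

```python
import math

# Sketch of the reported schedule: linear warm-up for 100 steps up to the
# base learning rate of 3e-4, then cosine decay. Decaying to zero is an
# assumption; the paper only states "cosine scheduling with 100 warm up steps".
def lr_at_step(step, total_steps, base_lr=3e-4, warmup_steps=100):
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Note that with a batch size of 8 and 4 gradient accumulation steps, the effective batch size per optimizer update is 32.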