Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

Authors: Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "In this work, we achieve approximately 2–3× length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and apply multi-level versions of Position Coupling... Remarkably, with our proposed multi-level Position Coupling with scratchpad, we achieve a significant length generalization superior to all other methods." (Introduction, page 1); "We present median exact-match accuracies for 6-layer 8-head decoder-only Transformers trained on multi-operand additions..." (Figure 1 caption); "We trained models using 3 different PE methods (Position Coupling, NoPE, and FIRE), both with and without scratchpad." (Section 4.3).
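The core idea behind Position Coupling is that tokens of the same digit significance share a position ID, so the model can align columns across operands regardless of operand length or count. The sketch below illustrates this for multi-operand addition; the function name and the exact ID scheme (separators mapped to a sentinel ID of 0) are illustrative assumptions, not the paper's precise construction.

```python
def coupled_position_ids(operands, start_id=1):
    """Assign one position ID per digit significance, shared across operands.

    operands: list of digit strings, most-significant digit first
              (e.g. ["123", "4567"] for 123 + 4567).
    Returns (token, position_id) pairs for the flattened input "123+4567=";
    '+' and '=' receive a sentinel ID (0 here, an illustrative choice).
    """
    pairs = []
    for k, op in enumerate(operands):
        n = len(op)
        for i, ch in enumerate(op):
            # digit op[i] has significance n - 1 - i; couple IDs by significance
            pairs.append((ch, start_id + (n - 1 - i)))
        sep = '+' if k < len(operands) - 1 else '='
        pairs.append((sep, 0))
    return pairs
```

For "123+4567=", the units digits '3' and '7' both receive position ID 1, while '1' (hundreds) and '5' (hundreds) both receive ID 3, making column alignment independent of where each digit sits in the sequence.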
Researcher Affiliation | Collaboration | Hanseul Cho, Jaeyoung Cha (Kim Jaechul Graduate School of AI, KAIST); Srinadh Bhojanapalli (Google Research); Chulhee Yun (Kim Jaechul Graduate School of AI, KAIST)
Pseudocode | No | The paper provides a formal construction and theoretical analysis in Section 4.4 and Appendix D, which uses mathematical expressions and descriptions rather than pseudocode or algorithm blocks. For example, Section D.5.1 states: "We define the query matrix Q1 and the key matrix K1 as follows: [matrix definitions]...". This describes a constructive proof for a specific model, not a general algorithm in pseudocode format.
Open Source Code | Yes | "All our experiments are run with code in github.com/HanseulJo/position-coupling." (footnote 1)
Open Datasets | No | "For the sake of simplicity of notation, we denote a training set by SA(n, m), consisting of addition problems where each operand can have at most n digits and the operand count can range from 2 to m. The dataset consists of two equally sized chunks. In the first chunk, for each sample, the number of operands is uniformly sampled from [2 : m], and the length of each operand is independently sampled from [1 : n]..." (Section 4.2, "Data Sampling"). The paper describes the process of generating synthetic datasets but does not provide specific access information (link, DOI, repository) for a publicly available or open dataset.
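Because the dataset is synthetic, the sampling procedure quoted above is straightforward to reproduce. The sketch below models only the first of the paper's two equally sized chunks, under the stated distributions; the function name, the answer format, and the allowance of leading zeros are illustrative assumptions.

```python
import random

def sample_SA(n, m, size, seed=0):
    """Sketch of the first chunk of SA(n, m) from Section 4.2:
    operand count uniform on [2 : m], each operand's length
    independently uniform on [1 : n]. Hypothetical interface.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(size):
        num_operands = rng.randint(2, m)        # uniform on [2 : m]
        operands = []
        for _ in range(num_operands):
            length = rng.randint(1, n)          # uniform on [1 : n]
            # Leading digits may be zero here -- an assumption; the
            # paper may constrain operand formats differently.
            digits = ''.join(str(rng.randint(0, 9)) for _ in range(length))
            operands.append(digits)
        answer = sum(int(d) for d in operands)
        dataset.append(('+'.join(operands) + '=', str(answer)))
    return dataset
```

A call like `sample_SA(10, 10, 500_000)` would then correspond to the baseline SA(10, 10) training set size reported in the paper.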
Dataset Splits | Yes | "We use SA(10, 10) with size 500,000 as the baseline training set. For model evaluation, we draw a 30×29 grid heatmap and assess the performance of the model on the test set TA(i, j) of size 1,000 for every entry (i, j) ∈ [1 : 30] × [2 : 30]."
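The evaluation sweep described in the quote is a full grid over operand length i and operand count j. A minimal sketch of that loop, assuming a hypothetical `model_accuracy(i, j)` callable that returns exact-match accuracy on TA(i, j):

```python
def evaluation_grid(model_accuracy, max_len=30, max_count=30):
    """Sketch of the 30x29 evaluation sweep: one accuracy value per
    test set TA(i, j), for i in [1 : 30] and j in [2 : 30].
    'model_accuracy' is a stand-in for the actual evaluation routine.
    """
    grid = {}
    for i in range(1, max_len + 1):         # operand length: 1..30
        for j in range(2, max_count + 1):   # operand count: 2..30
            grid[(i, j)] = model_accuracy(i, j)
    return grid
```

The resulting dictionary has 30 × 29 = 870 entries, one per heatmap cell.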
Hardware Specification | Yes | Device: NVIDIA RTX A6000 (48GB)
Software Dependencies | No | "This codebase includes a custom implementation of a decoder-only T5 model (Raffel et al., 2020) built upon PyTorch (Paszke et al., 2019) and Hugging Face (Wolf et al., 2019)..." (Section B, "EXPERIMENTAL DETAILS", page 18). While software tools like PyTorch and Hugging Face are mentioned, specific version numbers for these dependencies are not provided.
Experiment Setup | Yes | Tables 1, 2, and 3 in Section B, "EXPERIMENTAL DETAILS", provide detailed hyperparameter summaries for the parity, addition, and multiplication tasks, including "Number of Layers", "Number of Attention Heads", "Embedding Dimension", "Training Steps", "Batch Size", "Optimizer", "Learning Rate (LR)", "LR Warm-up", "LR Cool-down", and "Maximum Position ID" (pages 18-19).