Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

Authors: Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "In this work, we achieve approximately 2–3× length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and apply multi-level versions of Position Coupling... Remarkably, with our proposed multi-level Position Coupling with scratchpad, we achieve a significant length generalization superior to all other methods." (Introduction, page 1); "We present median exact-match accuracies for 6-layer 8-head decoder-only Transformers trained on multi-operand additions..." (Figure 1 caption); "We trained models using 3 different PE methods (Position Coupling, NoPE, and FIRE), both with and without scratchpad." (Section 4.3).
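The core idea behind Position Coupling is that tokens of the same digit significance share a position ID, so the model can align columns across operands regardless of operand length or count. The sketch below illustrates this for multi-operand addition; the function name and the exact ID scheme (separators mapped to a sentinel ID of 0) are illustrative assumptions, not the paper's precise construction.

```python
def coupled_position_ids(operands, start_id=1):
    """Assign one position ID per digit significance, shared across operands.

    operands: list of digit strings, most-significant digit first
              (e.g. ["123", "4567"] for 123 + 4567).
    Returns (token, position_id) pairs for the flattened input "123+4567=";
    '+' and '=' receive a sentinel ID (0 here, an illustrative choice).
    """
    pairs = []
    for k, op in enumerate(operands):
        n = len(op)
        for i, ch in enumerate(op):
            # digit op[i] has significance n - 1 - i; couple IDs by significance
            pairs.append((ch, start_id + (n - 1 - i)))
        sep = '+' if k < len(operands) - 1 else '='
        pairs.append((sep, 0))
    return pairs
```

For "123+4567=", the units digits '3' and '7' both receive position ID 1, while '1' (hundreds) and '5' (hundreds) both receive ID 3, making column alignment independent of where each digit sits in the sequence.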
Researcher Affiliation | Collaboration | Hanseul Cho, Jaeyoung Cha (Kim Jaechul Graduate School of AI, KAIST); Srinadh Bhojanapalli (Google Research); Chulhee Yun (Kim Jaechul Graduate School of AI, KAIST)
Pseudocode | No | The paper provides a formal construction and theoretical analysis in Section 4.4 and Appendix D, which uses mathematical expressions and descriptions rather than pseudocode or algorithm blocks. For example, Section D.5.1 states: "We define the query matrix Q1 and the key matrix K1 as follows: [matrix definitions]...". This describes a constructive proof for a specific model, not a general algorithm in pseudocode format.
Open Source Code | Yes | "All our experiments are run with code in github.com/HanseulJo/position-coupling." (footnote 1)
Open Datasets | No | "For the sake of simplicity of notation, we denote a training set by SA(n, m), consisting of addition problems where each operand can have at most n digits and the operand count can range from 2 to m. The dataset consists of two equally sized chunks. In the first chunk, for each sample, the number of operands is uniformly sampled from [2 : m], and the length of each operand is independently sampled from [1 : n]..." (Section 4.2, "Data Sampling"). The paper describes the process of generating synthetic datasets but does not provide specific access information (link, DOI, repository) for a publicly available or open dataset.
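Because the dataset is synthetic, the sampling procedure quoted above is straightforward to reproduce. The sketch below models only the first of the paper's two equally sized chunks, under the stated distributions; the function name, the answer format, and the allowance of leading zeros are illustrative assumptions.

```python
import random

def sample_SA(n, m, size, seed=0):
    """Sketch of the first chunk of SA(n, m) from Section 4.2:
    operand count uniform on [2 : m], each operand's length
    independently uniform on [1 : n]. Hypothetical interface.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(size):
        num_operands = rng.randint(2, m)        # uniform on [2 : m]
        operands = []
        for _ in range(num_operands):
            length = rng.randint(1, n)          # uniform on [1 : n]
            # Leading digits may be zero here -- an assumption; the
            # paper may constrain operand formats differently.
            digits = ''.join(str(rng.randint(0, 9)) for _ in range(length))
            operands.append(digits)
        answer = sum(int(d) for d in operands)
        dataset.append(('+'.join(operands) + '=', str(answer)))
    return dataset
```

A call like `sample_SA(10, 10, 500_000)` would then correspond to the baseline SA(10, 10) training set size reported in the paper.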
Dataset Splits | Yes | "We use SA(10, 10) with size 500,000 as the baseline training set. For model evaluation, we draw a 30×29 grid heatmap and assess the performance of the model on the test set TA(i, j) of size 1,000 for every entry (i, j) ∈ [1 : 30] × [2 : 30]."
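The evaluation sweep described in the quote is a full grid over operand length i and operand count j. A minimal sketch of that loop, assuming a hypothetical `model_accuracy(i, j)` callable that returns exact-match accuracy on TA(i, j):

```python
def evaluation_grid(model_accuracy, max_len=30, max_count=30):
    """Sketch of the 30x29 evaluation sweep: one accuracy value per
    test set TA(i, j), for i in [1 : 30] and j in [2 : 30].
    'model_accuracy' is a stand-in for the actual evaluation routine.
    """
    grid = {}
    for i in range(1, max_len + 1):         # operand length: 1..30
        for j in range(2, max_count + 1):   # operand count: 2..30
            grid[(i, j)] = model_accuracy(i, j)
    return grid
```

The resulting dictionary has 30 × 29 = 870 entries, one per heatmap cell.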
Hardware Specification | Yes | Device: NVIDIA RTX A6000 (48GB)
Software Dependencies | No | "This codebase includes a custom implementation of a decoder-only T5 model (Raffel et al., 2020) built upon PyTorch (Paszke et al., 2019) and Hugging Face (Wolf et al., 2019)..." (Section B, "EXPERIMENTAL DETAILS", page 18). While software tools like PyTorch and Hugging Face are mentioned, specific version numbers for these dependencies are not provided.
Experiment Setup | Yes | Tables 1, 2, and 3 in Section B, "EXPERIMENTAL DETAILS", provide detailed hyperparameter summaries for the parity, addition, and multiplication tasks, including "Number of Layers", "Number of Attention Heads", "Embedding Dimension", "Training Steps", "Batch Size", "Optimizer", "Learning Rate (LR)", "LR Warm-up", "LR Cool-down", and "Maximum Position ID" (pages 18-19).