Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Teaching Arithmetic to Small Transformers

Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy.
Researcher Affiliation | Academia | Nayoung Lee (University of Wisconsin-Madison, EMAIL); Kartik Sreenivasan (University of Wisconsin-Madison, EMAIL); Jason D. Lee (Princeton University, EMAIL); Kangwook Lee (University of Wisconsin-Madison, EMAIL); Dimitris Papailiopoulos (University of Wisconsin-Madison, EMAIL)
Pseudocode | Yes | We present the full pseudo-code in Algorithm 1.
Open Source Code | Yes | Our code is available at https://github.com/lee-ny/teaching_arithmetic
Open Datasets | Yes | For arithmetic tasks like addition, subtraction, and multiplication, we define the training dataset for a binary operator f(·) as D_train = {((a_i, b_i), y_i)}_{i=1}^{N}, where y_i = f(a_i, b_i). ... We use the Shakespeare dataset (Karpathy, 2015) that includes 1,115,394 tokens of text...
Dataset Splits | Yes | The learning rate is chosen from {1e-3, 5e-4, 1e-4, 5e-5} based on validation loss.
Hardware Specification | Yes | All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s.
Software Dependencies | Yes | All of our experiments on NanoGPT and GPT-2 models are run using PyTorch 2.1 and CUDA 11.7 on NVIDIA 2080 Tis and NVIDIA 3090s.
Experiment Setup | Yes | In this section, we provide a detailed description of our experimental setup, including the model architecture and an overview of the various data formatting and sampling techniques used. ... Table 16: Hyperparameters used for NanoGPT experiments on arithmetic tasks ... Table 17: Hyperparameters used for GPT-2 experiments on arithmetic tasks
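The "Open Datasets" row defines the training set for a binary operator f as pairs ((a_i, b_i), y_i) with y_i = f(a_i, b_i), rendered as text for next-token prediction, and the abstract notes that simple formatting changes (such as reversing the output digits) improve accuracy. The sketch below illustrates that construction for addition. The function name, the "a+b=c" string format, and the uniform sampling scheme are illustrative assumptions, not the paper's exact data pipeline.

```python
import random

def make_addition_dataset(n_samples, n_digits=3, reverse_output=False, seed=0):
    """Build addition samples as plain-text strings for next-token prediction.

    reverse_output=True writes the answer least-significant digit first,
    one example of the simple formatting changes the paper studies.
    (Exact delimiters and sampling are assumptions for illustration.)
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        a = rng.randrange(10 ** n_digits)  # operand a_i
        b = rng.randrange(10 ** n_digits)  # operand b_i
        y = str(a + b)                     # label y_i = f(a_i, b_i)
        if reverse_output:
            y = y[::-1]
        samples.append(f"{a}+{b}={y}")
    return samples
```

A model trained on such strings only ever sees characters; the hypothesis tested in the paper is that the order in which answer digits appear changes how easily the next-token objective can fit the carry logic.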