Encoder-only Next Token Prediction

Authors: Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce the Count3 task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language modeling.
Researcher Affiliation | Academia | Ethan Ewer (EMAIL, University of Wisconsin-Madison); Daewon Chae (EMAIL, Korea University); Thomas Zeng (EMAIL, University of Wisconsin-Madison); Jinkyu Kim (EMAIL, Korea University); Kangwook Lee (EMAIL, University of Wisconsin-Madison)
Pseudocode | Yes | Algorithm 1: Implementation of Attention Using O(D) Memory; Algorithm 2: Algorithm to compute Count3 in O(n^2) time and O(1) space; Algorithm 3: Algorithm to compute Count3 in O(n) time and O(n) space; Algorithm 4: Count3 RASP Encoder Implementation; Algorithm 5: Match3 RASP Decoder Implementation; Algorithm 6: Count3 RASP Decoder CoT Implementation
Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the methodology described, nor does it include a link to a code repository. The OpenReview link is for the paper review process, not code.
Open Datasets | Yes | We train Transformer models on the OpenWebText dataset (Gokaslan et al., 2019), an open-source replication of WebText used for GPT-2 training (Radford et al., 2019), using a next-token prediction objective. To assess commonsense reasoning, we use the TinyWinoGrande benchmark (Polo et al., 2024), which tests pronoun resolution. We further evaluate the models on an NLP classification task using CLUTRR (Sinha et al., 2019), which requires identifying familial relationships from text.
Dataset Splits | Yes | To generate unique sequences, we start each sequence with a seed containing 16 random integers between 0 and 63. Then we extend the sequence to 64 integers using Equation (6)... Seeds are generated randomly during training and evaluation. We sample the dataset of all possible 3-digit addition examples... Then we randomly remove 90% of the 3-digit addition examples, adjusting the ratio of 3-digit to 2-digit examples from around 100:1 to around 10:1. Next we split the data into training, testing, and validation splits, stratified by the number of digits and carries in each addition example. All 1-digit addition examples were put into the training split. Training is performed on numbers with up to 10 digits, while testing extends to numbers with up to 15 digits. We randomly select 10,000 examples for the training set and conduct each experiment using three different random seeds.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or specific cloud instance types used for running the experiments. It mentions support from 'FuriosaAI' in acknowledgments, but not that their hardware was used for experiments.
Software Dependencies | No | The paper does not explicitly state the version numbers of any software dependencies used for the experiments (e.g., Python, PyTorch, TensorFlow, specific libraries).
Experiment Setup | Yes | For GPT-4o, we used the official API, setting the batch size to 4 and the learning rate multiplier to 10. For Llama3-8B, we employed LoRA fine-tuning (Hu et al., 2022) with a batch size of 16 and a learning rate of 1.4 * 10^-4. Table 6: Model specifications (Name, Number of Layers, Number of Heads, Embedding Dimension). Table 8: OpenWebText Hyperparameters (warmup_iters, lr_decay_iters, min_lr, max_lr, beta1, beta2, weight_decay, block_size, batch_size). All sample complexity tests are run with at least 5 different seeds. All length generalization tests are run with 3 different seeds.
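The two Count3 algorithms listed in the Pseudocode row (O(n^2) time / O(1) space and O(n) time / O(n) space) can be sketched in Python. The paper gives the exact task definition; here we assume, by analogy with the related Match3 task, that Count3 counts unordered pairs whose sum with the final token is 0 mod M — the function names and the modulus M = 64 are illustrative assumptions, not the paper's code.

```python
def count3_quadratic(x, M=64):
    """O(n^2) time, O(1) extra space: count unordered pairs (j, k), j < k,
    with x[j] + x[k] + x[-1] == 0 (mod M). (Assumed task definition.)"""
    n = len(x)
    count = 0
    for j in range(n):
        for k in range(j + 1, n):
            if (x[j] + x[k] + x[-1]) % M == 0:
                count += 1
    return count


def count3_linear(x, M=64):
    """O(n) time, O(n) space: tally residues once, then pair each residue r
    with its complement (target - r) mod M instead of scanning all pairs."""
    freq = [0] * M
    for v in x:
        freq[v % M] += 1
    target = (-x[-1]) % M  # pairs must sum to this residue
    count = 0
    for r in range(M):
        c = (target - r) % M
        if r < c:
            count += freq[r] * freq[c]
        elif r == c:
            # pairs drawn from the same residue class: C(freq[r], 2)
            count += freq[r] * (freq[r] - 1) // 2
    return count
```

Both return the same count; the linear variant mirrors the space/time trade-off the report's Algorithms 2 and 3 describe.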
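The addition-data pipeline quoted in the Dataset Splits row can be sketched as follows. Since no code is released, the function names, the 80/10/10 split ratio, and the exact stratification key are assumptions; only the 90% removal of 3-digit examples, the stratification by digits and carries, and the all-1-digit-to-train rule come from the quoted text.

```python
import random
from collections import defaultdict


def num_carries(a, b):
    """Count carry operations when adding a and b digit by digit."""
    carry = carries = 0
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        carry = s // 10
        carries += carry
        a //= 10
        b //= 10
    return carries


def build_addition_splits(seed=0, keep_3digit=0.1, ratios=(0.8, 0.1, 0.1)):
    """Sketch of the described pipeline; ratios and names are assumptions."""
    rng = random.Random(seed)
    examples = [(a, b) for a in range(1000) for b in range(1000)]
    # Randomly remove 90% of 3-digit examples to rebalance vs. 2-digit ones.
    examples = [e for e in examples
                if max(e) < 100 or rng.random() < keep_3digit]
    splits = {"train": [], "test": [], "val": []}
    strata = defaultdict(list)
    for a, b in examples:
        if a < 10 and b < 10:
            splits["train"].append((a, b))  # all 1-digit examples -> train
        else:
            # Stratify by (digit count, carry count) before splitting.
            strata[(len(str(max(a, b))), num_carries(a, b))].append((a, b))
    for group in strata.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_test = int(ratios[1] * len(group))
        splits["train"] += group[:n_train]
        splits["test"] += group[n_train:n_train + n_test]
        splits["val"] += group[n_train + n_test:]
    return splits
```

Splitting within each (digits, carries) stratum keeps the difficulty mix comparable across train, test, and validation, which is the point of the stratification the paper describes.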
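The OpenWebText hyperparameter names quoted in the Experiment Setup row (warmup_iters, lr_decay_iters, min_lr, max_lr) match the linear-warmup-plus-cosine-decay learning-rate schedule popularized by nanoGPT for GPT-2-style training. A sketch under that assumption, with placeholder values rather than the paper's actual Table 8 settings:

```python
import math


def get_lr(it, warmup_iters=2000, lr_decay_iters=600_000,
           max_lr=6e-4, min_lr=6e-5):
    """Assumed nanoGPT-style schedule: linear warmup to max_lr,
    cosine decay down to min_lr, then constant at min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

This is only a plausible reading of the parameter names; the paper's actual schedule and values are those in its Table 8.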