Encoder-only Next Token Prediction

Authors: Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce the Count3 task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language modeling.
Researcher Affiliation | Academia | Ethan Ewer (EMAIL, University of Wisconsin-Madison); Daewon Chae (EMAIL, Korea University); Thomas Zeng (EMAIL, University of Wisconsin-Madison); Jinkyu Kim (EMAIL, Korea University); Kangwook Lee (EMAIL, University of Wisconsin-Madison)
Pseudocode | Yes | Algorithm 1: Implementation of Attention Using O(D) Memory; Algorithm 2: Algorithm to compute Count3 in O(n^2) time and O(1) space; Algorithm 3: Algorithm to compute Count3 in O(n) time and O(n) space; Algorithm 4: Count3 RASP Encoder Implementation; Algorithm 5: Match3 RASP Decoder Implementation; Algorithm 6: Count3 RASP Decoder CoT Implementation
Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the methodology described, nor does it include a link to a code repository. The OpenReview link is for the paper review process, not code.
Open Datasets | Yes | We train Transformer models on the OpenWebText dataset (Gokaslan et al., 2019), an open-source replication of WebText used for GPT-2 training (Radford et al., 2019), using a next-token prediction objective. To assess commonsense reasoning, we use the TinyWinoGrande benchmark (Polo et al., 2024), which tests pronoun resolution. We further evaluate the models on an NLP classification task using CLUTRR (Sinha et al., 2019), which requires identifying familial relationships from text.
Dataset Splits | Yes | To generate unique sequences, we start each sequence with a seed containing 16 random integers between 0 and 63. Then we extend the sequence to 64 integers using Equation (6)... Seeds are generated randomly during training and evaluation. We sample the dataset of all possible 3-digit addition examples... Then we randomly remove 90% of the 3-digit addition examples, adjusting the ratio of 3-digit to 2-digit examples from around 100:1 to around 10:1. Next we split the data into training, testing, and validation splits, stratified by the number of digits and carries in each addition example. All 1-digit addition examples were put into the training split. Training is performed on numbers with up to 10 digits, while testing extends to numbers with up to 15 digits. We randomly select 10,000 examples for the training set and conduct each experiment using three different random seeds.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or specific cloud instance types used for running the experiments. It mentions support from 'FuriosaAI' in acknowledgments, but not that their hardware was used for experiments.
Software Dependencies | No | The paper does not explicitly state the version numbers of any software dependencies used for the experiments (e.g., Python, PyTorch, TensorFlow, specific libraries).
Experiment Setup | Yes | For GPT-4o, we used the official API, setting the batch size to 4 and the learning rate multiplier to 10. For Llama3-8B, we employed LoRA fine-tuning (Hu et al., 2022) with a batch size of 16 and a learning rate of 1.4 * 10^-4. Table 6: Model specifications (Name, Number of Layers, Number of Heads, Embedding Dimension). Table 8: OpenWebText Hyperparameters (warmup_iters, lr_decay_iters, min_lr, max_lr, beta1, beta2, weight_decay, block_size, batch_size). All sample complexity tests are run with at least 5 different seeds. All length generalization tests are run with 3 different seeds.
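The two Count3 algorithms listed in the Pseudocode row (O(n^2) time / O(1) space and O(n) time / O(n) space) can be sketched in Python. The paper gives the exact task definition; here we assume, by analogy with the related Match3 task, that Count3 counts unordered pairs whose sum with the final token is 0 mod M — the function names and the modulus M = 64 are illustrative assumptions, not the paper's code.

```python
def count3_quadratic(x, M=64):
    """O(n^2) time, O(1) extra space: count unordered pairs (j, k), j < k,
    with x[j] + x[k] + x[-1] == 0 (mod M). (Assumed task definition.)"""
    n = len(x)
    count = 0
    for j in range(n):
        for k in range(j + 1, n):
            if (x[j] + x[k] + x[-1]) % M == 0:
                count += 1
    return count


def count3_linear(x, M=64):
    """O(n) time, O(n) space: tally residues once, then pair each residue r
    with its complement (target - r) mod M instead of scanning all pairs."""
    freq = [0] * M
    for v in x:
        freq[v % M] += 1
    target = (-x[-1]) % M  # pairs must sum to this residue
    count = 0
    for r in range(M):
        c = (target - r) % M
        if r < c:
            count += freq[r] * freq[c]
        elif r == c:
            # pairs drawn from the same residue class: C(freq[r], 2)
            count += freq[r] * (freq[r] - 1) // 2
    return count
```

Both return the same count; the linear variant mirrors the space/time trade-off the report's Algorithms 2 and 3 describe.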
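The addition-data pipeline quoted in the Dataset Splits row can be sketched as follows. Since no code is released, the function names, the 80/10/10 split ratio, and the exact stratification key are assumptions; only the 90% removal of 3-digit examples, the stratification by digits and carries, and the all-1-digit-to-train rule come from the quoted text.

```python
import random
from collections import defaultdict


def num_carries(a, b):
    """Count carry operations when adding a and b digit by digit."""
    carry = carries = 0
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        carry = s // 10
        carries += carry
        a //= 10
        b //= 10
    return carries


def build_addition_splits(seed=0, keep_3digit=0.1, ratios=(0.8, 0.1, 0.1)):
    """Sketch of the described pipeline; ratios and names are assumptions."""
    rng = random.Random(seed)
    examples = [(a, b) for a in range(1000) for b in range(1000)]
    # Randomly remove 90% of 3-digit examples to rebalance vs. 2-digit ones.
    examples = [e for e in examples
                if max(e) < 100 or rng.random() < keep_3digit]
    splits = {"train": [], "test": [], "val": []}
    strata = defaultdict(list)
    for a, b in examples:
        if a < 10 and b < 10:
            splits["train"].append((a, b))  # all 1-digit examples -> train
        else:
            # Stratify by (digit count, carry count) before splitting.
            strata[(len(str(max(a, b))), num_carries(a, b))].append((a, b))
    for group in strata.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_test = int(ratios[1] * len(group))
        splits["train"] += group[:n_train]
        splits["test"] += group[n_train:n_train + n_test]
        splits["val"] += group[n_train + n_test:]
    return splits
```

Splitting within each (digits, carries) stratum keeps the difficulty mix comparable across train, test, and validation, which is the point of the stratification the paper describes.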
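The OpenWebText hyperparameter names quoted in the Experiment Setup row (warmup_iters, lr_decay_iters, min_lr, max_lr) match the linear-warmup-plus-cosine-decay learning-rate schedule popularized by nanoGPT for GPT-2-style training. A sketch under that assumption, with placeholder values rather than the paper's actual Table 8 settings:

```python
import math


def get_lr(it, warmup_iters=2000, lr_decay_iters=600_000,
           max_lr=6e-4, min_lr=6e-5):
    """Assumed nanoGPT-style schedule: linear warmup to max_lr,
    cosine decay down to min_lr, then constant at min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

This is only a plausible reading of the parameter names; the paper's actual schedule and values are those in its Table 8.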