Looking Beyond the Top-1: Transformers Determine Top Tokens in Order
Authors: Daria Lioubashevski, Tomer M. Schlank, Gabriel Stanovsky, Ariel Goldstein
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the saturation event. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens' ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden-layer embeddings, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency, and show how to exploit saturation events for better language modeling. ... In this section, we first extend the formal definition of top-1 saturation to account for arbitrary i-th ranking token saturation (Section 2.1). Building on this, we formulate two experiments to understand what computation the Transformer performs in the layers after the top-1 saturation event. The first experiment (Section 2.2) leverages our definition to develop a metric capturing the extent to which top tokens are saturated in order. The second experiment (Section 2.3) uses a probing approach to test whether it is possible to determine the rank of the token currently being determined by the model solely from intermediate-layer activations. |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel 2Department of Mathematics, The University of Chicago, Chicago, US 3Business School, The Hebrew University of Jerusalem, Jerusalem, Israel 4Department of Cognitive and Brain Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel. Correspondence to: Daria Lioubashevski <EMAIL>. |
| Pseudocode | No | The paper describes procedures and methods in paragraph text (e.g., Section 2.1, Section 2.2, Section 2.3, Appendix A.10) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments is available at: https://github.com/daria-lioubashevski/beyond-top1. |
| Open Datasets | Yes | We use 1K randomly sampled questions from the MMLU test split, which represent 60K-100K tokens (depending on the model). ... For our experiments we use the ViT-L/16 variant pretrained on ImageNet-21k and fine-tuned on ImageNet 2012, and run inference on 5K randomly sampled images from the CIFAR10 (Krizhevsky et al., 2009) dataset. ... For our dataset we randomly sample 5K audio clips from LibriSpeech (Panayotov et al., 2015). ... We use texts from CNN/DM and not MMLU for this experiment as they tend to be longer and have more pairs that fit our criteria for intervention (Hermann et al., 2015). |
| Dataset Splits | Yes | After extracting embeddings from 500 randomly sampled questions we split the data into train and test using 5-fold cross validation, and report the mean and standard error of the accuracy. |
| Hardware Specification | No | The paper mentions various Transformer models (Llama3-8B, GPT2-XL, Mistral-7B, Falcon-7B, ViT-L/16, Whisper-large), but does not provide specific hardware details (e.g., GPU models, CPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific pre-trained models such as Llama3-8B, GPT2-XL, Mistral-7B, Falcon-7B, ViT-L/16, and Whisper-large, and also discusses 8-bit quantized versions. However, it does not specify version numbers for any underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation. |
| Experiment Setup | No | The paper mentions using a 'simple one-versus-all multi-class logistic regression classifier' and describes how training data was categorized and balanced, and that 5-fold cross-validation was used. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) for training this classifier or any other explicit configuration details for the experiments. |
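The saturation events quoted above can be made concrete with a small sketch. The following is a hypothetical, dependency-free illustration of detecting the top-i saturation layer from per-layer token rankings (e.g., rankings obtained by projecting each layer's hidden state through the unembedding matrix); the function name and toy data are our own, not from the paper.

```python
def saturation_layer(per_layer_rankings, i):
    """Earliest layer l such that the rank-i token stays fixed
    from layer l through the final layer; None if never fixed."""
    final_token = per_layer_rankings[-1][i]
    sat = None
    for layer, ranking in enumerate(per_layer_rankings):
        if ranking[i] == final_token:
            if sat is None:
                sat = layer  # candidate saturation event
        else:
            sat = None  # prediction changed again; reset
    return sat

# Toy example: 4 "layers", vocabulary of 3 token ids, rankings best-first.
rankings = [
    [2, 0, 1],
    [2, 1, 0],
    [2, 1, 0],
    [2, 1, 0],
]
print(saturation_layer(rankings, 0))  # top-1 saturates at layer 0
print(saturation_layer(rankings, 1))  # rank-2 token saturates at layer 1
```

In this toy run the top-1 token saturates no later than the rank-2 token, mirroring the paper's claim that saturation events occur in order of the tokens' ranking.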
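The probing setup described under "Dataset Splits" and "Experiment Setup" (classify the rank currently being determined from hidden-layer embeddings, 5-fold cross-validation, mean and standard error of accuracy) can be sketched as follows. The paper uses a one-versus-all multi-class logistic regression classifier; to keep this sketch dependency-free, a nearest-centroid probe stands in for it, and the embeddings and labels are toy data we invented.

```python
import random
import statistics

def nearest_centroid_probe(train, test):
    """train/test: lists of (embedding, label). Returns test accuracy."""
    groups = {}
    for emb, label in train:
        groups.setdefault(label, []).append(emb)
    # Per-class mean embedding (centroid).
    centroids = {label: [sum(dim) / len(embs) for dim in zip(*embs)]
                 for label, embs in groups.items()}
    correct = 0
    for emb, label in test:
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(emb, centroids[c])))
        correct += pred == label
    return correct / len(test)

def five_fold_accuracy(data, k=5, seed=0):
    """Mean accuracy and standard error over k cross-validation folds."""
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(nearest_centroid_probe(train, test))
    return statistics.mean(scores), statistics.stdev(scores) / k ** 0.5

# Toy data: 2-D "embeddings" whose cluster encodes which rank (0 or 1)
# the model is currently determining.
random.seed(0)
data = [([random.gauss(mu, 0.1), random.gauss(-mu, 0.1)], label)
        for label, mu in [(0, 1.0), (1, -1.0)] for _ in range(50)]
mean_acc, stderr = five_fold_accuracy(data)
print(f"probe accuracy: {mean_acc:.2f} +/- {stderr:.2f}")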