Looking Beyond the Top-1: Transformers Determine Top Tokens in Order
Authors: Daria Lioubashevski, Tomer M. Schlank, Gabriel Stanovsky, Ariel Goldstein
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the saturation event. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens' ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden-layer embeddings, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency, and show how to exploit saturation events for better language modeling. ... In this section, we first extend the formal definition of top-1 saturation to account for arbitrary i-th ranking token saturation (Section 2.1). Building on this, we formulate two experiments to understand what computation the Transformer performs in the layers after the top-1 saturation event. The first experiment (Section 2.2) leverages our definition to develop a metric capturing the extent to which top tokens are saturated in order. The second experiment (Section 2.3) uses a probing approach to test whether it is possible to determine the rank of the token currently being determined by the model solely from intermediate-layer activations. |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel 2Department of Mathematics, The University of Chicago, Chicago, US 3Business School, The Hebrew University of Jerusalem, Jerusalem, Israel 4Department of Cognitive and Brain Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel. Correspondence to: Daria Lioubashevski <EMAIL>. |
| Pseudocode | No | The paper describes procedures and methods in paragraph text (e.g., Section 2.1, Section 2.2, Section 2.3, Appendix A.10) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments is available at: https://github.com/daria-lioubashevski/beyond-top1. |
| Open Datasets | Yes | We use 1K randomly sampled questions from the MMLU test split, which represent 60K-100K tokens (depending on the model). ... For our experiments we use the ViT-L/16 variant pretrained on ImageNet-21k and fine-tuned on ImageNet 2012, and run inference on 5K randomly sampled images from the CIFAR10 (Krizhevsky et al., 2009) dataset. ... For our dataset we randomly sample 5K audio clips from LibriSpeech (Panayotov et al., 2015). ... We use texts from CNN/DM and not MMLU for this experiment as they tend to be longer and have more pairs that fit our criteria for intervention (Hermann et al., 2015). |
| Dataset Splits | Yes | After extracting embeddings from 500 randomly sampled questions we split the data into train and test using 5-fold cross validation, and report the mean and standard error of the accuracy. |
| Hardware Specification | No | The paper mentions various Transformer models (Llama3-8B, GPT2-XL, Mistral-7B, Falcon-7B, ViT-L/16, Whisper-large), but does not provide specific hardware details (e.g., GPU models, CPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific pre-trained models such as Llama3-8B, GPT2-XL, Mistral-7B, Falcon-7B, ViT-L/16, and Whisper-large, and also discusses 8-bit quantized versions. However, it does not specify version numbers for any underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation. |
| Experiment Setup | No | The paper mentions using a 'simple one-versus-all multi-class logistic regression classifier' and describes how training data was categorized and balanced, and that 5-fold cross-validation was used. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) for training this classifier or any other explicit configuration details for the experiments. |
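The saturation events quoted above can be made concrete with a small sketch. The following is a hypothetical, dependency-free illustration of detecting the top-i saturation layer from per-layer token rankings (e.g., rankings obtained by projecting each layer's hidden state through the unembedding matrix); the function name and toy data are our own, not from the paper.

```python
def saturation_layer(per_layer_rankings, i):
    """Earliest layer l such that the rank-i token stays fixed
    from layer l through the final layer; None if never fixed."""
    final_token = per_layer_rankings[-1][i]
    sat = None
    for layer, ranking in enumerate(per_layer_rankings):
        if ranking[i] == final_token:
            if sat is None:
                sat = layer  # candidate saturation event
        else:
            sat = None  # prediction changed again; reset
    return sat

# Toy example: 4 "layers", vocabulary of 3 token ids, rankings best-first.
rankings = [
    [2, 0, 1],
    [2, 1, 0],
    [2, 1, 0],
    [2, 1, 0],
]
print(saturation_layer(rankings, 0))  # top-1 saturates at layer 0
print(saturation_layer(rankings, 1))  # rank-2 token saturates at layer 1
```

In this toy run the top-1 token saturates no later than the rank-2 token, mirroring the paper's claim that saturation events occur in order of the tokens' ranking.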
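The probing setup described under "Dataset Splits" and "Experiment Setup" (classify the rank currently being determined from hidden-layer embeddings, 5-fold cross-validation, mean and standard error of accuracy) can be sketched as follows. The paper uses a one-versus-all multi-class logistic regression classifier; to keep this sketch dependency-free, a nearest-centroid probe stands in for it, and the embeddings and labels are toy data we invented.

```python
import random
import statistics

def nearest_centroid_probe(train, test):
    """train/test: lists of (embedding, label). Returns test accuracy."""
    groups = {}
    for emb, label in train:
        groups.setdefault(label, []).append(emb)
    # Per-class mean embedding (centroid).
    centroids = {label: [sum(dim) / len(embs) for dim in zip(*embs)]
                 for label, embs in groups.items()}
    correct = 0
    for emb, label in test:
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(emb, centroids[c])))
        correct += pred == label
    return correct / len(test)

def five_fold_accuracy(data, k=5, seed=0):
    """Mean accuracy and standard error over k cross-validation folds."""
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(nearest_centroid_probe(train, test))
    return statistics.mean(scores), statistics.stdev(scores) / k ** 0.5

# Toy data: 2-D "embeddings" whose cluster encodes which rank (0 or 1)
# the model is currently determining.
random.seed(0)
data = [([random.gauss(mu, 0.1), random.gauss(-mu, 0.1)], label)
        for label, mu in [(0, 1.0), (1, -1.0)] for _ in range(50)]
mean_acc, stderr = five_fold_accuracy(data)
print(f"probe accuracy: {mean_acc:.2f} +/- {stderr:.2f}")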