Interpreting the Repeated Token Phenomenon in Large Language Models

Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Yossi Gandelsman

ICML 2025

Reproducibility Variable / Result / LLM Response
Research Type: Experimental
"We present empirical evidence for this two-stage mechanism across multiple LLMs. To investigate the role of specific neurons in mediating high norms, we perform an ablation study. We zero-ablated the candidate neurons... Ablating specific neurons significantly reduces the high norms associated with repeated tokens. Data is from LLaMA-2. Table 2. The effect of patching on unrelated tasks. We compare LLaMA-1, LLaMA-2 and Mistral, before and after the patching on different benchmarks."
Researcher Affiliation: Collaboration
"1Google DeepMind, 2UC Berkeley. Correspondence to: Itay Yona <EMAIL>."
Pseudocode: Yes
Listing 1. A manual patch to fix repeated tokens issue.

tmp_output = None
sink_neuron = 7890
sink_layer = 1

def patch_sink(x, phase):
    global tmp_output
    if phase == "prefill":
        tmp_output = x[:, 1, sink_neuron]
        x[:, 1:, sink_neuron] = tmp_output
    if phase == "decode":
        x[:, 0, sink_neuron] = tmp_output

patch_block = model.blocks[sink_layer]
patch_block.mlp.up_proj.hook(patch_sink)
Open Source Code: Yes
"Code is available here."
Open Datasets: Yes
"We find that for Pythia-12b (Biderman et al., 2023), an open-source model trained on the publicly available The Pile dataset (Gao et al., 2020)... We also show the norm of the BoS token and the average norm of tokens from the Tiny Shakespeare dataset (Andrej, 2015) for comparison."
Dataset Splits: No
The paper evaluates LLMs on standard benchmarks (MMLU, HellaSwag, TruthfulQA, WinoGrande, AI2-ARC) and uses existing datasets such as The Pile and Tiny Shakespeare, but it does not explicitly describe any specific training/test/validation splits or split methodologies for the experiments conducted in this paper.

Hardware Specification: No
The paper does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It mentions the LLM models analyzed but not the computational resources for the analysis itself.

Software Dependencies: No
The paper refers to various Large Language Models (LLMs) and includes a Python-like code snippet in Listing 1, but it does not specify any software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch or TensorFlow).

Experiment Setup: No
The paper describes a 'manual patch' to fix repeated tokens and an ablation study, but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level training settings.
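Listing 1 abstracts the hooking mechanism (`phase`, `.hook(...)`) away from any concrete framework. As a rough illustration of the idea, the sketch below wires an equivalent sink-neuron patch into PyTorch via `register_forward_hook`, distinguishing prefill from decode by sequence length. The two-layer `nn.Linear` stack, layer index, and neuron index are toy stand-ins for illustration only, not the paper's actual model or values.

```python
import torch
import torch.nn as nn

SINK_LAYER = 1   # toy stand-in for the paper's sink_layer
SINK_NEURON = 7  # toy stand-in for the paper's sink_neuron
tmp_output = None

def patch_sink(module, inputs, output):
    """Copy the sink neuron's activation at prompt position 1 to every
    later position, mirroring Listing 1's prefill/decode patch."""
    global tmp_output
    seq_len = output.shape[1]
    if seq_len > 1:  # "prefill": the whole prompt passes through at once
        tmp_output = output[:, 1, SINK_NEURON].clone()
        output[:, 1:, SINK_NEURON] = tmp_output.unsqueeze(-1)
    elif tmp_output is not None:  # "decode": one new token per step
        output[:, 0, SINK_NEURON] = tmp_output
    return output

# Toy stand-in for transformer blocks' MLP up-projections.
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])
handle = blocks[SINK_LAYER].register_forward_hook(patch_sink)

x_prompt = torch.randn(1, 5, 16)          # (batch, seq, hidden): prefill
out_prefill = blocks[SINK_LAYER](x_prompt)

x_step = torch.randn(1, 1, 16)            # a single decode step
out_decode = blocks[SINK_LAYER](x_step)
```

Note that copying a `(batch,)` slice across `(batch, seq-1)` positions needs the explicit `unsqueeze(-1)` under PyTorch broadcasting rules; Listing 1's indexing elides this detail, as pseudocode may.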