From Tokens to Words: On the Inner Lexicon of LLMs

Authors: Guy Kaplan, Matanel Oren, Yuval Reif, Roy Schwartz

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that this process primarily takes place within the early and middle layers of the model. We further demonstrate its robustness to arbitrary splits (e.g., cats to ca and ts), typos, and importantly to out-of-vocabulary words: when feeding the last-token internal representations of such words to the model as input, it can understand them as the complete word despite never seeing such representations as input during training." "Our results (Fig. 2b, blue) reveal a three-stage pattern in the model's representation of word and nonword token sequences. In the model's first few layers, representations from both groups are relatively indistinguishable and accuracy is close to chance level. Then, from layers 2 to 6, a clear distinction between the two groups emerges, until the representations are almost completely separate in middle layers, between layers 6 and 20. At this point, the probe achieves a stable, high accuracy, peaking at 89% on layer 13."
Researcher Affiliation | Academia | "Guy Kaplan, Matanel Oren, Yuval Reif, and Roy Schwartz, The Hebrew University of Jerusalem"
Pseudocode | No | The paper describes its methods conceptually through figures and text (e.g., "Figure 6: Our 3-step method to expand LLM vocabulary without updates to core model parameters."), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We release our code at https://github.com/schwartz-lab-NLP/Tokens2Words."
Open Datasets | Yes | "The word dataset consists of 10,000 distinct words sampled from the Gutenberg corpus (Gerlach & Font-Clos, 2018)." "We iterate over the WIKITEXT-103 dataset (Merity et al., 2017) and randomly split each single-token word longer than three characters into 2-5 sub-word tokens." "We perform a similar experiment using a dataset of morphologically plausible nonwords (ARC Nonword Database; Rastle et al., 2002)." "We apply our approach to LLAMA2-7B and experiment with three datasets: WIKITEXT-103 (Merity et al., 2017), abstracts of biomedical articles from PUBMED (Xiong et al., 2024), and the Arabic split of WIKI40B (Guo et al., 2020)."
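The random word-splitting step described above (breaking a single-token word into 2-5 sub-word chunks) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `random_subword_split` and the character-level cut points are assumptions, and the paper's actual procedure operates at the tokenizer level.

```python
import random

def random_subword_split(word, min_parts=2, max_parts=5, rng=None):
    """Split a word into a random number of contiguous chunks.

    Hypothetical sketch of the paper's splitting step; the paper
    applies this to single-token words longer than three characters,
    producing 2-5 sub-word tokens (e.g., "cats" -> "ca", "ts").
    """
    rng = rng or random.Random(0)
    # number of chunks, capped so every chunk is non-empty
    n_parts = rng.randint(min_parts, min(max_parts, len(word)))
    # choose n_parts - 1 distinct cut positions inside the word
    cuts = sorted(rng.sample(range(1, len(word)), n_parts - 1))
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

parts = random_subword_split("understand")
# the chunks always concatenate back to the original word
```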
Dataset Splits | Yes | "The training set consists of 80% of the dataset, and the remaining 20% are used for evaluation."
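The 80%/20% split quoted above can be reproduced with a simple shuffle-and-cut; the helper name `train_eval_split` and the fixed seed are illustrative assumptions, since the paper does not specify its splitting code.

```python
import random

def train_eval_split(items, train_frac=0.8, seed=0):
    """Shuffle items and split them into train/eval portions
    (the paper uses 80% train, 20% evaluation)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

train, evaluation = train_eval_split(range(10_000))
print(len(train), len(evaluation))  # 8000 2000
```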
Hardware Specification | No | Appendix H mentions "taking up to 30 minutes on a single GPU," but no specific GPU model, CPU, or other detailed hardware specifications are provided.
Software Dependencies | No | The paper names specific LLMs (Llama2-7B, Llama3-8B, Mistral-7B, and Yi-6B) and interpretability methods (logit lens and Patchscopes), but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We apply a k-nearest neighbors (k-NN) probing classifier (k = 4, using Euclidean distance) on the hidden states of the last tokens of both words and nonwords, for each layer of the Llama2-7B model." "We use a sequence length of 512 and train on 10,000 sequences, taking up to 30 minutes on a single GPU."
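The per-layer probe quoted above can be sketched as a plain k-NN classifier with k = 4 and Euclidean distance. This is a toy, dependency-free sketch under stated assumptions: the function names are hypothetical, the hidden-state vectors are assumed to have already been extracted from the model (one vector per last token, per layer), and the paper's actual probe may use a library implementation instead.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=4):
    """Classify x by majority vote among its k nearest training
    vectors under Euclidean distance (k = 4, as in the paper)."""
    dists = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def layer_probe_accuracy(train_X, train_y, test_X, test_y, k=4):
    """Accuracy of the k-NN probe on one layer's hidden states,
    e.g. for separating word vs. nonword last-token states."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)
```

Running this probe layer by layer and plotting the accuracy is what produces the three-stage pattern the paper reports (near-chance in early layers, rising through layers 2-6, peaking at 89% around layer 13).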