Lines of Thought in Large Language Models
Authors: Raphaël Sarfati, Toni Liu, Nicolas Boullé, Christopher Earls
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate which large-scale, ensemble properties can be inferred experimentally without concern for the microscopic details. Specifically, we are interested in the trajectories, or lines of thought (LoT), that embedded tokens realize in the latent space when passing through successive transformer layers (Aubry et al., 2024). By splitting a large input text into N-token sequences, we study LoT ensemble properties to shed light on the internal, average processes that characterize transformer transport. The results presented in Fig. 5 show that the simulated ensembles closely reproduce the ground truth of true trajectory distributions. |
| Researcher Affiliation | Academia | Raphaël Sarfati, School of Civil and Environmental Engineering, Cornell University, USA, EMAIL; Toni J.B. Liu, Department of Physics, Cornell University, USA, EMAIL; Nicolas Boullé, Department of Mathematics, Imperial College London, UK, EMAIL; Christopher J. Earls, Center for Applied Mathematics, School of Civil and Environmental Engineering, Cornell University, USA, EMAIL |
| Pseudocode | Yes | Algorithm 1 Trajectory generation in transformer-based model |
| Open Source Code | Yes | Code for trajectory generation, visualization, and analysis is available on GitHub at https://github.com/rapsar/lines-of-thought. |
| Open Datasets | Yes | The main corpus in this study comes from Henry David Thoreau's Walden, obtained from the Gutenberg Project (Project Gutenberg, 2024). |
| Dataset Splits | Yes | We generate inputs by tokenizing (Wolf et al., 2020) a large text and then chopping it into "pseudo-sentences", i.e., chunks of a fixed number of tokens Nk (see Algorithm 1). Unless otherwise noted, Nk = 50. These non-overlapping chunks are consistent in terms of token cardinality, and possess the structure of language, but have various meanings and endings (see Appendix A.1). The main corpus in this study comes from Henry David Thoreau's Walden... We typically use a set of Ns ≈ 3000–14000 pseudo-sentences. |
| Hardware Specification | No | The paper mentions various LLM models (GPT-2 medium, Llama 2 7B, Mistral 7B, Llama 3.2) but does not specify the hardware used to run or analyze these models (e.g., specific GPU or CPU models, memory, or cloud resources). |
| Software Dependencies | No | The paper mentions using specific LLMs like GPT-2, Llama 2, Mistral, and Llama 3.2, and references tokenizing with Wolf et al., 2020 (Hugging Face's transformers), but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for the analysis. |
| Experiment Setup | Yes | Language models. We rely primarily on the 355M-parameter ("medium") version of the GPT-2 model (Radford et al., 2019)... We later extend our analysis to the Llama 2 7B (Touvron et al., 2023), Mistral 7B v0.1 (Jiang et al., 2023), and small Llama 3.2 models (1B and 3B) (Meta AI, 2024). Input ensembles. We generate inputs by tokenizing (Wolf et al., 2020) a large text and then chopping it into "pseudo-sentences", i.e., chunks of a fixed number of tokens Nk (see Algorithm 1). Unless otherwise noted, Nk = 50. The main corpus in this study comes from Henry David Thoreau's Walden... We typically use a set of Ns ≈ 3000–14000 pseudo-sentences. Trajectory collection. We form trajectories by collecting the successive vector outputs, within the latent space, after each transformer layer (hidden_states). |
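
The pseudo-sentence chunking and hidden-state collection described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code (which is on GitHub): the chunking mirrors Algorithm 1's fixed-Nk, non-overlapping splits, while `collect_trajectories` assumes a Hugging Face `transformers` causal LM (e.g. GPT-2 medium) whose `output_hidden_states=True` option exposes per-layer activations; the function and variable names here are illustrative.

```python
def chunk_into_pseudo_sentences(token_ids, n_k=50):
    """Split a token-id sequence into non-overlapping chunks of exactly n_k tokens.

    Mirrors the paper's "pseudo-sentences": fixed token cardinality,
    with any trailing remainder shorter than n_k dropped.
    """
    n_chunks = len(token_ids) // n_k
    return [token_ids[i * n_k:(i + 1) * n_k] for i in range(n_chunks)]


def collect_trajectories(model, tokenizer, text, n_k=50):
    """Collect per-layer hidden states (the "lines of thought") for each pseudo-sentence.

    Sketch only: assumes a Hugging Face transformers model/tokenizer pair.
    """
    import torch  # imported lazily so the chunking helper stays dependency-free

    token_ids = tokenizer.encode(text)
    trajectories = []
    for chunk in chunk_into_pseudo_sentences(token_ids, n_k):
        input_ids = torch.tensor([chunk])
        with torch.no_grad():
            out = model(input_ids, output_hidden_states=True)
        # out.hidden_states is a tuple of (1 + num_layers) tensors, each of
        # shape (1, n_k, hidden_dim). Tracking one token's embedding after
        # each successive layer traces its trajectory through latent space.
        trajectories.append([h[0, -1, :] for h in out.hidden_states])
    return trajectories
```

With Nk = 50 and a Walden-sized corpus, the chunking step alone yields the Ns ≈ 3000–14000 pseudo-sentence ensembles the paper reports; each trajectory is then one point per layer in the model's latent space.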