Forking Paths in Neural Text Generation

Authors: Eric Bigelow, Ari Holtzman, Hidenori Tanaka, Tomer Ullman

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To test this empirically, we develop a novel approach to representing uncertainty dynamics across individual tokens of text generation, and apply statistical models to test our hypothesis. Our approach is highly flexible: it can be applied to any dataset and any LLM, without fine-tuning or accessing model weights. We use our method to analyze LLM responses on 7 different tasks across 4 domains, spanning a wide range of typical use cases. We find many examples of forking tokens, including surprising ones such as punctuation marks, suggesting that LLMs are often just a single token away from saying something very different.
Researcher Affiliation Collaboration Eric Bigelow1,2,3, Ari Holtzman4, Hidenori Tanaka2,3*, and Tomer Ullman1,2*. 1Harvard University, Department of Psychology; 2Harvard University, Center for Brain Science; 3NTT Research, Physics & Informatics Lab; 4University of Chicago, Department of Computer Science
Pseudocode No The paper describes a multi-stage sampling pipeline in Section 2.1 and visually in Figure 2, but it does so using descriptive text and flowcharts rather than structured pseudocode or an algorithm block with line numbers or code-like formatting.
Open Source Code Yes REPRODUCIBILITY STATEMENT All code and data used for this project is available at https://github.com/ebigelow/forking-paths.
Open Datasets Yes Coin Flip (Wei et al., 2022) is a very simple symbolic reasoning task... Last Letter (Wei et al., 2022) is a more complex symbolic reasoning task... AQuA (Ling et al., 2017) and GSM8k (Cobbe et al., 2021) test mathematical reasoning... MMLU (Hendrycks et al., 2020) is a complex question answering dataset... HotpotQA (Yang et al., 2018) is a complex question answering dataset... For our story generation task, we use the Story Cloze (Mostafazadeh et al., 2017) dataset...
Dataset Splits No The paper describes how subsets of data points were selected from existing datasets for analysis (e.g., 'analyzed a subset of 30 data points for each task', 'For GSM8k and MMLU, we used tinyBenchmarks (Polo et al., 2024), which are a subset of 100 examples', 'for HotpotQA we excluded questions and answers with string length outside the [.1, .9] quantile range'), but it does not specify explicit training/validation/test splits. Since the analysis operates on sampled LLM outputs rather than on model training, no such splits are needed for reproduction.
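The quantile-based filtering quoted above (excluding HotpotQA items whose string length falls outside the [.1, .9] quantile range) could be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name and toy data are assumptions.

```python
import numpy as np

def filter_by_length_quantiles(texts, lo=0.1, hi=0.9):
    """Keep only texts whose length lies within the [lo, hi] length quantiles."""
    lengths = np.array([len(t) for t in texts])
    lo_len, hi_len = np.quantile(lengths, [lo, hi])
    return [t for t, n in zip(texts, lengths) if lo_len <= n <= hi_len]

# Toy example: the extreme outliers (length 1 and length 500) are dropped.
texts = ["a" * n for n in [1, 50, 55, 60, 65, 70, 75, 80, 85, 500]]
kept = filter_by_length_quantiles(texts)
```

The same filter would be applied to both questions and answers, per the quoted text.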
Hardware Specification No The paper mentions evaluating OpenAI's GPT-3.5 and Google's Gemini Flash, which are accessed via API. It also states 'Compute resources used in this work were funded by a grant from the Hodgson Innovation Fund at Harvard's Department of Psychology.' However, it does not provide specific hardware details (like GPU/CPU models, memory) used by the authors for running their own analysis code.
Software Dependencies No The paper mentions using 'simple answer cleansing functions written in Python' and an 'open-source implementation of Bayesian multiple CPD, the Bayesian Estimator for Abrupt changes in Seasonality and Trends (BEAST)' which is 'available as an R package'. However, specific version numbers for Python, the BEAST R package, or any other critical software dependencies are not provided.
Experiment Setup Yes For our Forking Paths Analysis, we sample the k = 10 most probable alternate tokens xt = w such that the probability of each token w is at least 5%. When sampling batches at each token index and alternate token, we collect S = 30 text samples. For (1), we collect N = 300 full text responses x from the starting index t = 0 and aggregate outcome responses R into a histogram. We used a zero-shot CoT prompt as in Kojima et al. (2022) for the first 6 tasks. For our CPD and survival analysis models, we used L2 distance. We also tested L1 distance and KL divergence, but found that results with d = L2 most reliably corresponded to qualitative judgments of change points in ot and ot,w. To address this, we manually tuned the noise hyper-parameter α and slightly perturbed yt with Gaussian noise of variance 0.03.
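The sampling scheme this row describes (fork at each token index on every sufficiently probable alternate token, regenerate S completions, and compare outcome histograms with L2 distance) can be sketched roughly as below. The `sample_outcomes` callable and the toy probabilities are illustrative stand-ins for a real LLM API; this is a minimal sketch of the setup as quoted, not the authors' implementation.

```python
import numpy as np

K, P_MIN, S = 10, 0.05, 30  # top-k alternates, min token prob, samples per fork

def forking_outcomes(token_probs, sample_outcomes):
    """Return normalized outcome histograms o[(t, w)] for every token index t
    and alternate token w with probability >= P_MIN among the top K.

    token_probs: list of dicts {token: prob}, one per generation step.
    sample_outcomes: callable (t, w, n) -> n outcome labels, standing in for
    'fork the response at index t with token w, regenerate, extract answers'.
    """
    outcomes = {}
    for t, probs in enumerate(token_probs):
        top_k = sorted(probs, key=probs.get, reverse=True)[:K]
        for w in top_k:
            if probs[w] < P_MIN:
                continue
            hist = {}
            for r in sample_outcomes(t, w, S):
                hist[r] = hist.get(r, 0) + 1 / S  # normalized histogram
            outcomes[(t, w)] = hist
    return outcomes

def l2_distance(h1, h2):
    """L2 distance between two outcome histograms (the paper's choice of d)."""
    keys = set(h1) | set(h2)
    return np.sqrt(sum((h1.get(k, 0) - h2.get(k, 0)) ** 2 for k in keys))

# Toy usage: pretend the final answer always equals the forked token.
toy_probs = [{"yes": 0.6, "no": 0.4}, {"yes": 0.9, "no": 0.05, "maybe": 0.05}]
o = forking_outcomes(toy_probs, lambda t, w, n: [w] * n)
```

In the actual pipeline the outcome distributions over t would then be fed to the CPD and survival models, after adding the Gaussian perturbation (variance 0.03) mentioned above.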