Forking Paths in Neural Text Generation

Authors: Eric Bigelow, Ari Holtzman, Hidenori Tanaka, Tomer Ullman

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To test this empirically, we develop a novel approach to representing uncertainty dynamics across individual tokens of text generation, and apply statistical models to test our hypothesis. Our approach is highly flexible: it can be applied to any dataset and any LLM, without fine-tuning or accessing model weights. We use our method to analyze LLM responses on 7 different tasks across 4 domains, spanning a wide range of typical use cases. We find many examples of forking tokens, including surprising ones such as punctuation marks, suggesting that LLMs are often just a single token away from saying something very different.
Researcher Affiliation Collaboration Eric Bigelow1,2,3, Ari Holtzman4, Hidenori Tanaka2,3*, and Tomer Ullman1,2*. 1Harvard University, Department of Psychology; 2Harvard University, Center for Brain Science; 3NTT Research, Physics & Informatics Lab; 4University of Chicago, Department of Computer Science
Pseudocode No The paper describes a multi-stage sampling pipeline in Section 2.1 and visually in Figure 2, but it does so using descriptive text and flowcharts rather than structured pseudocode or an algorithm block with line numbers or code-like formatting.
Open Source Code Yes REPRODUCIBILITY STATEMENT All code and data used for this project is available at https://github.com/ebigelow/forking-paths.
Open Datasets Yes Coin Flip (Wei et al., 2022) is a very simple symbolic reasoning task... Last Letter (Wei et al., 2022) is a more complex symbolic reasoning task... AQuA (Ling et al., 2017) and GSM8k (Cobbe et al., 2021) test mathematical reasoning... MMLU (Hendrycks et al., 2020) is a complex question answering dataset... HotpotQA (Yang et al., 2018) is a complex question answering dataset... For our story generation task, we use the Story Cloze (Mostafazadeh et al., 2017) dataset...
Dataset Splits No The paper describes how subsets of data points were selected from existing datasets for analysis (e.g., 'analyzed a subset of 30 data points for each task', 'For GSM8k and MMLU, we used tinyBenchmarks (Polo et al., 2024), which are a subset of 100 examples', 'for HotpotQA we excluded questions and answers with string length outside the [.1, .9] quantile range'), but it does not specify explicit training/validation/test splits. Since the analysis operates on sampled LLM outputs rather than on model training, no such splits are needed for reproduction.
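The quantile-based filtering quoted above (excluding HotpotQA items whose string length falls outside the [.1, .9] quantile range) could be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name and toy data are assumptions.

```python
import numpy as np

def filter_by_length_quantiles(texts, lo=0.1, hi=0.9):
    """Keep only texts whose length lies within the [lo, hi] length quantiles."""
    lengths = np.array([len(t) for t in texts])
    lo_len, hi_len = np.quantile(lengths, [lo, hi])
    return [t for t, n in zip(texts, lengths) if lo_len <= n <= hi_len]

# Toy example: the extreme outliers (length 1 and length 500) are dropped.
texts = ["a" * n for n in [1, 50, 55, 60, 65, 70, 75, 80, 85, 500]]
kept = filter_by_length_quantiles(texts)
```

The same filter would be applied to both questions and answers, per the quoted text.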
Hardware Specification No The paper mentions evaluating OpenAI's GPT-3.5 and Google's Gemini Flash, which are accessed via API. It also states 'Compute resources used in this work were funded by a grant from the Hodgson Innovation Fund at Harvard's Department of Psychology.' However, it does not provide specific hardware details (like GPU/CPU models, memory) used by the authors for running their own analysis code.
Software Dependencies No The paper mentions using 'simple answer cleansing functions written in Python' and an 'open-source implementation of Bayesian multiple CPD, the Bayesian Estimator for Abrupt changes in Seasonality and Trends (BEAST)' which is 'available as an R package'. However, specific version numbers for Python, the BEAST R package, or any other critical software dependencies are not provided.
Experiment Setup Yes For our Forking Paths Analysis, we sample the k = 10 most probable alternate tokens xt = w such that the probability of each token w is at least 5%. When sampling batches at each token index and alternate token, we collect S = 30 text samples. For (1), we collect N = 300 full text responses x from the starting index t = 0 and aggregate outcome responses R into a histogram. We used a zero-shot CoT prompt as in Kojima et al. (2022) for the first 6 tasks. For our CPD and survival analysis models, we used L2 distance. We also tested L1 distance and KL divergence, but found that results with d = L2 most reliably corresponded to qualitative judgments of change points in ot and ot,w. To address this, we manually tuned the noise hyper-parameter α and slightly perturbed yt with Gaussian noise of variance 0.03.
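The sampling scheme this row describes (fork at each token index on every sufficiently probable alternate token, regenerate S completions, and compare outcome histograms with L2 distance) can be sketched roughly as below. The `sample_outcomes` callable and the toy probabilities are illustrative stand-ins for a real LLM API; this is a minimal sketch of the setup as quoted, not the authors' implementation.

```python
import numpy as np

K, P_MIN, S = 10, 0.05, 30  # top-k alternates, min token prob, samples per fork

def forking_outcomes(token_probs, sample_outcomes):
    """Return normalized outcome histograms o[(t, w)] for every token index t
    and alternate token w with probability >= P_MIN among the top K.

    token_probs: list of dicts {token: prob}, one per generation step.
    sample_outcomes: callable (t, w, n) -> n outcome labels, standing in for
    'fork the response at index t with token w, regenerate, extract answers'.
    """
    outcomes = {}
    for t, probs in enumerate(token_probs):
        top_k = sorted(probs, key=probs.get, reverse=True)[:K]
        for w in top_k:
            if probs[w] < P_MIN:
                continue
            hist = {}
            for r in sample_outcomes(t, w, S):
                hist[r] = hist.get(r, 0) + 1 / S  # normalized histogram
            outcomes[(t, w)] = hist
    return outcomes

def l2_distance(h1, h2):
    """L2 distance between two outcome histograms (the paper's choice of d)."""
    keys = set(h1) | set(h2)
    return np.sqrt(sum((h1.get(k, 0) - h2.get(k, 0)) ** 2 for k in keys))

# Toy usage: pretend the final answer always equals the forked token.
toy_probs = [{"yes": 0.6, "no": 0.4}, {"yes": 0.9, "no": 0.05, "maybe": 0.05}]
o = forking_outcomes(toy_probs, lambda t, w, n: [w] * n)
```

In the actual pipeline the outcome distributions over t would then be fed to the CPD and survival models, after adding the Gaussian perturbation (variance 0.03) mentioned above.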