Estimating the Probabilities of Rare Outputs in Language Models
Authors: Gabriel Wu, Jacob Hilton
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 4 and 5, we describe the models and input distributions on which we test our methods and convey our experimental findings. We apply our methods on three models: a 1-layer, a 2-layer, and a 4-layer transformer from Nanda & Bloom (2022). For each of the 8 distributions (listed in Table 1) and for each model, we generate ground-truth token probabilities by running forward passes on 2^32 random samples. We then select a random set of 256 tokens among those with ground-truth probabilities between 10^-9 and 10^-5, and we test all of our methods on these tokens. We give each method a computational budget of 2^16 model calls (see details in Appendix F). We measure the quality of the method with a loss function inspired by the Itakura-Saito divergence (Itakura & Saito, 1968). |
| Researcher Affiliation | Industry | Gabriel Wu, Jacob Hilton (Alignment Research Center). Correspondence to: EMAIL |
| Pseudocode | Yes | See Algorithm 1 for pseudocode. See Algorithm 2 for pseudocode. See Algorithm 3 for pseudocode. The exact procedure is described in Algorithm 4. |
| Open Source Code | Yes | Our code is available at https://github.com/alignment-research-center/low-probability-estimation. |
| Open Datasets | Yes | All models have a hidden dimension of d = 512, a vocabulary size of \|V\| = 48262, GELU non-linearities (Hendrycks & Gimpel, 2023), and were trained on the C4 dataset (Raffel et al., 2023) and Code Parrot (Tunstall et al., 2022). |
| Dataset Splits | Yes | We then select a random set of 256 tokens among those with ground-truth probabilities between 10^-9 and 10^-5, and we test all of our methods on these tokens. To prevent overfitting, the methods were only run on the first four distributions during development, and they were finalized before testing on the last four distributions. The results were qualitatively the same on both halves of the split. |
| Hardware Specification | No | The paper mentions applying methods on 'a 1-layer, a 2-layer, and a 4-layer transformer' and provides 'a computational budget of 2^16 model calls'. However, it does not specify any particular GPU or CPU models, memory details, or other hardware components used for these computations. |
| Software Dependencies | No | The paper references 'transformer from Nanda & Bloom (2022)' (TransformerLens) and 'GELU non-linearities (Hendrycks & Gimpel, 2023)'. While these indicate software or architectural components, no specific version numbers for TransformerLens, Python, PyTorch, or any other libraries used for implementation are provided. |
| Experiment Setup | Yes | We give each method a computational budget of 2^16 model calls (see details in Appendix F). Independent Token Gradient Importance Sampling uses 2^8 batches of size 2^8, for a total of 2^16 samples. Metropolis Hastings Importance Sampling uses 2^10 + 2^11 batches of size 2^5, for a total of 1.5 * 2^16 samples (the batch size indicates the number of independent random walks the method simulates). The first 2^10 batches are used as a burn-in period for the random walk and are discarded, so only 2^16 samples are actually used to calculate the estimate. Quadratic Logit Decomposition uses n = 2^16 samples of the pre-unembed activation v. Gaussian Logit Difference uses 2^16 samples of the logit difference to estimate µ and σ, the mean and standard deviation of the difference between the target logit and the maximum logit. To tune T, we sweep over 9 different temperatures from 0.2 to 5, uniformly spaced in log-space. We choose the value of T that achieves the lowest loss on 100 randomly chosen tokens with ground-truth probabilities in the range [10^-5, 10^-3] to prevent overfitting. We tune separate temperatures for each distribution, model size, and importance sampling method, shown in Table 5. |
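The evaluation loss quoted above is "inspired by the Itakura-Saito divergence." As a point of reference, the standard Itakura-Saito divergence between a true probability p and an estimate q can be sketched as follows; the paper's exact loss variant may differ, and the function name and epsilon guard here are illustrative, not taken from the authors' code.

```python
import math

def itakura_saito_loss(p_est, p_true, eps=1e-300):
    """Standard Itakura-Saito divergence d(p, q) = p/q - log(p/q) - 1.

    It is zero exactly when p_est == p_true, and it penalizes
    underestimates of a small true probability far more harshly than
    overestimates, which matters when p_true is on the order of 1e-9.
    """
    ratio = p_true / max(p_est, eps)
    return ratio - math.log(ratio) - 1.0
```

For example, underestimating a 1e-7 ground-truth probability by 10x incurs a much larger loss than underestimating it by 2x.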
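Both importance-sampling methods in the setup row share the same textbook skeleton: sample inputs from a proposal distribution q that makes the rare target output more likely, then reweight each sample by p(x)/q(x) to recover an unbiased estimate under the nominal input distribution p. A generic sketch, with all function names illustrative rather than drawn from the paper's code:

```python
import random

def importance_sampling_estimate(indicator, sample_q, weight, n, seed=0):
    """Unbiased estimate of P_p[indicator(x) == 1] via a proposal q.

    indicator(x): 1 if x triggers the rare event, else 0.
    sample_q(rng): draws one sample from the proposal q.
    weight(x):     the likelihood ratio p(x) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += weight(x) * indicator(x)
    return total / n

# Toy usage: p is uniform over {0, ..., 999}, the rare event is x == 0
# (true probability 1/1000), and q oversamples by drawing uniformly
# from {0, ..., 9}, so p(x)/q(x) = (1/1000) / (1/10) = 0.01.
estimate = importance_sampling_estimate(
    indicator=lambda x: 1 if x == 0 else 0,
    sample_q=lambda rng: rng.randrange(10),
    weight=lambda x: 0.01,
    n=20000,
)
```

The paper's methods differ in how q is built (token-gradient proposals vs. Metropolis-Hastings random walks), but the reweighting step is the same.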
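The temperature tuning described above sweeps 9 values from 0.2 to 5, uniformly spaced in log-space. That grid can be reproduced with a few lines of standard-library Python (the helper name is ours):

```python
import math

def temperature_grid(t_min=0.2, t_max=5.0, n=9):
    """n temperatures uniformly spaced in log-space between t_min and t_max."""
    step = (math.log(t_max) - math.log(t_min)) / (n - 1)
    return [math.exp(math.log(t_min) + i * step) for i in range(n)]
```

Uniform spacing in log-space means adjacent temperatures differ by a constant multiplicative factor, here (5/0.2)^(1/8) ≈ 1.5.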
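The Gaussian Logit Difference baseline fits µ and σ to samples of (target logit - maximum logit) and reads a probability off the fitted normal: the target token is output when that difference exceeds zero. A minimal sketch of this naive tail estimate, assuming a plain Gaussian fit; the paper's actual estimator may apply further corrections beyond what is shown here.

```python
from statistics import NormalDist, mean, stdev

def gaussian_logit_difference(diffs):
    """Estimate P(target token wins) from samples of
    (target logit - max logit), assuming the difference is Gaussian.

    The rare-output probability is the mass of the fitted normal
    above zero; for a rare token, mean(diffs) is strongly negative.
    """
    mu, sigma = mean(diffs), stdev(diffs)
    return 1.0 - NormalDist(mu, sigma).cdf(0.0)
```

Because the true logit-difference distribution can have heavier tails than a Gaussian, this estimate serves as a cheap baseline rather than a calibrated probability.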