Estimating the Probabilities of Rare Outputs in Language Models
Authors: Gabriel Wu, Jacob Hilton
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 4 and 5, we describe the models and input distributions on which we test our methods and convey our experimental findings. We apply our methods on three models: a 1-layer, a 2-layer, and a 4-layer transformer from Nanda & Bloom (2022). For each of the 8 distributions (listed in Table 1) and for each model, we generate ground-truth token probabilities by running forward passes on 2^32 random samples. We then select a random set of 256 tokens among those with ground-truth probabilities between 10^-9 and 10^-5, and we test all of our methods on these tokens. We give each method a computational budget of 2^16 model calls (see details in Appendix F). We measure the quality of the method with a loss function inspired by the Itakura-Saito divergence (Itakura & Saito, 1968). |
| Researcher Affiliation | Industry | Gabriel Wu, Jacob Hilton (Alignment Research Center). Correspondence to: EMAIL |
| Pseudocode | Yes | See Algorithm 1 for pseudocode. See Algorithm 2 for pseudocode. See Algorithm 3 for pseudocode. The exact procedure is described in Algorithm 4. |
| Open Source Code | Yes | Our code is available at https://github.com/alignment-research-center/low-probability-estimation. |
| Open Datasets | Yes | All models have a hidden dimension of d = 512, a vocabulary size of \|V\| = 48262, GELU non-linearities (Hendrycks & Gimpel, 2023), and were trained on the C4 dataset (Raffel et al., 2023) and Code Parrot (Tunstall et al., 2022). |
| Dataset Splits | Yes | We then select a random set of 256 tokens among those with ground-truth probabilities between 10^-9 and 10^-5, and we test all of our methods on these tokens. To prevent overfitting, the methods were only run on the first four distributions during development, and they were finalized before testing on the last four distributions. The results were qualitatively the same on both halves of the split. |
| Hardware Specification | No | The paper mentions applying methods on 'a 1-layer, a 2-layer, and a 4-layer transformer' and provides 'a computational budget of 2^16 model calls'. However, it does not specify any particular GPU or CPU models, memory details, or other hardware components used for these computations. |
| Software Dependencies | No | The paper references 'transformer from Nanda & Bloom (2022)' (TransformerLens) and 'GELU non-linearities (Hendrycks & Gimpel, 2023)'. While these indicate software or architectural components, no specific version numbers for TransformerLens, Python, PyTorch, or any other libraries used for implementation are provided. |
| Experiment Setup | Yes | We give each method a computational budget of 2^16 model calls (see details in Appendix F). Independent Token Gradient Importance Sampling uses 2^8 batches of size 2^8, for a total of 2^16 samples. Metropolis Hastings Importance Sampling uses 2^10 + 2^11 batches of size 2^5, for a total of 1.5 * 2^16 samples (the batch size indicates the number of independent random walks the method simulates). The first 2^10 batches are used as a burn-in period for the random walk and are discarded, so only 2^16 samples are actually used to calculate the estimate. Quadratic Logit Decomposition uses n = 2^16 samples of the pre-unembed activation v. Gaussian Logit Difference uses 2^16 samples of the logit difference to estimate µ and σ, the mean and standard deviation of the difference between the target logit and the maximum logit. To tune T, we sweep over 9 different temperatures from 0.2 to 5, uniformly spaced in log-space. We choose the value of T that achieves the lowest loss on 100 randomly chosen tokens with ground-truth probabilities in the range [10^-5, 10^-3] to prevent overfitting. We tune separate temperatures for each distribution, model size, and importance sampling method, shown in Table 5. |
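The evaluation loss quoted above is "inspired by the Itakura-Saito divergence." As a point of reference, the standard Itakura-Saito divergence between a true probability p and an estimate q can be sketched as follows; the paper's exact loss variant may differ, and the function name and epsilon guard here are illustrative, not taken from the authors' code.

```python
import math

def itakura_saito_loss(p_est, p_true, eps=1e-300):
    """Standard Itakura-Saito divergence d(p, q) = p/q - log(p/q) - 1.

    It is zero exactly when p_est == p_true, and it penalizes
    underestimates of a small true probability far more harshly than
    overestimates, which matters when p_true is on the order of 1e-9.
    """
    ratio = p_true / max(p_est, eps)
    return ratio - math.log(ratio) - 1.0
```

For example, underestimating a 1e-7 ground-truth probability by 10x incurs a much larger loss than underestimating it by 2x.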
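Both importance-sampling methods in the setup row share the same textbook skeleton: sample inputs from a proposal distribution q that makes the rare target output more likely, then reweight each sample by p(x)/q(x) to recover an unbiased estimate under the nominal input distribution p. A generic sketch, with all function names illustrative rather than drawn from the paper's code:

```python
import random

def importance_sampling_estimate(indicator, sample_q, weight, n, seed=0):
    """Unbiased estimate of P_p[indicator(x) == 1] via a proposal q.

    indicator(x): 1 if x triggers the rare event, else 0.
    sample_q(rng): draws one sample from the proposal q.
    weight(x):     the likelihood ratio p(x) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += weight(x) * indicator(x)
    return total / n

# Toy usage: p is uniform over {0, ..., 999}, the rare event is x == 0
# (true probability 1/1000), and q oversamples by drawing uniformly
# from {0, ..., 9}, so p(x)/q(x) = (1/1000) / (1/10) = 0.01.
estimate = importance_sampling_estimate(
    indicator=lambda x: 1 if x == 0 else 0,
    sample_q=lambda rng: rng.randrange(10),
    weight=lambda x: 0.01,
    n=20000,
)
```

The paper's methods differ in how q is built (token-gradient proposals vs. Metropolis-Hastings random walks), but the reweighting step is the same.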
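The temperature tuning described above sweeps 9 values from 0.2 to 5, uniformly spaced in log-space. That grid can be reproduced with a few lines of standard-library Python (the helper name is ours):

```python
import math

def temperature_grid(t_min=0.2, t_max=5.0, n=9):
    """n temperatures uniformly spaced in log-space between t_min and t_max."""
    step = (math.log(t_max) - math.log(t_min)) / (n - 1)
    return [math.exp(math.log(t_min) + i * step) for i in range(n)]
```

Uniform spacing in log-space means adjacent temperatures differ by a constant multiplicative factor, here (5/0.2)^(1/8) ≈ 1.5.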
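The Gaussian Logit Difference baseline fits µ and σ to samples of (target logit - maximum logit) and reads a probability off the fitted normal: the target token is output when that difference exceeds zero. A minimal sketch of this naive tail estimate, assuming a plain Gaussian fit; the paper's actual estimator may apply further corrections beyond what is shown here.

```python
from statistics import NormalDist, mean, stdev

def gaussian_logit_difference(diffs):
    """Estimate P(target token wins) from samples of
    (target logit - max logit), assuming the difference is Gaussian.

    The rare-output probability is the mass of the fitted normal
    above zero; for a rare token, mean(diffs) is strongly negative.
    """
    mu, sigma = mean(diffs), stdev(diffs)
    return 1.0 - NormalDist(mu, sigma).cdf(0.0)
```

Because the true logit-difference distribution can have heavier tails than a Gaussian, this estimate serves as a cheap baseline rather than a calibrated probability.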