How Do Large Language Monkeys Get Their Power (Laws)?
Authors: Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy-tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, 2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and to the development of scaling-predictable evaluations of (multimodal) language models. |
| Researcher Affiliation | Collaboration | ¹Stanford Computer Science, ²Stanford Statistics, ³Speechmatics, ⁴ML Alignment & Theory Scholars, ⁵University College London, ⁶Anthropic. Correspondence to: Rylan Schaeffer <EMAIL>, Sanmi Koyejo <EMAIL>. |
| Pseudocode | Yes | def estimate_success_rate_at_k_per_problem(n: int, c: int, k: int) -> float: """ :param n: number of total attempts on this problem. :param c: number of correct attempts on this problem. :param k: k in pass_i@k. """ if n - c < k: return 1.0 return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)) Figure 8: A numerically stable unbiased estimator of pass_i@k, introduced by Chen et al. (2021). |
| Open Source Code | No | The paper includes a Python snippet for an estimator from Chen et al. (2021) in Figure 8. However, this snippet documents a method the authors used; the paper contains no explicit statement that their own implementation of the full methodology is open-source or publicly available. |
| Open Datasets | Yes | We specifically used Brown et al. (2024)'s data of the Pythia language model family (Biderman et al., 2023) solving 128 mathematical problems from MATH (Hendrycks et al., 2021), as well as Hughes et al. (2024)'s data from jailbreaking frontier AI systems Claude, GPT-4 (OpenAI et al., 2024), Gemini (Team et al., 2024a;b) and Llama 3 8B Instruction Tuned (IT) (Grattafiori et al., 2024) on 159 prompts from HarmBench (Mazeika et al., 2024). |
| Dataset Splits | No | The paper mentions using 128 mathematical problems from MATH and 159 prompts from HarmBench for analysis. It also discusses 'subsampling the number of problems and the number of samples per problem' for backtesting on synthetic data. However, it does not specify explicit train/validation/test splits of these datasets in a way that allows the data partitioning of their own experiments to be reproduced. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments or analyses. It discusses various language models and benchmarks but not the underlying computational infrastructure used by the authors for their work. |
| Software Dependencies | Yes | One can perform a change of variable p := cz, but simplifying yields sums of hypergeometric functions that add little conceptual clarity, so we resort to numerical integration using Python's mpmath library (mpmath development team, 2023). |
| Experiment Setup | Yes | To test this understanding, we examined whether the data of Brown et al. (2024) and Hughes et al. (2024) had per-problem single-attempt success rate distributions that matched one of these simple distributions (Fig. 4). We found that the distributions could indeed be well fit by a 3-parameter Kumaraswamy(α, β, a = 0, c) distribution with scale parameter c (Fig. 4, black dashed lines); we found the scale parameter was critical to obtain good fits because the standard 2-parameter Kumaraswamy distribution is supported on (0, 1), whereas most single-attempt success distributions have a smaller maximum such as 0.01 or 0.1. To empirically test this claim, we compared the standard least-squares regression estimator (in log-log space) (Hoffmann et al., 2022; Caballero et al., 2022; Besiroglu et al., 2024b) against a distributional estimator. To motivate our distributional estimator, we first need to explain a key obstacle and how the distributional estimator overcomes it. The obstacle is that there are problems or prompts whose single-attempt success probabilities pass_i@1 lie in (0, 1/Number of Samples), such that, due to finite sampling, we lack the resolution to measure them. While we do not know the true single-attempt success probability for the problems that lie in this interval, we do know how many problems fall into this left-tail bucket, and we can fit a distribution's parameters such that the distribution's probability mass in the interval (0, 1/Number of Samples) matches the empirical fraction of problems in this tail bucket. Thus, our distributional estimator works by first selecting a distribution (e.g., a scaled 3-parameter Beta distribution), discretizing the distribution according to the sampling resolution 1/Number of Samples, and performing maximum likelihood estimation under the discretized distribution's probability mass function. |
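The estimator quoted under Pseudocode (Figure 8, from Chen et al., 2021) is flattened into a single table cell above; written out as a self-contained, runnable sketch (the `numpy` import was implicit in the original snippet), it is:

```python
import numpy as np

def estimate_success_rate_at_k_per_problem(n: int, c: int, k: int) -> float:
    """Numerically stable unbiased estimator of pass_i@k (Chen et al., 2021).

    :param n: number of total attempts on this problem.
    :param c: number of correct attempts on this problem.
    :param k: k in pass_i@k.
    """
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k
        # attempts must contain at least one success.
        return 1.0
    # 1 - C(n-c, k)/C(n, k), computed as a telescoping product of
    # factors (1 - k/i) to avoid overflow in the binomial coefficients.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For example, with n = 10 attempts, c = 5 correct, and k = 1, the product telescopes to 5/10, giving pass_i@1 = 0.5, and the estimate increases monotonically in k as expected.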
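As a concrete illustration of the numerical-integration route mentioned under Software Dependencies: the aggregate success rate at k attempts under a fitted per-problem distribution is E[1 − (1 − p)^k]. The sketch below evaluates this expectation with mpmath under an assumed scaled Kumaraswamy density; the function and parameter names are ours, not the paper's, and the change of variable p = c·z (the paper's p := cz) maps the integral onto the unit interval.

```python
import mpmath as mp

def aggregate_pass_at_k(k, alpha, beta, scale):
    """Aggregate success rate E[1 - (1 - p)^k] when per-problem pass_i@1
    follows a Kumaraswamy(alpha, beta) rescaled to support (0, scale).

    Illustrative sketch, not the authors' code. Substituting p = scale * z
    maps the integral onto [0, 1], where the standard Kumaraswamy pdf is
    alpha * beta * z^(alpha-1) * (1 - z^alpha)^(beta-1).
    """
    pdf = lambda z: alpha * beta * z**(alpha - 1) * (1 - z**alpha)**(beta - 1)
    integrand = lambda z: (1 - (1 - scale * z)**k) * pdf(z)
    return mp.quad(integrand, [0, 1])
```

Because the support is capped at `scale`, the aggregate rate increases with k but approaches 1 only in the limit, which is the behavior the heavy-tail argument in the abstract turns on.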
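The tail-bucket fitting procedure described under Experiment Setup can be sketched as a discretized negative log-likelihood. This is an illustrative reconstruction under assumed names, using a scaled Kumaraswamy family because its CDF is closed-form (the paper also mentions a scaled 3-parameter Beta); it is not the authors' implementation.

```python
import numpy as np

def kumaraswamy_cdf(x, a, b, scale):
    """CDF of a Kumaraswamy(a, b) rescaled to support (0, scale)."""
    z = np.clip(x / scale, 0.0, 1.0)
    return 1.0 - (1.0 - z**a) ** b

def discretized_nll(params, success_counts, n_samples):
    """Negative log-likelihood of per-problem success counts under a scaled
    Kumaraswamy discretized at the sampling resolution 1/n_samples.

    A problem with c observed successes is assigned the probability mass of
    the bucket [c/n, (c+1)/n); problems with zero observed successes share
    the left-tail bucket (0, 1/n), matching the paper's tail-bucket idea.
    """
    a, b, scale = params
    if a <= 0 or b <= 0 or not (0.0 < scale <= 1.0):
        return np.inf  # invalid parameters
    edges = np.arange(n_samples + 1) / n_samples        # bucket edges j/n
    cdf = kumaraswamy_cdf(edges, a, b, scale)
    bucket_probs = np.clip(np.diff(cdf), 1e-300, None)  # mass per bucket
    idx = np.minimum(success_counts, n_samples - 1)     # fold c = n into top bucket
    return -np.sum(np.log(bucket_probs[idx]))
```

Minimizing `discretized_nll` over (a, b, scale), e.g. with a coarse grid search or `scipy.optimize.minimize`, yields parameters whose mass in (0, 1/n_samples) matches the empirical fraction of never-solved problems, which is the maximum-likelihood step the quoted passage describes.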