Compute-Optimal LLMs Provably Generalize Better with Scale
Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, Zico Kolter, Andrew Gordon Wilson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained, quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter, and we evaluate the bounds with failure probability δ = 0.01. The results are shown in Figure 2 for the Chinchilla compute-optimal checkpoints within the Pythia model family. Figure 2 (center) shows that the loss variation decreases with model size as 0.27 + 8337·N^(−0.54), approximately 1/√N with a constant offset. Figure 2 (right) breaks down the individual contributions to the generalization bound, and Figure 2 (left) compares the bound value R_sq with the empirical risk R̂_h. |
| Researcher Affiliation | Collaboration | Marc Finzi (Carnegie Mellon University), Sanyam Kapoor (New York University), Diego Granziol (Pure Strength AI), Anming Gu (Boston University), Christopher De Sa (Cornell University), J. Zico Kolter (Carnegie Mellon University), Andrew Gordon Wilson (New York University) |
| Pseudocode | No | The paper describes mathematical derivations and empirical evaluations, but it does not include any explicit pseudocode blocks or algorithms formatted like code. |
| Open Source Code | No | To test our theory, we use the open source Pythia model family (Biderman et al., 2023) ranging from 70 million to 12 billion parameters. Unlike other open source LLMs, we have full access to both the Pythia model checkpoints from training and the Pile dataset they were trained on (Gao et al., 2020), which is required for our analysis. |
| Open Datasets | Yes | To test our theory, we use the open source Pythia model family (Biderman et al., 2023) ranging from 70 million to 12 billion parameters. Unlike other open source LLMs, we have full access to both the Pythia model checkpoints from training and the Pile dataset they were trained on (Gao et al., 2020), which is required for our analysis. |
| Dataset Splits | Yes | We estimate the risk and loss variation on an IID subsample of size 10^4 from the collection of token-context pairs in the training dataset, and bound the difference between the full training-set evaluation and the 10^4-sized subsample with a simple Hoeffding bound. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained and quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter... We utilize the CoLA (Potapczynski et al., 2023) library to compute the spectral approximation of the Hessian. |
| Experiment Setup | Yes | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained and quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter, and we evaluate the bounds with failure probability δ = 0.01. We compute Σ with K given by 1000 equally spaced points between [0, 1], excluding the endpoints. We estimate the risk and loss variation on an IID subsample from the collection of token-context pairs in the training dataset of size 10^4. |
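The Hoeffding bound mentioned in the Dataset Splits and Experiment Setup rows controls how far the subsample estimate of the risk can deviate from the full training-set evaluation. A minimal sketch of that deviation term, assuming a bounded per-token loss with a hypothetical range `value_range` (the paper's actual loss-range handling is not specified in the quoted excerpts):

```python
import math

def hoeffding_deviation(n: int, value_range: float, delta: float) -> float:
    """Two-sided Hoeffding bound: with probability at least 1 - delta,
    |subsample mean - full-set mean| <= value_range * sqrt(log(2/delta) / (2n))
    for n IID samples of a quantity bounded in an interval of width value_range."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Settings matching the quoted setup: a 10^4-sized IID subsample of
# token-context pairs, failure probability delta = 0.01.
# value_range = 1.0 is an illustrative assumption, not from the paper.
eps = hoeffding_deviation(n=10**4, value_range=1.0, delta=0.01)
print(f"deviation bound: {eps:.4f}")
```

At n = 10^4 and δ = 0.01 the deviation term is on the order of 10^-2 per unit of loss range, which is why a relatively small subsample suffices to certify the bound's empirical terms.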