Compute-Optimal LLMs Provably Generalize Better with Scale
Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, Zico Kolter, Andrew Gordon Wilson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained, quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter, and we evaluate the bounds with failure probability δ = 0.01. The results are shown in Figure 2 for the Chinchilla compute-optimal checkpoints within the Pythia model family. Figure 2 (center) shows that the loss variation decreases with model size as 0.27 + 8337·N^(−0.54), approximately 1/√N with a constant offset. Figure 2 (right) breaks down the individual contributions to the generalization bound, and Figure 2 (left) compares the bound value R_sq with the empirical risk R̂_h. |
| Researcher Affiliation | Collaboration | Marc Finzi (Carnegie Mellon University), Sanyam Kapoor (New York University), Diego Granziol (Pure Strength AI), Anming Gu (Boston University), Christopher De Sa (Cornell University), J. Zico Kolter (Carnegie Mellon University), Andrew Gordon Wilson (New York University) |
| Pseudocode | No | The paper describes mathematical derivations and empirical evaluations, but it does not include any explicit pseudocode blocks or algorithms formatted like code. |
| Open Source Code | No | To test our theory, we use the open source Pythia model family (Biderman et al., 2023) ranging from 70 million to 12 billion parameters. Unlike other open source LLMs, we have full access to both the Pythia model checkpoints from training and the Pile dataset they were trained on (Gao et al., 2020), which is required for our analysis. |
| Open Datasets | Yes | To test our theory, we use the open source Pythia model family (Biderman et al., 2023) ranging from 70 million to 12 billion parameters. Unlike other open source LLMs, we have full access to both the Pythia model checkpoints from training and the Pile dataset they were trained on (Gao et al., 2020), which is required for our analysis. |
| Dataset Splits | Yes | We estimate the risk and loss variation on an IID subsample of size 10^4 from the collection of token-context pairs in the training dataset, and bound the difference between the full training-set evaluation and the 10^4-sized subsample with a simple Hoeffding bound. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained and quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter... We utilize the CoLA (Potapczynski et al., 2023) library to compute the spectral approximation of the Hessian. |
| Experiment Setup | Yes | We compute these quantities on the given Pythia checkpoints on the Pile dataset on which they were trained and quantized using GPTQ (Frantar et al., 2023) to b = 4 bits per parameter, and we evaluate the bounds with failure probability δ = 0.01. We compute Σ with K given by 1000 equally spaced points between [0, 1], excluding the endpoints. We estimate the risk and loss variation on an IID subsample from the collection of token-context pairs in the training dataset of size 10^4. |
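The Hoeffding bound mentioned in the Dataset Splits and Experiment Setup rows controls how far the subsample estimate of the risk can deviate from the full training-set evaluation. A minimal sketch of that deviation term, assuming a bounded per-token loss with a hypothetical range `value_range` (the paper's actual loss-range handling is not specified in the quoted excerpts):

```python
import math

def hoeffding_deviation(n: int, value_range: float, delta: float) -> float:
    """Two-sided Hoeffding bound: with probability at least 1 - delta,
    |subsample mean - full-set mean| <= value_range * sqrt(log(2/delta) / (2n))
    for n IID samples of a quantity bounded in an interval of width value_range."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Settings matching the quoted setup: a 10^4-sized IID subsample of
# token-context pairs, failure probability delta = 0.01.
# value_range = 1.0 is an illustrative assumption, not from the paper.
eps = hoeffding_deviation(n=10**4, value_range=1.0, delta=0.01)
print(f"deviation bound: {eps:.4f}")
```

At n = 10^4 and δ = 0.01 the deviation term is on the order of 10^-2 per unit of loss range, which is why a relatively small subsample suffices to certify the bound's empirical terms.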